The Open Anzo system uses a relational database as its backing store. Two database layouts are used within the system. The first, and simpler, layout is used for basic quad storage, where a quad is made up of a Subject, Predicate, Object, and NamedGraph URI. This layout is used for the client persistence layer in client-server mode, and also when Open Anzo is used purely as a quad store. The second layout is used for the server's storage and functions as a temporal store. Both layouts are built upon a node centric data layout.
Node Centric Data Layout
The basic RDF primitives (URIs, Literals, XML Literals, and BlankNodes) are stored using the node centric data layout. The key to this layout is the way nodes are given their IDs. A set of sequences, one for each node type, is used to generate unique IDs for the values being stored. Each sequence is defined by a 4-bit mask value, which allows the layout to have 16 possible node types. This mask makes up the 4 most significant bits of the 64-bit long values that the sequences provide. The initial and maximum values of each sequence are determined by its mask, so by taking the 4 most significant bits of any ID in the system, you can determine the type of node that the ID represents.
Example: BlankNodes have a mask value of 0. The sequence for BlankNode IDs is defined as a sequence of long values starting at 0 and having a maximum value of 1152921504606846975. URIs have a mask value of 1. The sequence for URI IDs is defined as a sequence of long values starting at 1152921504606846977 and having a maximum value of 2305843009213693951. To determine the type of a given ID, say 1152921504606846978, shift it 60 bits to the right using the unsigned right shift operator >>>; in Java, (1152921504606846978L >>> 60). The result is 1, which means the ID is that of a URI.
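The mask arithmetic in the example above can be sketched in a few lines of Java. The class and method names are hypothetical and are not taken from the Open Anzo code base. The boundaries are derived purely from the mask, so the real sequences may start at a different offset inside a range (the URI sequence above, for instance, starts one above the raw range start).

```java
// Hypothetical helper for illustration; not part of the Open Anzo API.
final class NodeIds {
    // A 64-bit ID minus the 4-bit type mask leaves 60 bits for the sequence value.
    static final int TYPE_SHIFT = 60;

    // Smallest value in the range reserved for a mask (mask in 0..15).
    static long rangeStart(int mask) {
        return ((long) mask) << TYPE_SHIFT;
    }

    // Largest value in the range reserved for a mask.
    static long rangeEnd(int mask) {
        return rangeStart(mask) | ((1L << TYPE_SHIFT) - 1);
    }

    // Recover the 4-bit mask. The unsigned shift >>> avoids sign extension,
    // which matters for masks 8..15, where the long value is negative.
    static int typeOf(long id) {
        return (int) (id >>> TYPE_SHIFT);
    }

    public static void main(String[] args) {
        System.out.println(typeOf(1152921504606846978L)); // an ID in the URI range
        System.out.println(rangeEnd(0));                  // BlankNode maximum
        System.out.println(rangeEnd(1));                  // URI maximum
    }
}
```

Because Java's long is signed, a plain >> would return a negative number for IDs whose top bit is set, which is presumably why the unsigned operator is used.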
The data for the node values is stored across a set of tables, chosen by the type of the value and by its length. One set of tables is used for values whose length is less than a maximum length defined for the database. This maximum length exists so that smaller values can be indexed, which allows for faster queries. For values that exceed the maximum length of an indexable value, a hash of the value is computed and stored in the table alongside the ID and value. Lookups can then use the hash to narrow down the results, and process this smaller set of results to find the desired value.
Example: If your value is shorter than the maximum length of an indexable value, you can use a query like SELECT ID FROM URIS WHERE VALUE='test'. This query returns at most one row, since the values within a table are unique. If the value is longer than the maximum length, you would need a query like SELECT ID FROM URIS_L WHERE HASH='1235'. This query can return zero or more rows, and the system then processes these results locally to find the row whose value matches.
The main reasons for using this node centric layout are reduced storage size and increased performance. The system stores each node value only once in the database and simply reuses its ID when the value appears in multiple statements. It also helps with joins, since queries and joins operate on a set of long values rather than on large VARCHARs.
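The two-step lookup for long values (narrow by hash, then compare locally) can be illustrated with an in-memory stand-in for a table such as URIS_L. This is only a sketch: the class is hypothetical, and String.hashCode() is an assumption standing in for whatever hash the store actually computes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a long-value table (id, hash, value).
final class LongValueTable {
    private static final class Row {
        final long id;
        final String value;
        Row(long id, String value) { this.id = id; this.value = value; }
    }

    // Plays the role of the index on the HASH column.
    private final Map<Integer, List<Row>> rowsByHash = new HashMap<>();

    // Assumption: String.hashCode() stands in for the store's real hash function.
    static int hashOf(String value) {
        return value.hashCode();
    }

    void insert(long id, String value) {
        rowsByHash.computeIfAbsent(hashOf(value), h -> new ArrayList<>())
                  .add(new Row(id, value));
    }

    // Step 1: fetch the candidate rows that share the hash.
    // Step 2: compare full values locally. Returns null when nothing matches.
    Long findId(String value) {
        List<Row> candidates = rowsByHash.get(hashOf(value));
        if (candidates == null) {
            return null;
        }
        for (Row row : candidates) {
            if (row.value.equals(value)) {
                return row.id;
            }
        }
        return null;
    }
}
```

The local comparison matters because different values can share a hash. Under String.hashCode(), for example, "Aa" and "BB" collide, so a hash-only lookup would return both rows.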
Basic Quad Layout
In a basic quad layout, the statements table would be defined like this:
CREATE TABLE STATEMENTS (
NAMEDGRAPHID BIGINT NOT NULL,
SUBJ BIGINT NOT NULL,
PROP BIGINT NOT NULL,
OBJ BIGINT NOT NULL
)
When a statement is added to the table, the IDs for the 4 nodes are used in place of the actual node values. A unique constraint is added to the table to prevent a statement from being in the table more than once.
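As a rough illustration of how value reuse and the unique constraint interact, the sketch below models the node tables as a map from value to ID and the STATEMENTS table as a set of four-ID rows. All names are hypothetical. For simplicity it assumes that every node is a URI and that IDs come from a plain counter starting at the URI sequence start quoted earlier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the basic quad layout; not Open Anzo code.
final class QuadStoreSketch {
    // Node table: each distinct value is stored once and mapped to an ID.
    private final Map<String, Long> idsByValue = new HashMap<>();
    // Statements table: rows of (NAMEDGRAPHID, SUBJ, PROP, OBJ) IDs.
    // A Set mirrors the unique constraint on the four columns.
    private final Set<List<Long>> statements = new HashSet<>();
    // Assumption: all nodes are URIs, numbered from the URI sequence start.
    private long nextId = 1152921504606846977L;

    // Return the existing ID for a value, or assign the next one.
    long idFor(String value) {
        Long existing = idsByValue.get(value);
        if (existing != null) {
            return existing;
        }
        long id = nextId++;
        idsByValue.put(value, id);
        return id;
    }

    // Returns false when the quad is already present, as the unique
    // constraint would reject the duplicate row.
    boolean add(String graph, String subj, String prop, String obj) {
        return statements.add(
            Arrays.asList(idFor(graph), idFor(subj), idFor(prop), idFor(obj)));
    }

    int nodeCount() { return idsByValue.size(); }

    int statementCount() { return statements.size(); }
}
```

Adding two statements that share a graph, a predicate, and both resources stores only four node values in total, and adding the same quad twice leaves the statement count unchanged.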
Server's Temporal Layout
When dealing with data on the server, one of the requirements is for the server to store a full history of changes to the data within the store. This means you must be able to determine the contents of the store at any point in time. To accomplish this, Open Anzo uses a temporal data layout, which provides a mechanism to represent the span of time during which data within the system is valid. Two extra columns are added to the tables, so for a basic statements table, the definition would look like this:
CREATE TABLE STATEMENTS (
NAMEDGRAPHID BIGINT NOT NULL,
SUBJ BIGINT NOT NULL,
PROP BIGINT NOT NULL,
OBJ BIGINT NOT NULL,
HSTART BIGINT NOT NULL,
HEND BIGINT
)
The HSTART, or HistoryStart, column stores the timestamp when the statement was added. The HEND, or HistoryEnd, column stores the timestamp when the statement was deleted. If the HEND column is null, the statement is still valid. When a statement is deleted, instead of removing the row that contains the statement, the database simply updates the row's HEND column to signify that the row is no longer valid.
Example query for live data:
SELECT * FROM STATEMENTS WHERE HEND IS NULL
Example query for data that has been deleted:
SELECT * FROM STATEMENTS WHERE HEND IS NOT NULL
Example query for data that was valid at timestamp 1190121412525 (epoch time):
SELECT * FROM STATEMENTS WHERE HSTART <= 1190121412525 AND (HEND IS NULL OR HEND > 1190121412525)
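The HSTART and HEND conditions used in these queries reduce to simple predicates over one row. The following sketch expresses them in Java, with a nullable Long standing in for the nullable HEND column. The class is hypothetical and is only meant to make the interval logic explicit.

```java
// Hypothetical model of the HSTART/HEND columns of one statement row.
final class TemporalRow {
    final long hstart; // timestamp when the statement was added
    Long hend;         // timestamp when it was deleted; null while still valid

    TemporalRow(long hstart) {
        this.hstart = hstart;
    }

    // Soft delete: the row stays, only HEND is filled in.
    void delete(long timestamp) {
        hend = timestamp;
    }

    // Equivalent of: HEND IS NULL
    boolean isLive() {
        return hend == null;
    }

    // Equivalent of: HSTART <= t AND (HEND IS NULL OR HEND > t)
    boolean wasValidAt(long t) {
        return hstart <= t && (hend == null || hend > t);
    }
}
```

Note that the interval is closed at HSTART and open at HEND: a row counts as valid at the instant it was added, but not at the instant it was deleted.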
Further Topics
- Template Files: A description of the template system used to generate the SQL files used in Open Anzo.
- Quad Layout Tables: A description of the different tables used within the basic quad layout.
- Server Tables: A description of the different tables used within the server layout.


