Notes on Graph Databases

What happens when you have a bunch of many-to-many relationships in your data?

Imagine building a social network:

users can follow other users
users can like posts
posts can tag other users
users can comment
comments can have replies
etc

If we used a strict relational database, we’d need to perform joins to answer certain questions:

Find users who liked a post that mentioned someone they follow

But with graph databases, the queries look something like this:

MATCH (a:User)-[:FOLLOWS]->(b:User)<-[:MENTIONED_IN]-(p:Post)<-[:LIKED]-(a)
RETURN a

Equal alternative:

MATCH (a:User)-[:FOLLOWS]->(b:User),
      (a)-[:LIKED]->(p:Post),
      (p)-[:MENTIONED_IN]->(b)
RETURN a

How to read this?

Match: keyword indicates start of query
(a:User)-[:FOLLOWS]->(b:User):
- Define variable a that has type User
- Define variable b that has type User
- Specifically find the path such that between a and b, there MUST be an directional edge that has label FOLLOWS
- AKA, this gives us the the follower and the followee of a path
…
Return: Specifically return back the a node, and nothing else

The flexibility comes from the fact that we can connect our nodes with as many edges/relationships as we’d like.

How is Data Stored on Disk?

https://neo4j.com/developer/kb/understanding-data-on-disk/

There’s a node store
There’s a relationship store (edges)
There’s (multiple) property stores (key values stored on nodes / edges)

Likely very similar to how the book describes the underlying data storage.

From the above link:

“Data stored on disk is all linked lists of fixed size records. Properties are stored as a linked list of property records, each holding a key and value and pointing to the next property. Each node and relationship references its first property record. The Nodes also reference the first relationship in its relationship chain. Each Relationship references its start and end node. It also references the previous and next relationship record for the start and end node respectively.”

RDF demo

https://atomgraph.github.io/SPARQL-Playground/

PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

ddia:Kaushal a :Person ;
a :Guy ;
:livesIn :California .

ddia:Gage a :Person ;
a :CoolGuy .

PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

SELECT ?s 
{
    ?s :livesIn :California
}
LIMIT 100

PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

SELECT ?s
WHERE {
  ?s ?p ?o .
  FILTER regex(str(?o), "Guy", "i")
}

How is RDF Data stored on disk?

note, this is just one implementation: https://jena.apache.org/documentation/tdb/architecture.html

A dataset backed by TDB is stored in a single directory in the filing system. A dataset consists of

The node table

Triple and Quad indexes

The prefixes table

Pretty similar to the graph db. It’s important to note no edge table exists here because everything is flattened out into a single type of structure.

What are Quad indexes?

In Triple stores, things are actually quadruplets:

(subject, verb, object) → (graph, subject, verb, object)

if graph is omitted, it is assumed that the query is going to the “default graph”. If graph is specified, the query is run against a “named graph.”

KP's Thoughts

Explorer

Explorer

Recent Notes

Recurse Center -- Day 28

Recurse Center -- Day 27

Recurse Center -- Day 26