What happens when you have a bunch of many-to-many relationships in your data?

Imagine building a social network:

  • users can follow other users
  • users can like posts
  • posts can tag other users
  • users can comment
  • comments can have replies
  • etc

If we used a strict relational database, we’d need to perform joins to answer certain questions:

Find users who liked a post that mentioned someone they follow

But with graph databases, the queries look something like this:

MATCH (a:User)-[:FOLLOWS]->(b:User)<-[:MENTIONED_IN]-(p:Post)<-[:LIKED]-(a)
RETURN a

Equal alternative:

MATCH (a:User)-[:FOLLOWS]->(b:User),
      (a)-[:LIKED]->(p:Post),
      (p)-[:MENTIONED_IN]->(b)
RETURN a

How to read this?

  • Match: keyword indicates start of query
  • (a:User)-[:FOLLOWS]->(b:User):
    • Define variable a that has type User
    • Define variable b that has type User
    • Specifically find the path such that between a and b, there MUST be an directional edge that has label FOLLOWS
    • AKA, this gives us the the follower and the followee of a path
  • Return: Specifically return back the a node, and nothing else

The flexibility comes from the fact that we can connect our nodes with as many edges/relationships as we’d like.

How is Data Stored on Disk?

https://neo4j.com/developer/kb/understanding-data-on-disk/

  1. There’s a node store
  2. There’s a relationship store (edges)
  3. There’s (multiple) property stores (key values stored on nodes / edges)

Likely very similar to how the book describes the underlying data storage.

From the above link:

“Data stored on disk is all linked lists of fixed size records. Properties are stored as a linked list of property records, each holding a key and value and pointing to the next property. Each node and relationship references its first property record. The Nodes also reference the first relationship in its relationship chain. Each Relationship references its start and end node. It also references the previous and next relationship record for the start and end node respectively.”

RDF demo

https://atomgraph.github.io/SPARQL-Playground/

PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

ddia:Kaushal a :Person ;
a :Guy ;
:livesIn :California .

ddia:Gage a :Person ;
a :CoolGuy .
PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

SELECT ?s 
{
    ?s :livesIn :California
}
LIMIT 100
PREFIX : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ddia: <http://ddia-reading-group.com#>

SELECT ?s
WHERE {
  ?s ?p ?o .
  FILTER regex(str(?o), "Guy", "i")
}

How is RDF Data stored on disk?

note, this is just one implementation: https://jena.apache.org/documentation/tdb/architecture.html

A dataset backed by TDB is stored in a single directory in the filing system. A dataset consists of

  • The node table
  • Triple and Quad indexes
  • The prefixes table

Pretty similar to the graph db. It’s important to note no edge table exists here because everything is flattened out into a single type of structure.

What are Quad indexes?

In Triple stores, things are actually quadruplets:

(subject, verb, object) (graph, subject, verb, object)

if graph is omitted, it is assumed that the query is going to the “default graph”. If graph is specified, the query is run against a “named graph.”