Chapter 8: Graph Databases

Summary

This chapter introduces graph databases, which model data as nodes and relationships rather than tables or documents, making them uniquely suited for highly connected data such as social networks, knowledge graphs, fraud detection, and recommendation engines. Students learn both the property graph model (Cypher, Neo4j) and the RDF model (SPARQL, Amazon Neptune), core traversal algorithms, and how distributed graph databases such as TigerGraph scale graph processing across a cluster. The chapter concludes with knowledge graphs as a unifying pattern for enterprise data integration.

Concepts Covered

This chapter covers the following 15 concepts from the learning graph:

Property Graph Model
RDF Graph Model
SPARQL
Cypher Query Language
Graph Traversal Algorithm
Depth-First Search
Breadth-First Search
Shortest Path Algorithm
Neo4j
TigerGraph
Amazon Neptune
Distributed Graph Database
Graph Partitioning
GSQL
Knowledge Graph

Prerequisites

This chapter builds on concepts from:

Welcome to the World Where Relationships Are First-Class Citizens

Dex stands at the center of a web of connected nodes Most databases treat relationships as an afterthought — foreign keys bolted onto the side of tables, joins computed at query time from scratch on every request. Graph databases flip this completely: the relationship itself is stored as a first-class data structure with its own properties and its own identity. If your core business question involves "find me everything connected to this thing, and the things connected to those things," graph databases are where you belong. Chapter 8 explains why.

The Relational Model's Relationship Problem

Consider a social network with 100 million users. You want to answer: "Who are the friends of Alice's friends who live in the same city as Alice?" In a relational database, you write a multi-table self-join: friends JOIN friends JOIN users WHERE city = ?. The query planner must execute those joins from scratch on every request, traversing tens of millions of rows per hop. Add a third hop — "friends of friends of friends" — and query time explodes.

The root cause is that relational databases store relationships implicitly, as matching values in two columns (a foreign key and a primary key). Finding the relationship requires computation at query time. As the hop count grows, the computation cost grows exponentially.

Graph databases store relationships explicitly as edges in the graph structure. An edge between Alice and Bob is a stored object, not a computed join. Following Alice's friendships is a traversal — pointer chasing through stored edges — rather than a join — hash-table matching. For densely connected data with many hops, this architectural difference produces orders-of-magnitude performance differences.

The Property Graph Model

The property graph model is the dominant graph data model for application databases. It defines four fundamental elements:

Nodes represent entities — the "things" in your domain. A node in a social graph might represent a Person, a Company, or a City. Each node has one or more labels that describe its type.

Edges (also called relationships) represent connections between nodes. An edge has a mandatory direction (from one node to another), a mandatory type (a single label like KNOWS, WORKS_AT, or LOCATED_IN), and optionally properties. Critically, edges in the property graph model are stored objects with their own identity — not derived from matching keys.

Properties are key-value pairs attached to both nodes and edges. A Person node might have properties name: "Alice", age: 34, and city: "Seattle". A KNOWS edge might have properties since: 2019 and met_at: "PyCon". The ability to attach properties to edges is what distinguishes the property graph model from simpler graph models where edges are bare connections.

Labels categorize nodes and edge types. You can filter queries to all nodes with a given label, or traverse only edges of a specific type.

The following list summarizes the property graph model's key structural rules:

Nodes have zero or more labels (e.g., Person, Employee)
Edges have exactly one type (e.g., KNOWS, MANAGES)
Edges have a direction but can be traversed in either direction in queries
Both nodes and edges can have arbitrary key-value properties
There is no predefined schema — node and edge structures evolve freely

Diagram: Property Graph Explorer

Interactive property graph: clickable nodes and edges with property panels (vis-network)

**sim-id:** property-graph-explorer **Library:** vis-network **Status:** Specified A vis-network canvas (900×500 px) displaying a small social/professional graph with 8 nodes and 12 edges. **Node types and colors:** - Person nodes: `#4682B4` (steel blue), label = person name - Company nodes: `#E65100` (burnt orange), label = company name - City nodes: `#28a745` (green), label = city name **Sample graph:** - Alice (Person) — KNOWS → Bob (Person), since=2019 - Alice (Person) — WORKS_AT → Acme Corp (Company), since=2021, role="Architect" - Alice (Person) — LIVES_IN → Seattle (City) - Bob (Person) — WORKS_AT → Globex (Company), since=2020, role="Engineer" - Bob (Person) — LIVES_IN → Portland (City) - Carol (Person) — KNOWS → Alice (Person), since=2018 - Carol (Person) — LIVES_IN → Seattle (City) - Dave (Person) — KNOWS → Bob (Person), since=2022 - Dave (Person) — WORKS_AT → Acme Corp (Company), since=2022, role="Manager" - Acme Corp — LOCATED_IN → Seattle (City) - Globex — LOCATED_IN → Portland (City) **Interaction:** - Clicking any node opens a properties panel on the right: node label(s), all properties as a key-value list, count of connected edges - Clicking any edge opens a properties panel: edge type, direction arrow, all edge properties - Hovering a node highlights its direct neighbors in yellow - A "Highlight Path" button: clicking two nodes highlights the shortest path between them in gold **Learning objective:** Understanding (Bloom's) — students explore the structure of a property graph by clicking through nodes and edges, internalizing what properties-on-edges enables compared to a simple adjacency list. Responsive: canvas resizes on window resize event via vis-network's `fit()` method.

The RDF Graph Model and SPARQL

Alongside the property graph model, a second graph model serves a different community: the RDF graph model (Resource Description Framework). RDF originates from the World Wide Web Consortium (W3C) and underlies the Semantic Web initiative. Where property graphs are designed for application developers building connected data applications, RDF is designed for data interoperability, ontology-driven reasoning, and linked open data.

In RDF, all data is expressed as triples: a subject, a predicate, and an object. Subject and predicate are both URIs (Uniform Resource Identifiers) — globally unique identifiers that prevent naming conflicts across data sources. The object is either a URI (pointing to another resource) or a literal value (a string, number, or date).

For example:

<https://example.org/person/alice>  <https://schema.org/knows>  <https://example.org/person/bob>
<https://example.org/person/alice>  <https://schema.org/name>   "Alice"
<https://example.org/person/alice>  <https://schema.org/age>    "34"^^xsd:integer

The triple model is maximally flexible — any fact about anything can be expressed as a triple — but it is also verbose and lacks the ability to attach properties directly to edges (a named graph extension partially addresses this). RDF's strength is interoperability: because all identifiers are globally unique URIs, data from two independent RDF datasets can be merged without namespace collisions.

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C standard query language for RDF data. It uses a pattern-matching syntax where query patterns look like RDF triples with variables substituted for unknowns. A query to find all people Alice knows who are older than 30 looks like this:

SELECT ?name ?age
WHERE {
  <https://example.org/person/alice> <https://schema.org/knows> ?person .
  ?person <https://schema.org/name> ?name .
  ?person <https://schema.org/age> ?age .
  FILTER (?age > 30)
}

The variables (?person, ?name, ?age) are bound by pattern matching against the triple store. SPARQL also supports graph pattern composition, aggregation, and federated queries across multiple SPARQL endpoints.

Cypher: ASCII-Art Graph Queries

For property graph databases — Neo4j in particular — the query language of choice is Cypher. Cypher's most distinctive feature is its ASCII-art syntax for expressing graph patterns: nodes appear as parentheses (n), edges as arrows --[r]->, and patterns chain them together naturally.

The following Cypher query finds all people Alice knows who work at the same company as Alice:

MATCH (alice:Person {name: "Alice"})-[:KNOWS]->(friend:Person),
      (alice)-[:WORKS_AT]->(company:Company)<-[:WORKS_AT]-(friend)
RETURN friend.name, company.name

The MATCH clause describes the graph pattern to find. The RETURN clause specifies what to include in the result. The ASCII art makes the query's intent legible to anyone who understands the property graph model — no manual joins, no subqueries, no GROUP BY gymnastics.

Cypher also supports mutation, path functions, and graph algorithms. The SHORTEST PATH function finds the shortest path between two nodes:

MATCH path = shortestPath(
  (alice:Person {name: "Alice"})-[:KNOWS*]-(bob:Person {name: "Bob"})
)
RETURN [n IN nodes(path) | n.name] AS connectionChain

The [:KNOWS*] syntax means "follow KNOWS edges any number of hops" — a variable-depth traversal that would require recursive CTEs in SQL.

Graph Traversal Algorithms: BFS and DFS

Two fundamental algorithms underpin most graph database operations. Both are graph traversal algorithms — systematic procedures for visiting nodes by following edges — but they differ in the order in which they explore the graph.

Depth-First Search (DFS) starts at a source node and follows edges as deep as possible before backtracking. It uses a stack (or recursion) to remember where to return after exhausting a branch. DFS is well-suited for problems that require exploring a complete path before evaluating it — such as cycle detection, topological sorting, and maze solving. In a social graph, DFS would plunge all the way down one friend-of-friend chain before exploring another.

Breadth-First Search (BFS) starts at a source node and visits all immediate neighbors before visiting any neighbor's neighbors. It uses a queue to track the frontier. BFS is the algorithm of choice for shortest path problems on unweighted graphs, because the first time BFS reaches a destination node, it has taken the minimum number of hops. In a social graph, BFS reveals the "degree of separation" between two people efficiently.

Dijkstra's algorithm extends BFS to weighted graphs: it finds the shortest path by cumulative edge weight rather than hop count, using a priority queue instead of a plain queue. In a routing graph where edges carry travel times, Dijkstra finds the fastest route.

Diagram: BFS vs. DFS Traversal Animation

Interactive BFS vs. DFS traversal: same graph, step-by-step comparison of traversal order

**sim-id:** bfs-dfs-traversal **Library:** p5.js **Status:** Specified A p5.js canvas (900×550 px) split into two panels, each showing the same 10-node graph. **Graph structure (fixed layout):** - Root node (node 1) at top center - Two children (nodes 2, 3) at level 1 - Nodes 4, 5 are children of node 2; nodes 6, 7 are children of node 3 - Nodes 8, 9 are children of node 4; node 10 is a child of node 6 - All edges are undirected for traversal purposes **Left panel — "BFS":** - Nodes are numbered by discovery order as BFS proceeds - Current frontier nodes highlighted in amber - Visited nodes shown in green with their BFS visit number - A "Visit Queue" list on the left side shows the FIFO queue state **Right panel — "DFS":** - Same graph, same coloring scheme - A "Visit Stack" list shows the LIFO stack state **Controls (shared):** - "Step" button: advances one node visit in both panels simultaneously - "Auto Play" button: animates at 1 step/second - Speed slider (0.5× to 3×) - "Reset" button: restores all nodes to unvisited state **Below both panels:** A running log shows the visit order sequence for each algorithm: "BFS order: 1, 2, 3, 4, 5, 6, 7..." vs "DFS order: 1, 2, 4, 8, 9, 5, 3..." **Learning objective:** Understanding (Bloom's) — students observe concretely that BFS explores level-by-level (useful for shortest paths) while DFS goes deep-first (useful for exhaustive path search). Responsive: canvas scales to container width on resize.

Neo4j: The Dominant Property Graph Database

Neo4j is the most widely deployed property graph database. Launched in 2007, it pioneered native graph storage — a storage engine specifically designed for connected data — rather than mapping a graph onto a relational or document storage layer.

Key architectural characteristics of Neo4j:

Native graph storage: Each node and relationship is stored as a fixed-size record containing direct pointers to adjacent records. Traversing one hop is a constant-time pointer dereference, not a join computation. This is why Neo4j's traversal performance is consistent regardless of total database size.
Cypher query language: The declarative ASCII-art syntax described above.
ACID transactions: Neo4j provides full ACID guarantees — all writes are transactional and durable.
Graph Data Science library: Built-in implementations of community detection, centrality, similarity, and pathfinding algorithms, executable directly from Cypher or the GDS API.
Neo4j Aura: The managed cloud offering, running on GCP.

Neo4j is a strong fit for knowledge graphs, fraud detection networks, recommendation engines, master data management, and identity access management graphs — any domain where the core queries follow connections rather than filter rows.

TigerGraph: Distributed Graph at Scale

TigerGraph addresses a limitation of Neo4j: graph databases that store the entire graph on a single machine (or a single leader) hit scaling walls when the graph contains billions of nodes and edges. TigerGraph is designed from the ground up for distributed graph processing, partitioning the graph across a cluster of machines and executing traversals in parallel across partitions.

TigerGraph's query language is GSQL (Graph Structured Query Language), a Turing-complete language designed for both operational graph queries and large-scale graph analytics. GSQL supports accumulator variables — data structures that collect results during traversal — which enable efficient computation of graph algorithm outputs such as PageRank, community detection, and triangle counting. A GSQL query to compute single-source shortest paths can process graphs with hundreds of millions of edges in seconds, running natively on distributed hardware.

The primary use case for TigerGraph is real-time deep-link analytics: finding patterns that require following 3, 4, or 5 hops across massive graphs — fraud ring detection, supply chain tracing, anti-money-laundering pattern matching — where single-machine graph databases either run out of memory or take too long to be operationally useful.

Amazon Neptune: Managed Multi-Model Graph

Amazon Neptune is Amazon Web Services' fully managed graph database service. Its distinguishing feature is dual-model support: Neptune stores a single graph that can be queried through either the property graph model (using Gremlin, a graph traversal language, or openCypher) or the RDF model (using SPARQL). Teams that need both a connected data application and an RDF knowledge base can operate both query interfaces against the same data store.

Neptune handles the operational concerns of running a graph database — replication, failover, storage auto-scaling, and read replicas — without requiring teams to manage the infrastructure. The tradeoff is less performance tuning flexibility compared to a self-managed Neo4j or TigerGraph cluster.

Distributed Graph Databases: The Partitioning Challenge

Distributing a graph across multiple machines — a distributed graph database — introduces a challenge that does not exist for distributed key-value or document stores: graph partitioning.

In a key-value store, each key maps to exactly one partition via a hash function. A read or write touches one machine. In a graph, an edge connects two nodes that may reside on different machines. A multi-hop traversal starting at node A (on machine 1) may immediately need to follow an edge to node B (on machine 2), then follow another edge to node C (on machine 3). Each hop becomes a network call.

Graph partitioning — the problem of deciding which nodes go on which machine — is NP-hard in general. The goal is to minimize the number of edges that cross partition boundaries, because cross-partition edges become expensive network traversals. Common heuristics include:

Hash partitioning: Hash the node ID to assign it to a machine. Simple and balanced, but completely ignores graph topology — 50% or more of edges may cross partitions.
Community-aware partitioning: Use a graph community detection algorithm to identify densely connected subgraphs and co-locate them on the same machine. Dramatically reduces cross-partition edges but requires an offline partitioning pass before data loading.
Streaming partitioning: Assign each new node as it arrives, placing it on the machine that already holds the most of its neighbors. Adapts to graph growth without full re-partitioning.

Cross-Partition Traversals Are Expensive

Dex holds up a caution sign When evaluating a distributed graph database, always ask: "What is the cross-partition ratio for my graph?" A graph with tight community structure (social networks, organizational hierarchies) partitions well — 80–90% of edges stay within partitions. A graph with random connectivity (citation networks, biological interaction graphs) may partition poorly — 40–60% of edges cross partitions, and every cross-partition hop becomes a network round trip. Your 5-hop traversal at 1 ms per hop on a single machine may become 250 ms when 4 of those hops cross partition boundaries.

Diagram: Graph Partitioning Challenge Visualizer

Interactive graph partitioning: show cross-partition edges and their cost across three machines

**sim-id:** graph-partitioning-viz **Library:** vis-network **Status:** Specified A vis-network canvas (900×520 px) with the graph divided into three visually distinct partition zones. **Visual layout:** - Three colored background regions (light blue, light orange, light green) labeled "Machine 1", "Machine 2", "Machine 3" - ~20 nodes distributed across the three machines, each node colored by its machine assignment - Edges within a machine: solid thin lines, colored to match the machine - Edges crossing machines: bold red dashed lines labeled with a simulated latency badge ("~10 ms network hop") **Controls:** - "Hash Partitioning" button: re-distributes nodes by random hash, typically producing many red cross-partition edges - "Community-Aware" button: re-distributes nodes by community (dense subgraphs co-located), dramatically reducing red edges - "Simulate 3-hop Traversal" button: starting from a selected node, animates 3 hops and accumulates a total latency counter (each cross-partition hop adds 10 ms, within-partition hop adds 0.1 ms) **Clicking any edge** shows: "Type: within-partition (0.1 ms)" or "Type: cross-partition (~10 ms)" **Learning objective:** Analyzing (Bloom's) — students observe how partitioning strategy directly affects multi-hop traversal latency, and understand why community-aware partitioning matters for graph workloads. Responsive: canvas resizes on window resize.

Knowledge Graphs: Graphs as Enterprise Data Integration

A knowledge graph is a structured representation of entities and the relationships between them, used to integrate heterogeneous data sources across an enterprise. Where a social graph represents people and friendships, and a fraud detection graph represents accounts and transactions, a knowledge graph represents everything — products, customers, suppliers, locations, events, compliance rules, and the relationships that connect them — in a unified, queryable graph.

Knowledge graphs combine property graph or RDF data models with an ontology — a formal vocabulary of entity types, relationship types, and the rules that govern them. The ontology acts as the schema: it defines that a Product must have a category, that a Customer can PURCHASED a Product, and that a Supplier SUPPLIES products in specific categories. With the ontology in place, a knowledge graph can answer questions that would require coordinating queries across six separate operational databases: "Show me all products we sell that are sourced from suppliers located in a region currently under a trade embargo."

Enterprise knowledge graphs are an active area of investment at large technology companies (Google's Knowledge Graph, Microsoft's Graph, LinkedIn's Entity Graph) and at regulated industries (financial services, pharmaceuticals, aerospace) where compliance, data lineage, and master data management require precisely the kind of relationship-first querying that graph databases provide.

Diagram: Enterprise Knowledge Graph Explorer

Interactive enterprise knowledge graph: entity types, relationship types, and multi-hop queries

**sim-id:** knowledge-graph-explorer **Library:** vis-network **Status:** Specified A vis-network canvas (900×560 px) showing a multi-domain enterprise knowledge graph. **Node types and colors (following project color conventions):** - Product: `#4682B4` (entity, blue) - Customer: `#28a745` (green) - Supplier: `#E65100` (orange) - Location: `#6f42c1` (purple) - ComplianceRule: `#6c757d` (gray) **Sample graph (~18 nodes, 25 edges):** - 4 Products, each connected to 1–2 Suppliers and 1 Category (as a property) - 3 Customers, each PURCHASED 1–3 Products - 3 Suppliers, each LOCATED_IN a Location - 2 Locations marked as having active trade restrictions (SUBJECT_TO → ComplianceRule node) - 2 ComplianceRule nodes **Controls:** - "Find At-Risk Products" button: highlights all Products connected (via Supplier → Location → ComplianceRule) to an at-risk region — a 3-hop traversal shown with animated path arrows - "Customer Overlap" button: highlights all Customers who purchased at least one Product in common, coloring the shared-product nodes amber - Dropdown filter: "Show node type" — filters visible nodes to a single type **Clicking any node** shows a full properties panel on the right. **Learning objective:** Evaluating (Bloom's) — students see how a knowledge graph enables multi-domain queries that would require complex cross-system coordination in a siloed data architecture. Responsive: resizes via vis-network `fit()` on window resize.

Key Takeaways

The property graph model stores nodes (entities with labels and properties), edges (typed, directed relationships with their own properties), making relationship traversal a constant-time pointer dereference rather than a computed join.
The RDF model expresses all data as subject-predicate-object triples with globally unique URI identifiers, optimized for data interoperability and ontology-driven reasoning rather than application development.
Cypher expresses graph patterns in an intuitive ASCII-art syntax (()-[:REL]->()); SPARQL pattern-matches against RDF triple stores using triple patterns with variables.
BFS explores the graph level-by-level and naturally finds shortest paths; DFS goes as deep as possible before backtracking and is suited for exhaustive path search and cycle detection.
Neo4j provides native graph storage with constant-time traversal, full ACID transactions, and the Cypher query language — the dominant property graph database.
TigerGraph distributes graph processing across a cluster using GSQL, enabling real-time deep-link analytics on graphs with billions of nodes and edges.
Amazon Neptune is a managed dual-model database supporting both property graph (Gremlin/openCypher) and RDF (SPARQL) interfaces against the same graph.
Graph partitioning — how nodes are distributed across machines in a distributed graph database — is the defining architectural challenge: cross-partition edges become expensive network hops during multi-hop traversal.
Knowledge graphs combine a property graph or RDF store with a formal ontology to create a unified, queryable representation of enterprise entities and their relationships across previously siloed data sources.

Chapter 8 Complete!

Dex celebrates with connected nodes forming a victory arc behind them You now think in graphs. You understand why following a relationship edge is architecturally different from computing a JOIN, why BFS finds shortest paths while DFS finds complete paths, and why distributing a graph across machines is fundamentally harder than distributing a key-value store. Chapter 9 zooms out to the theoretical foundations that govern all distributed databases — CAP theorem, consistency models, and the consensus protocols that make distributed agreement possible.