Chapter 1: Knowledge Graphs and Labeled Property Graphs
Summary
Introduces the Labeled Property Graph model — nodes, edges, labels, properties, Cypher — and explains why LPGs are the dominant enterprise choice over RDF and relational models.
Concepts Covered
This chapter covers the following 25 concepts from the learning graph:
- Knowledge Graph
- Labeled Property Graph
- Graph Database
- Node
- Edge
- Node Label
- Edge Type
- Graph Property
- Graph Schema
- Cypher Query Language
- openCypher Standard
- GQL Standard
- Graph Pattern Matching
- Graph Traversal
- Path Query
- Graph Index
- Graph Algorithm
- RDF Lacks Scalability
- Open World Assumption
- Closed World Assumption
- Graph vs Relational Model
- Graph vs Vector Store
- Graph Serialization Format
- GraphML
- GraphSON
Prerequisites
This chapter assumes only the prerequisites listed in the course description.
Hi! I'm Nexus.
Welcome to Context Graph: How Organizations Use LLMs Cost Effectively! I'm Nexus, a round-bodied spider with a fondness for connections and a deep belief that every edge in a graph exists for a reason. I'll be weaving through this book all the way to the end, but I don't appear randomly — I have exactly six jobs, and you'll learn to recognize which one I'm doing each time you see me:
- Welcome you at the opening of every chapter — that's what I'm doing right now.
- Help you think things through when an idea is the kind that graph builders wrestle with on a whiteboard for a while.
- Give you tips — the practical moves a working context-graph architect would make that don't always make it into documentation.
- Warn you gently about the places where smart projects and smart practitioners get into trouble.
- Encourage you when a concept looks harder than it really is on first contact.
- Celebrate with you at the end of each chapter when you've earned it.
That's it. If I'm not doing one of those six things, I'm not in the chapter. Let's trace the why!
Introduction
Most enterprise data systems are built around tables. A table is a rectangle: rows are instances, columns are attributes, and every attribute holds a scalar value. This model has served industry well for decades, but it encodes a subtle assumption — that the most important thing about data is what it describes, not how it connects to other things. When that assumption breaks down, so does the relational model.
A knowledge graph is a different bet. It represents knowledge as a network of entities and relationships, where the connections between things are just as important as, and often more important than, the properties of the things themselves. Knowledge graphs power the systems that answer questions such as: which supplier affects the most revenue if it goes down, which on-call engineer is closest to this incident's blast radius, and which prior decisions serve as precedents that justify this automated approval. These are not lookup questions. They are traversal questions. They require following chains of relationships through a network of entities.
This chapter establishes the foundation: what a knowledge graph is, how the Labeled Property Graph (LPG) model implements it, what query languages operate on it, and why LPGs have decisively won the enterprise competition over the two main alternatives — RDF triplestores and relational databases. Every subsequent chapter builds on this vocabulary, so it is worth taking the time to internalize it here.
The Core Vocabulary: Nodes and Edges
Every graph consists of exactly two primitive types: nodes and edges.
A node (also called a vertex in graph theory) represents an entity — a thing that exists and has identity. In an enterprise knowledge graph, nodes might represent employees, servers, invoices, suppliers, products, contracts, incidents, or any other object the organization tracks. A node is the anchor point for facts about that entity.
An edge represents a relationship between two nodes. An edge is always directional: it has a source node (the from end) and a target node (the to end). This directionality is not a technicality — it encodes semantic meaning. The edge (Employee)-[:REPORTS_TO]->(Manager) means something fundamentally different from (Manager)-[:REPORTS_TO]->(Employee). Every edge in a production knowledge graph carries a direction, and that direction is a claim about how the world is structured.
Together, nodes and edges form a graph — a mathematical structure studied for nearly three centuries, since Euler's 1736 analysis of the Königsberg bridges. The enterprise knowledge graph is this same structure applied at organizational scale to answer operational questions in real time.
The interactive diagram below shows a small but representative enterprise LPG. Click any node or edge to see its label, type, and properties.
Diagram: Interactive LPG Explorer
Interactive LPG Explorer
Type: graph-model
sim-id: lpg-explorer
Library: vis-network
Status: Specified
Bloom Level: Understand (L2)
Bloom Verb: Identify and describe
Learning Objective: Learners can identify nodes, edges, labels, and properties in a Labeled Property Graph and explain the role of each element.
Purpose: Show a small but realistic enterprise LPG (supply-chain scenario) with clickable nodes and edges that reveal labels, types, and properties in a side panel. The learner should be able to distinguish nodes from edges, labels from types, and properties on nodes from properties on edges.
Sample graph scenario: A simplified supply chain with 10 nodes.
Node types (distinct colors and shapes):
- Supplier (orange ellipse): properties: name, country, risk_tier
    - Examples: "Acme Components" (US, tier-1), "Bolts & More" (MX, tier-2)
- Product (blue rectangle): properties: sku, unit_cost
    - Examples: "Widget-A" ($12.50), "Widget-B" ($7.20)
- Warehouse (teal diamond): properties: location, capacity_units
    - Examples: "Chicago DC" (50,000 units), "Atlanta DC" (32,000 units)
- Order (gray circle): properties: order_id, amount_usd, status
    - Examples: "ORD-2024-001" ($45,000, confirmed)

Edge types (distinct arrow styles):
- SUPPLIES (solid orange arrow): properties: lead_time_days. Direction: Supplier → Product
- STOCKS (dashed teal arrow): properties: quantity_on_hand. Direction: Warehouse → Product
- FULFILLS (solid blue arrow): properties: fulfillment_date. Direction: Warehouse → Order

Interactive behavior:
- Hover any node: tooltip showing all properties (key: value pairs)
- Click a node: highlight all directly connected nodes and edges; right-side detail panel shows label, type, and full property list
- Click an edge: highlight endpoint nodes; detail panel shows edge type, direction (from → to), and edge properties
- Legend (top-right): node shapes and colors mapped to labels
- Zoom via mouse wheel; pan by dragging background
- "Reset View" button returns to default layout
Layout: Force-directed, stabilized on load. Canvas fully responsive to window width; 520px height. Detail panel: 200px fixed right column, shows selected element's data. Color palette: Indigo (#3949ab) and orange (#f57c00) consistent with book theme.
Labels and Types: Giving the Graph Meaning
The LPG model adds two organizing concepts on top of raw nodes and edges: node labels and edge types.
A node label is a category tag attached to a node. In the supply chain example, Supplier, Product, Warehouse, and Order are node labels. Labels serve two functions simultaneously. First, they communicate the semantic role of the node — a node labeled Supplier participates in different relationships than one labeled Order. Second, they enable efficient filtering: a query that only needs suppliers tells the database to scan only nodes with the Supplier label, not every node in the graph. A single node may carry multiple labels — an entity might be both an Employee and a Contractor — making the LPG model naturally accommodating for entities that exist in multiple categories.
An edge type is the analogous tag on a relationship. SUPPLIES, STOCKS, and FULFILLS are edge types. Edge types encode the verb — the nature of the relationship. A graph with richly typed edges reads almost like prose: "Acme Components SUPPLIES Widget-A; Chicago DC STOCKS Widget-A; Chicago DC FULFILLS ORD-2024-001." This readability is deliberate. It makes the graph self-documenting and reduces the cognitive overhead of schema interpretation when engineers are debugging a query or reviewing a decision trace.
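As a rough illustration of the two roles that labels play, the sketch below builds a label index over a handful of nodes and renders typed edges as prose. The node identifiers, the `label_index` structure, and the `as_prose` helper are all hypothetical names chosen for this example; a real graph database maintains its own internal label structures.

```python
from collections import defaultdict

# Hypothetical in-memory nodes: id -> set of labels (a node may carry several).
node_labels = {
    "acme": {"Supplier"},
    "widget_a": {"Product"},
    "chicago_dc": {"Warehouse"},
    "pat": {"Employee", "Contractor"},  # one node, two labels
}

# Label index: label -> node ids, so a label-scoped query skips unrelated nodes.
label_index = defaultdict(set)
for node_id, labels in node_labels.items():
    for label in labels:
        label_index[label].add(node_id)

# Typed, directed edges as (source, EDGE_TYPE, target) triples.
edges = [
    ("acme", "SUPPLIES", "widget_a"),
    ("chicago_dc", "STOCKS", "widget_a"),
]

def as_prose(edge_list):
    """Render typed edges so they read like the sentences in the text."""
    return "; ".join(f"{src} {etype} {dst}" for src, etype, dst in edge_list)

suppliers = sorted(label_index["Supplier"])
sentence = as_prose(edges)
```

Looking up `label_index["Supplier"]` touches only supplier nodes, which is the filtering benefit described above, and `sentence` shows how edge types act as verbs.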
Properties: Data Lives on the Graph
The third pillar of the LPG model is graph properties. A property is a key-value pair attached to a node or an edge. Node properties describe the entity: a Supplier node might carry {name: "Acme Components", country: "US", risk_tier: "tier-1"}. Edge properties describe the relationship itself: a SUPPLIES edge might carry {lead_time_days: 14}.
This is where the LPG model diverges most sharply from earlier graph models. In a plain graph (just nodes and edges with no properties), all data about an entity must live in attached lookup tables or separate structures. In an LPG, properties are first-class inhabitants of the graph structure itself. This matters for traversal performance: when a query follows an edge and needs the edge's lead_time_days property, the database can return it directly from the edge object without issuing a secondary lookup.
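A minimal sketch of properties living on both nodes and edges might look like the following. The `Node` and `Edge` classes and the `supplied_products` function are assumptions made for illustration, not the storage layout of any particular database; the property values come from the supply chain example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # edges leaving this node

@dataclass
class Edge:
    edge_type: str
    target: Node
    properties: dict = field(default_factory=dict)

acme = Node("Supplier", {"name": "Acme Components", "country": "US", "risk_tier": "tier-1"})
widget_a = Node("Product", {"sku": "Widget-A", "unit_cost": 12.50})
acme.out_edges.append(Edge("SUPPLIES", widget_a, {"lead_time_days": 14}))

def supplied_products(supplier):
    """Follow SUPPLIES edges and read the edge property in the same step."""
    return [
        (edge.target.properties["sku"], edge.properties["lead_time_days"])
        for edge in supplier.out_edges
        if edge.edge_type == "SUPPLIES"
    ]

result = supplied_products(acme)
```

The edge object carries `lead_time_days` itself, so the traversal reads it while crossing the edge instead of consulting a separate structure.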
The table below maps LPG elements to their closest relational analogues. The analogy is imperfect but provides a useful orientation point for practitioners coming from a SQL background.
| LPG Element | Description | Closest Relational Analogue |
|---|---|---|
| Node | An entity with an identity | A row in a table |
| Node Label | The entity's type category | The table name |
| Edge | A named, directed relationship | A foreign key constraint |
| Edge Type | The semantic role of the relationship | The join table name |
| Node Property | An attribute of the entity | A column value in the row |
| Edge Property | An attribute of the relationship | A column in the junction table |
Key Insight: Properties on Edges
In a relational schema, a many-to-many relationship with attributes requires an explicit junction table — for example, order_line_items with quantity and unit_price columns. In the LPG model, those attributes live directly on the edge. This sounds like a minor convenience, but it changes the traversal: edge properties are retrieved in the same step that crosses the edge, not in a separate join. For context graphs that frequently query decision traces — which are chains of attributed relationships — this is architecturally significant. Every node tells a story; so does every edge.
Graph Schema: Flexible Yet Constrained
A graph schema defines the allowed node labels, edge types, and property keys in a graph, along with any constraints on their values. Graph databases vary widely in how strictly they enforce schemas.
Schema-optional databases (sometimes called schema-free or schema-agnostic) allow any node to carry any label and any property without prior declaration. This flexibility accelerates early development and accommodates heterogeneous data sources — a common enterprise reality where incoming data from a CRM, an ERP, and a logging system may have inconsistent property vocabularies. The cost is that constraint enforcement falls to the application layer, which in practice means violations accumulate silently until they surface as query anomalies.
Schema-full databases, and the property-graph type system formalized in the GQL standard, allow teams to declare that every Supplier node must have a name string property and a risk_tier enumeration drawn from an allowed set of values. Violations are rejected at write time, not discovered during an incident. For production enterprise context graphs where decision traces must be auditable, schema constraints are not optional — they are the mechanism by which the graph earns its role as a system of record.
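To make write-time enforcement concrete, here is a small sketch of the Supplier constraint described above, implemented in application code. The allowed tier values, the `SchemaViolation` exception, and the `create_supplier` function are hypothetical; a schema-full database would express the same rule declaratively rather than in Python.

```python
ALLOWED_RISK_TIERS = {"tier-1", "tier-2", "tier-3"}  # assumed value set

class SchemaViolation(ValueError):
    """Raised when a write does not satisfy the declared schema."""

def validate_supplier(properties):
    name = properties.get("name")
    if not isinstance(name, str) or not name:
        raise SchemaViolation("Supplier.name must be a non-empty string")
    if properties.get("risk_tier") not in ALLOWED_RISK_TIERS:
        raise SchemaViolation("Supplier.risk_tier must be one of the allowed values")
    return properties

graph_nodes = []

def create_supplier(properties):
    # Validation runs before the write, so bad data never enters the graph.
    graph_nodes.append(("Supplier", validate_supplier(properties)))

create_supplier({"name": "Acme Components", "risk_tier": "tier-1"})
try:
    create_supplier({"name": "Bolts & More", "risk_tier": "unknown"})
    rejected = False
except SchemaViolation:
    rejected = True
```

The second write is rejected immediately, which is the behavior the text contrasts with violations that accumulate silently in a schema-optional store.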
Graph Databases: Native Storage Architecture
A graph database is a database management system designed around graph data as a first-class storage primitive. The term is often misused to describe any system that can query graph-shaped data, including relational databases with recursive SQL, document stores with reference fields, and search engines with relationship metadata. These are not graph databases in the architectural sense.
A true graph database uses native graph storage, which means relationships (edges) are stored as physical pointers to their endpoint nodes rather than as data values that require index lookups at query time. This architecture enables index-free adjacency: traversing from one node to its neighbors is a pointer dereference, not a B-tree lookup. The cost of following one hop is independent of the total size of the graph — whether the graph contains ten thousand nodes or ten billion, following a single edge takes the same time.
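The idea of index-free adjacency can be approximated in a few lines: each node holds direct references to its neighbors, so a hop never consults a global lookup structure. This is only an analogy, since Python object references stand in for the physical pointers of a native store, and the `Node` class and `k_hop` function are hypothetical names.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct object references to adjacent nodes

def k_hop(start, hops):
    """Names of nodes reached after exactly `hops` steps, following references only."""
    frontier = [start]
    for _ in range(hops):
        # Each step touches only the current nodes' own neighbor lists.
        frontier = [nbr for node in frontier for nbr in node.neighbors]
    return [node.name for node in frontier]

# A small REPORTS_TO chain: Eng-A -> Eng-Lead -> VP-Engineering.
eng_a, eng_lead, vp_eng = Node("Eng-A"), Node("Eng-Lead"), Node("VP-Engineering")
eng_a.neighbors.append(eng_lead)
eng_lead.neighbors.append(vp_eng)
```

Calling `k_hop(eng_a, 2)` does the same amount of work whether three nodes exist or three billion, because no step depends on the total node count.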
This property has dramatic consequences for multi-hop queries, which traverse two, three, or more relationships in sequence. In a relational database, each additional hop requires an additional JOIN, and each JOIN's cost grows with the size of the tables being joined. In a natively stored graph, each hop is a constant-time pointer dereference, so following a single five-hop path costs roughly five times as much as following one hop. Total work depends on how many edges the traversal actually touches, not on the size of the whole graph. The performance chart below quantifies the divergence.
Diagram: Multi-Hop Query Performance
Multi-Hop Query Performance: Native Graph vs Relational
Type: chart
sim-id: multi-hop-performance
Library: Chart.js
Status: Specified
Bloom Level: Analyze (L4)
Bloom Verb: Compare and contrast
Learning Objective: Learners can explain why native graph traversal performance grows linearly with hop count while relational JOIN performance degrades super-linearly, and identify the crossover point where graph performance becomes practically necessary.
Chart type: Grouped bar chart
X-axis label: "Number of Hops"
X-axis values: 1, 2, 3, 4, 5
Y-axis label: "Query Response Time (ms)"
Y-axis: Logarithmic scale, range 1ms to 1,000,000ms

Data series:
- Relational Database (orange bars):
    - 1 hop: 8ms
    - 2 hops: 90ms
    - 3 hops: 1,400ms
    - 4 hops: 28,000ms
    - 5 hops: 650,000ms
- Native Graph Database (indigo bars):
    - 1 hop: 4ms
    - 2 hops: 7ms
    - 3 hops: 11ms
    - 4 hops: 15ms
    - 5 hops: 19ms

Interactive behavior:
- Hover any bar: tooltip shows exact milliseconds formatted for readability (e.g., "28,000 ms ≈ 28 seconds")
- Click a hop's bar pair: highlight both bars at that hop count; display a callout showing the ratio (e.g., "At 4 hops: relational is 1,867× slower")
- Toggle button below chart: switch Y-axis between logarithmic and linear scales. On linear scale the relational 5-hop bar visually dominates the entire chart area, making the practical consequence visceral.
- Annotation on relational 5-hop bar: "~11 minutes, effectively unusable for real-time queries"
- Annotation on graph series line: "~4ms per additional hop, near-linear growth"
Title: "Multi-Hop Query Performance: Native Graph vs Relational" Legend: Top-right
Color palette: Indigo (#3949ab) for graph, orange (#f57c00) for relational. Canvas: Responsive width, 480px height.
Not All 'Graph Databases' Are Architecturally Equal
Several relational and document databases have added graph query layers or recursive-query extensions and now market themselves as "graph-capable." These systems do not use native graph storage — they translate graph traversals into table scans or B-tree lookups under the hood. For small graphs or shallow traversals, the distinction may not matter. For production context graphs with millions of decision traces and multi-hop audit queries, it matters enormously. When evaluating a graph platform, ask specifically: "Does the engine use index-free adjacency for edge traversal, or does it resolve relationships through index lookups?" That answer tells you whether you have a graph database or a relational database with graph syntax.
Cypher: The Language of Graph Queries
To query a Labeled Property Graph, you need a language that can express graph patterns — not just row filters. Cypher is that language. Originally developed for Neo4j and later opened as a community standard, Cypher uses ASCII-art notation to represent graph patterns directly in the query text.
Two constructs are fundamental. A node is written in parentheses: (n:Supplier) matches any node with the Supplier label and binds it to the variable n. A relationship is written in brackets flanked by dashes: -[:SUPPLIES]-> matches any outgoing SUPPLIES edge. Chaining these together produces a path pattern: (s:Supplier)-[:SUPPLIES]->(p:Product) matches any supplier-product pair connected by a SUPPLIES relationship.
A complete Cypher query follows the MATCH … WHERE … RETURN structure:
MATCH (s:Supplier)-[:SUPPLIES]->(p:Product)<-[:STOCKS]-(w:Warehouse)
WHERE s.risk_tier = 'tier-1'
AND w.location STARTS WITH 'Chicago'
RETURN s.name AS supplier, p.sku AS product, w.location AS warehouse
ORDER BY supplier
This query traverses two hops (from supplier to product, then from product to warehouse) in a single readable pattern. There is no JOIN clause and no subquery. The pattern declares the shape of the result, and the engine's query planner decides where to start matching and in which order to expand the pattern.
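For readers who want to see what the pattern asks for, the following sketch evaluates the same supplier-product-warehouse shape over a tiny in-memory graph. The node values echo the supply chain examples, but the specific edge layout, the dictionary format, and the `match_rows` function are assumptions made for this illustration; a real engine plans and executes the match very differently.

```python
# Hypothetical data: node id -> label and properties.
nodes = {
    "s1": {"label": "Supplier", "name": "Acme Components", "risk_tier": "tier-1"},
    "s2": {"label": "Supplier", "name": "Bolts & More", "risk_tier": "tier-2"},
    "p1": {"label": "Product", "sku": "Widget-A"},
    "p2": {"label": "Product", "sku": "Widget-B"},
    "w1": {"label": "Warehouse", "location": "Chicago DC"},
    "w2": {"label": "Warehouse", "location": "Atlanta DC"},
}
# Directed, typed edges as (source, EDGE_TYPE, target).
edges = [
    ("s1", "SUPPLIES", "p1"),
    ("s2", "SUPPLIES", "p2"),
    ("w1", "STOCKS", "p1"),
    ("w2", "STOCKS", "p1"),
    ("w2", "STOCKS", "p2"),
]

def match_rows():
    """Emulate (s:Supplier)-[:SUPPLIES]->(p:Product)<-[:STOCKS]-(w:Warehouse)."""
    rows = []
    for s, supplies, p in edges:
        if supplies != "SUPPLIES":
            continue
        if nodes[s]["label"] != "Supplier" or nodes[p]["label"] != "Product":
            continue
        if nodes[s]["risk_tier"] != "tier-1":  # WHERE s.risk_tier = 'tier-1'
            continue
        for w, stocks, stocked in edges:
            if stocks != "STOCKS" or stocked != p:
                continue
            if nodes[w]["label"] == "Warehouse" and nodes[w]["location"].startswith("Chicago"):
                rows.append((nodes[s]["name"], nodes[p]["sku"], nodes[w]["location"]))
    return sorted(rows)  # stands in for ORDER BY supplier

rows = match_rows()
```

The nested loops here are exactly the bookkeeping that the Cypher pattern lets the author skip.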
The openCypher Standard is an open specification of the Cypher language maintained by the openCypher project, decoupling the language from any single vendor's implementation. GQL (Graph Query Language) is the ISO/IEC international standard ratified in 2024, which incorporates the core of Cypher and adds richer compositional and procedural constructs. GQL occupies the same role for graph databases that SQL occupies for relational databases: a vendor-neutral standard that allows queries to be written once and executed on any conformant system.
Graph Pattern Matching, Traversal, and Paths
Three related but distinct concepts govern how queries navigate graph structure.
Graph pattern matching is the process of finding all subgraphs within a larger graph that match a specified template. The Cypher MATCH clause is a pattern matching operation. The pattern (a)-[:REPORTS_TO*1..4]->(b) instructs the database to find all pairs (a, b) connected by one to four REPORTS_TO edges — a variable-length path pattern that would require a recursive common table expression (CTE) in SQL and significant query-plan tuning effort.
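A bounded variable-length pattern such as `*1..4` can be sketched as a loop that expands a frontier one hop at a time and records matches once the minimum hop count is reached. The adjacency dictionary and the `var_length` function below are hypothetical, and the sketch assumes an acyclic REPORTS_TO hierarchy; it does not reproduce Cypher's rules about not reusing a relationship within a path.

```python
# REPORTS_TO edges as adjacency lists: employee -> managers.
reports_to = {
    "E2": ["E1"], "E3": ["E1"],
    "E4": ["E2"], "E5": ["E4"], "E6": ["E4"],
    "E7": ["E3"], "E8": ["E7"],
}

def var_length(adj, start, min_hops, max_hops):
    """Return (node, hops) pairs reachable in min_hops..max_hops steps from start."""
    results = []
    frontier = [start]
    for hops in range(1, max_hops + 1):
        frontier = [nbr for node in frontier for nbr in adj.get(node, [])]
        if hops >= min_hops:
            results.extend((nbr, hops) for nbr in frontier)
    return results

chain = var_length(reports_to, "E5", 1, 4)
```

Here `chain` lists every manager above E5 together with the distance at which each was found.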
Graph traversal is the lower-level process of systematically visiting nodes by following edges from a starting point. Two classic traversal algorithms underlie most graph query engines. Depth-first search (DFS) follows one path as far as possible before backtracking. Breadth-first search (BFS) explores all immediate neighbors before going deeper. For enterprise context graphs, BFS is often preferred for proximity queries — finding all decisions made within two hops of a given entity — because it naturally surfaces the closest results first.
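The difference between the two visit orders is easiest to see on a small management tree. The sketch below is a textbook version of each algorithm, assuming edges point from manager to direct report and that neighbors are explored in the order listed; the `adj` dictionary and function names are illustrative only.

```python
from collections import deque

# Manager -> direct reports (hypothetical orientation for this example).
adj = {
    "E1": ["E2", "E3"],
    "E2": ["E4"],
    "E3": ["E7"],
    "E4": ["E5", "E6"],
    "E7": ["E8"],
}

def bfs(graph, start):
    """Breadth-first: a FIFO queue visits all nodes at distance d before d + 1."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

def dfs(graph, start):
    """Depth-first: a LIFO stack follows one branch to its end before backtracking."""
    visited, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push in reverse so the first-listed neighbor is explored first.
        stack.extend(reversed(graph.get(node, [])))
    return order

bfs_order = bfs(adj, "E1")
dfs_order = dfs(adj, "E1")
```

BFS reaches both vice presidents before any engineer, which is why it suits proximity queries; DFS finishes the whole engineering branch before it looks at sales.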
A path query is a query whose answer is not just a set of matching nodes but the sequence of nodes and edges connecting them — the path itself. Path queries are essential for audit and explainability use cases: when a compliance officer asks "show me the chain of decisions that authorized this payment," the answer is a path, not a set of matching nodes.
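One way to picture a path query is a breadth-first search that remembers how it reached each node and then reconstructs the chain. The decision-trace edges, the edge type names, and the `find_path` function below are invented for illustration; they show the shape of a path result, not any product's API.

```python
from collections import deque

# Hypothetical decision trace as (source, EDGE_TYPE, target).
edges = [
    ("request-17", "APPROVED_BY", "manager-review"),
    ("manager-review", "APPROVED_BY", "finance-review"),
    ("finance-review", "AUTHORIZES", "payment-42"),
    ("request-17", "REFERENCES", "policy-9"),
]

def find_path(edge_list, start, goal):
    """Return [node, EDGE_TYPE, node, ...] from start to goal, or None."""
    out = {}
    for src, etype, dst in edge_list:
        out.setdefault(src, []).append((etype, dst))
    parent = {start: None}  # node -> (previous node, edge type used)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for etype, dst in out.get(node, []):
            if dst not in parent:
                parent[dst] = (node, etype)
                queue.append(dst)
    if goal not in parent:
        return None
    path = [goal]
    while parent[path[0]] is not None:
        prev, etype = parent[path[0]]
        path[:0] = [prev, etype]  # prepend the previous node and edge type
    return path

trace = find_path(edges, "request-17", "payment-42")
```

The returned list alternates nodes and edge types, which is the "chain of decisions" a compliance reviewer would read.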
Traversal Looks Harder Than It Is
If you're coming from SQL, variable-length path queries feel strange at first — there is no direct analogue, and recursive CTEs are notoriously awkward. In Cypher, variable-length paths are first-class syntax: [:APPROVES*1..5] means "follow one to five APPROVES edges." The engine handles the traversal logic. Your job is to express the pattern you want, not the traversal strategy that finds it. Every graph builder reports the same experience after the first week: it clicks, and then SQL starts to feel like the awkward one. You've got this.
The MicroSim below lets you step through BFS and DFS traversals on a small enterprise graph and observe the visit ordering at each step.
Diagram: Graph Traversal Explorer
Graph Traversal Explorer MicroSim
Type: microsim
sim-id: graph-traversal-explorer
Library: p5.js
Status: Specified
Bloom Level: Understand (L2)
Bloom Verb: Explain and trace
Learning Objective: Learners can trace the step-by-step execution of BFS and DFS traversal on a sample enterprise graph and explain why the two algorithms produce different node visit orderings.
Instructional Rationale: Step-through with worked examples is appropriate because the Understand/explain objective requires learners to observe concrete visit sequences with real node labels. Continuous animation would prevent prediction and obscure the actual decision logic at each step. Each step should give the learner a moment to predict the next node before the simulation reveals it.
Canvas layout:
- Main area (left ~68%): 12-node enterprise graph showing Employee, Department, and Project entities
- Right panel (~32%): Step log showing nodes visited in order, plus current algorithm state (stack contents for DFS, queue contents for BFS)

Graph structure (fixed, readable, enterprise scenario):

Nodes:
- E1: CEO (Employee), E2: VP-Engineering (Employee), E3: VP-Sales (Employee)
- E4: Eng-Lead (Employee), E5: Eng-A (Employee), E6: Eng-B (Employee)
- E7: Sales-Lead (Employee), E8: Sales-A (Employee)
- D1: Engineering (Department), D2: Sales (Department), D3: Executive (Department)
- P1: Project-Alpha (Project)

Edges:
- E2 REPORTS_TO E1, E3 REPORTS_TO E1
- E4 REPORTS_TO E2, E5 REPORTS_TO E4, E6 REPORTS_TO E4
- E7 REPORTS_TO E3, E8 REPORTS_TO E7
- E2 BELONGS_TO D1, E4 BELONGS_TO D1, E5 BELONGS_TO D1, E6 BELONGS_TO D1
- E3 BELONGS_TO D2, E7 BELONGS_TO D2, E8 BELONGS_TO D2
- E1 BELONGS_TO D3
- E4 OWNS P1

Interactive controls:
- Dropdown: Algorithm (BFS or DFS)
- Dropdown: Start Node (any of the 12 nodes, labeled by name)
- Button: "Next Step" (advance one traversal step)
- Button: "Run All" (complete traversal with 500ms delay between steps)
- Button: "Reset" (return to initial unvisited state)

Data visibility requirements:
- Stage 0: All nodes gray (unvisited). Right panel: Queue/Stack shows [start node name].
- Stage 1: Start node highlighted yellow. Right panel: "Visiting: [label]"; queue/stack updated after dequeue/pop.
- Stage N: Each new step: current node turns yellow → indigo (visited) after step completes. Newly discovered unvisited neighbors shown in orange (in-queue/stack). Right panel shows "Visiting: [node name]" and appends to the ordered visit list.
- Final: All reachable nodes indigo. Panel shows complete visit order as a numbered list of node labels.

Node visual states:
- Unvisited: gray fill, dark border
- In queue/stack: orange fill
- Currently visiting: yellow fill with subtle pulse border
- Visited: indigo fill, white label text
Hover behavior: Hovering any node shows a tooltip with label (name and type) and list of neighbor names. Edge hover: Shows edge type (REPORTS_TO / BELONGS_TO / OWNS).
Canvas: Responsive width, 520px height. Right panel fixed 200px.
Graph indexes are auxiliary data structures that accelerate graph queries without modifying the graph's topology. A node property index on the name property of Employee nodes allows the query engine to locate a specific employee in \(O(\log n)\) time rather than scanning every employee node. Indexes are especially important for the entry point of a traversal — the query must first locate the starting node efficiently, after which index-free adjacency takes over.
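The logarithmic lookup cost can be illustrated with a sorted key list and binary search, which here stands in for the B-tree-style structure a database would maintain. The synthetic employee records and the `scan_lookup` and `index_lookup` functions are hypothetical.

```python
import bisect

# Synthetic Employee nodes; names are zero-padded so they sort predictably.
employees = [{"name": f"employee-{i:05d}", "id": i} for i in range(10_000)]

def scan_lookup(name):
    """No index: compare against every Employee node, O(n)."""
    for emp in employees:
        if emp["name"] == name:
            return emp
    return None

# Index on Employee.name: sorted (key, position) pairs, searched in O(log n).
index = sorted((emp["name"], pos) for pos, emp in enumerate(employees))
keys = [key for key, _ in index]

def index_lookup(name):
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return employees[index[i][1]]
    return None

start_node = index_lookup("employee-04242")
```

Both functions return the same node; the index simply finds the traversal's entry point in about 14 comparisons instead of thousands.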
Graph algorithms are higher-level computations on graph topology: PageRank measures node influence by iterating over the link structure, community detection identifies clusters of densely interconnected nodes, and shortest-path algorithms find the minimum-cost route between two nodes. These algorithms produce results that no single node's properties can yield alone — they emerge from the structure of the graph itself.
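As one concrete case, a simplified PageRank can be written as repeated redistribution of rank along outgoing edges. This is a teaching sketch, not a production implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution from nodes with no outgoing edges, and the `pagerank` function name are all assumptions of this example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along edges; returns node -> score."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Node with no outgoing edges: spread its rank over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[n] / len(nodes)
        rank = new_rank
    return rank

# REPORTS_TO edges: influence flows toward whoever is reported to.
links = {"E2": ["E1"], "E3": ["E1"], "E4": ["E2"], "E5": ["E4"], "E6": ["E4"]}
ranks = pagerank(links)
most_influential = max(ranks, key=ranks.get)
```

No property on E1 says it is important; the score emerges only from how the other nodes link to it.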
Why LPGs Win: Comparisons with RDF, Relational, and Vector Stores
Understanding LPGs' enterprise dominance requires understanding what they replaced and what they coexist with.
The RDF Alternative and Its Limitations
Resource Description Framework (RDF) is a W3C standard for representing knowledge as triples: subject-predicate-object. An RDF statement might read <acme_components> <supplies> <widget_a>. RDF offers two theoretical advantages: formal semantics grounded in first-order logic, and the open world assumption — the principle that the absence of a fact from the database does not mean the fact is false, only that it is unknown.
The open world assumption makes RDF well-suited for knowledge interchange between autonomous agents that may have incomplete views of the world. Two RDF stores can be merged without conflict because neither claims to have the complete picture. This is philosophically coherent and practically useful in certain research and linked-data contexts.
Enterprise operations, however, require the opposite: the closed world assumption, which holds that what is not recorded is not true. When a compliance system queries whether a specific approval was granted, a "not found" result must mean "not approved," not "possibly approved somewhere we don't know about." This distinction is not academic — it determines whether automated decision systems can safely rely on negative query results. LPGs operate under the closed world assumption by default, aligning them with the semantics enterprise teams actually need.
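The two assumptions differ only in how a missing fact is interpreted, which a few lines can make explicit. The `approvals` set and both functions are hypothetical; returning `None` to mean "unknown" is simply a convention chosen for this sketch.

```python
# Recorded facts: (payment, approver) pairs present in the store.
approvals = {("payment-42", "cfo")}

def approved_closed_world(payment, approver):
    """Closed world: anything not recorded is treated as false."""
    return (payment, approver) in approvals

def approved_open_world(payment, approver):
    """Open world: absence means unknown (None), not false."""
    return True if (payment, approver) in approvals else None

closed_answer = approved_closed_world("payment-99", "cfo")
open_answer = approved_open_world("payment-99", "cfo")
```

An automated control can act on `False`; it cannot safely act on `None`, which is the operational point made above.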
RDF lacks scalability for the multi-hop query patterns enterprises run in production. Plain RDF triples have no properties, so the triple store data model requires every relationship attribute to be reified, that is, turned into its own set of triples. Encoding {lead_time_days: 14} on a supplies relationship requires four or more triples and a blank node, producing verbose graphs that are slow to store, slow to query, and difficult to maintain at scale. SPARQL, the RDF query language, gained property paths in version 1.1 for arbitrary-length traversal, but a property path cannot return the path it matched, and SPARQL has no native edge-property access comparable to Cypher's. Real-world enterprise RDF graphs have hit practical ceilings at tens of millions of triples; commercial LPG databases routinely operate at tens of billions of edges.
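The reification overhead can be counted directly. The sketch below builds the triples for one attributed relationship using the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) and compares that with a single LPG edge. The `reify` function, the blank node name `_:stmt1`, and the tuple representation are illustrative assumptions.

```python
def reify(subject, predicate, obj, attributes, statement_id="_:stmt1"):
    """Describe one statement with a blank node so attributes can attach to it."""
    triples = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Each relationship attribute becomes one more triple on the blank node.
    triples += [(statement_id, key, value) for key, value in attributes.items()]
    return triples

rdf_triples = reify("acme_components", "supplies", "widget_a", {"lead_time_days": 14})

# The same fact as one LPG edge: (source, TYPE, target, properties).
lpg_edge = ("acme_components", "SUPPLIES", "widget_a", {"lead_time_days": 14})
```

Five triples and a blank node encode what the LPG holds as one edge with one property, and every additional attributed relationship repeats that overhead.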
Graph vs. Relational Model
The multi-hop performance gap is quantified in the chart above. The deeper architectural difference is that relational databases treat relationships as data: foreign key values resolved through B-tree index lookups at query time. Every additional hop in a query adds a JOIN, and even with an index each JOIN costs at least a B-tree lookup, whose time grows with the logarithm of the table size. For the vast majority of enterprise reporting queries (aggregations, point lookups, single-hop joins) this cost is acceptable. For the multi-hop traversal queries that context graphs exist to answer, it is not.
Graph vs. Vector Store
A vector store (also called a vector database) indexes high-dimensional embedding vectors and answers nearest-neighbor queries: given a query vector, find the \(K\) stored vectors most similar to it. Vector stores excel at semantic similarity — finding documents, decisions, or entities that are conceptually related to a query even when they share no exact keywords. They are the backbone of retrieval-augmented generation (RAG) systems.
The critical distinction from a knowledge graph is that vector similarity is statistical, not structural. A vector store can tell you that two decisions are semantically similar; it cannot tell you that one decision directly caused another, that one entity is the legal owner of another, or that a particular approval chain is required for a given transaction class. Knowledge graphs encode structural relationships with explicit semantics. Context graphs specifically encode the causal and procedural chains — the why — that vector similarity cannot reconstruct. Both technologies belong in a mature enterprise AI architecture; they answer different questions and are most powerful in combination.
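The contrast between statistical and structural answers can be shown side by side. The three-dimensional embeddings, the decision names, and the `caused_by` edge set below are made up for illustration; real embeddings have hundreds or thousands of dimensions and real vector stores use approximate search rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Exact nearest neighbors: rank every stored vector by similarity."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]

# Hypothetical toy embeddings.
embeddings = {
    "decision-A": [0.9, 0.1, 0.0],
    "decision-B": [0.8, 0.2, 0.1],
    "decision-C": [0.0, 0.1, 0.9],
}
# Hypothetical explicit edge: decision-B was CAUSED_BY decision-A.
caused_by = {("decision-B", "decision-A")}

similar = top_k([1.0, 0.0, 0.0], embeddings, 2)
is_causal = ("decision-B", "decision-A") in caused_by
```

The similarity ranking says decisions A and B resemble the query; only the explicit edge says that one caused the other.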
The interactive comparison below lets you examine all four data models side by side and test each against representative enterprise query types.
Diagram: Data Model Comparison Explorer
Data Model Comparison Explorer
Type: infographic
sim-id: data-model-comparison
Library: p5.js
Status: Specified
Bloom Level: Analyze (L4)
Bloom Verb: Compare and differentiate
Learning Objective: Learners can compare the Relational, RDF, LPG, and Vector Store data models across five performance dimensions and justify which model is appropriate for a given enterprise query type.
Layout: Two sections — a model visualizer (top ~60%) and a query benchmark table (bottom ~40%).
Top section: four tab buttons
- "Relational", "RDF / Triplestore", "Labeled Property Graph", "Vector Store"
- Active tab highlighted in indigo; inactive tabs in light gray
- Clicking a tab animates transition to that model's visualization
Model visualizations (supply-chain scenario, consistent across all four panels):
- Relational: Three tables (Suppliers, Products, SuppliesJunction). Foreign key columns highlighted. A "relationship attribute" column (lead_time_days) shown in the junction table with a callout: "Attribute lives in a separate table — requires JOIN to retrieve."
- RDF / Triplestore: Seven triples in <subject> <predicate> <object> notation. A reification block shows four additional triples plus a blank node required to encode a single edge property. Callout: "4+ triples needed to express one attributed relationship."
- Labeled Property Graph: Three nodes (Supplier, Product, Warehouse) connected by two typed, directed edges. Each edge shows inline property badge (lead_time_days: 14). Callout: "Edge holds its own properties — retrieved in the same traversal step."
- Vector Store: 2D t-SNE scatter of six document embedding points. Arrows show k-nearest-neighbor query result. Callout: "Finds semantically similar things — but cannot express why they connect."
Bottom section: query benchmark table (5 rows × 4 columns)
- Query types (rows): Single-entity lookup | Two-hop traversal | Five-hop traversal | Semantic similarity | Causal chain audit
- Models (columns): Relational | RDF | LPG | Vector Store

Cell values (color-coded: green=Fast, yellow=Moderate, orange=Slow, red=Very Slow, gray=N/A):
- Single-entity lookup: all four Fast
- Two-hop traversal: Relational Moderate, RDF Moderate, LPG Fast, Vector N/A
- Five-hop traversal: Relational Very Slow, RDF Slow, LPG Fast, Vector N/A
- Semantic similarity: Relational Slow, RDF Slow, LPG Moderate, Vector Fast
- Causal chain audit: Relational Very Slow, RDF Slow, LPG Fast, Vector N/A

Interactive behavior:
- Hover any benchmark cell: one-sentence tooltip explaining the rating (e.g., "Five-hop traversal in a relational database requires five JOIN operations whose combined cost grows super-linearly with table size")
- Hover any structural element in the visualization: tooltip showing its name and role in the model
- Tab transitions use a 200ms crossfade animation
Canvas: Responsive width, 500px height. Color palette: Indigo/orange/teal consistent with book.
The table below consolidates the three-way architectural comparison as a reference:
| Dimension | Relational (RDBMS) | RDF Triplestore | Labeled Property Graph |
|---|---|---|---|
| Data model | Tables, rows, columns | Subject-predicate-object triples | Nodes, edges, labels, properties |
| Multi-hop performance | Degrades super-linearly with hops | Degrades with reification overhead | Constant per hop (index-free adjacency) |
| World assumption | Closed | Open | Closed |
| Edge properties | Via junction table (separate JOIN) | Requires blank-node reification | Native — first-class on the edge |
| Query language | SQL | SPARQL | Cypher / GQL |
| Schema flexibility | Rigid (DDL required before write) | Flexible | Flexible with optional enforcement |
| Practical enterprise scale | Billions of rows | Tens of millions of triples | Tens of billions of edges |
Serialization: Moving Graphs Between Systems
A graph serialization format is a file format for encoding a graph's structure and data as portable text or binary. Serialization enables import/export between databases, backup and restore, integration testing with fixture data, and API-based data exchange between graph platforms.
Two formats dominate in enterprise LPG contexts. Both encode the same conceptual structure: serialized nodes carry their labels and properties, and serialized edges carry their type, directionality, endpoint node identifiers, and edge properties. The differences lie in encoding syntax, tooling ecosystem, and verbosity at scale.
GraphML is an XML-based format that originated in the graph drawing research community and is widely associated with tools such as the yEd graph editor, which uses it as a native file format. GraphML encodes nodes and edges as XML elements; properties are declared once as <key> elements and attached to individual nodes and edges as nested <data> elements. Its verbosity makes it suitable for small and medium graphs but inefficient at large scale. GraphML is widely supported across graph visualization tools and is a reasonable choice for human-readable interchange and archival purposes.
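As a rough illustration of that structure, the following Python sketch writes a two-node graph to GraphML using only the standard library. The sample data and the `to_graphml` helper are hypothetical. GraphML has no built-in notion of node labels or edge types, so storing them as ordinary data keys named `label` and `type` is an assumption made here for illustration; real tools use their own conventions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Hypothetical sample data: node id -> properties, and
# (source, target, properties) tuples for directed edges.
NODES = {
    "s1": {"label": "Supplier", "name": "Acme"},
    "p1": {"label": "Product", "name": "Widget"},
}
EDGES = [("s1", "p1", {"type": "SUPPLIES", "since": "2021"})]


def _tag(name):
    """Qualify an element name with the GraphML namespace."""
    return f"{{{GRAPHML_NS}}}{name}"


def to_graphml(nodes, edges):
    """Serialize nodes and directed edges to a GraphML string."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(_tag("graphml"))

    # Declare every property once as a <key>, scoped to nodes or edges.
    node_keys = sorted({k for props in nodes.values() for k in props})
    edge_keys = sorted({k for _, _, props in edges for k in props})
    for scope, names in (("node", node_keys), ("edge", edge_keys)):
        for name in names:
            ET.SubElement(root, _tag("key"), {
                "id": f"{scope}_{name}", "for": scope,
                "attr.name": name, "attr.type": "string",
            })

    graph = ET.SubElement(
        root, _tag("graph"), {"id": "G", "edgedefault": "directed"}
    )
    for node_id, props in nodes.items():
        node = ET.SubElement(graph, _tag("node"), {"id": node_id})
        for name, value in props.items():
            data = ET.SubElement(node, _tag("data"), {"key": f"node_{name}"})
            data.text = str(value)
    for index, (source, target, props) in enumerate(edges):
        edge = ET.SubElement(graph, _tag("edge"), {
            "id": f"e{index}", "source": source, "target": target,
        })
        for name, value in props.items():
            data = ET.SubElement(edge, _tag("data"), {"key": f"edge_{name}"})
            data.text = str(value)
    return ET.tostring(root, encoding="unicode")


graphml_text = to_graphml(NODES, EDGES)
```

Even for this tiny graph, the key declarations and per-property elements show where the verbosity comes from.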
GraphSON is a JSON-based format developed as part of the Apache TinkerPop framework. GraphSON is typically more compact than GraphML for large graphs and integrates naturally with web services and modern data pipelines that already process JSON. GraphSON 3.0, which embeds type information alongside each value, is the default in recent TinkerPop 3.x releases. Support is strongest in TinkerPop-enabled systems such as JanusGraph and the Gremlin interface of Amazon Neptune; databases outside that ecosystem, such as Neo4j and TigerGraph, generally rely on their own import and export tooling or on plugins, so check format support before committing to a pipeline.
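The sketch below builds one edge in the style of GraphSON 3.0's typed envelopes (`@type` and `@value` pairs) using Python's `json` module. It is an approximation for illustration only: the helper names are hypothetical, only integers and floats are wrapped, and the exact field layout should be checked against the TinkerPop IO reference before exchanging data with a real system.

```python
import json


def typed(value):
    """Wrap a Python number in a GraphSON-3.0-style type envelope.

    Strings and booleans are left as plain JSON values in this sketch.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return {"@type": "g:Int64", "@value": value}
    if isinstance(value, float):
        return {"@type": "g:Double", "@value": value}
    return value


def edge_to_graphson(edge_id, label, out_v, in_v, properties):
    """Build a GraphSON-style dict for one directed edge (out_v -> in_v)."""
    return {
        "@type": "g:Edge",
        "@value": {
            "id": typed(edge_id),
            "label": label,
            "outV": typed(out_v),
            "inV": typed(in_v),
            "properties": {
                key: {
                    "@type": "g:Property",
                    "@value": {"key": key, "value": typed(value)},
                }
                for key, value in properties.items()
            },
        },
    }


# Hypothetical edge: vertex 1 -[SUPPLIES {since: 2021}]-> vertex 2.
edge = edge_to_graphson(13, "SUPPLIES", 1, 2, {"since": 2021})
edge_json = json.dumps(edge, indent=2)
```

Because the result is ordinary JSON, any JSON-aware pipeline can carry it without a graph-specific parser.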
The choice between the two is primarily driven by tooling context: if the downstream consumer speaks XML (a legacy ETL pipeline, an XML-native integration platform), GraphML is the path of least resistance. If the downstream consumer speaks JSON (a REST API, a cloud data pipeline, a Python application), GraphSON is the natural choice.
Summary and Key Takeaways¶
This chapter has established the LPG vocabulary that every subsequent chapter builds on. The concepts covered, in pedagogical order, are:
- A knowledge graph represents entities and their relationships as a network, where connections carry as much meaning as the entities themselves.
- An LPG structures that network with nodes (entities), edges (named, directed relationships), node labels (type categories), edge types (relationship semantics), and properties (key-value attributes on both nodes and edges).
- A graph schema governs what labels, types, and properties are valid; the closed world assumption ensures that absent facts are treated as false, not unknown.
- Native graph databases use index-free adjacency so that each hop costs roughly the same regardless of total graph size, enabling multi-hop queries that are impractical at scale in relational systems.
- Cypher and its standards descendants (openCypher and GQL) express graph queries as ASCII-art patterns, separating the what to find from the how to traverse.
- Graph pattern matching finds matching subgraphs; graph traversal visits nodes systematically; path queries return the sequence of nodes and edges between two endpoints.
- Graph indexes accelerate entry-point lookups; graph algorithms (PageRank, shortest path, community detection) compute properties of graph topology that no single node's data can yield.
- RDF lacks scalability because edge attributes require verbose reification, and its open world assumption makes negative query results semantically unreliable for enterprise automation.
- LPGs and vector stores are complementary: LPGs answer structural, causal, and traversal questions; vector stores answer semantic-similarity questions.
- GraphML and GraphSON are the dominant serialization formats for LPG portability, with GraphSON being the JSON-native choice for modern pipelines.
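The difference between the closed and open world assumptions summarized above can be sketched in a few lines of Python. The `FACTS` set and both function names are hypothetical, and using `None` to stand for "unknown" is a simplification chosen for this illustration.

```python
# Hypothetical fact store: (subject, relationship, object) tuples.
FACTS = {("alice", "APPROVED", "contract_42")}


def holds_closed_world(facts, fact):
    """Closed world: a fact that is not recorded is treated as false."""
    return fact in facts


def holds_open_world(facts, fact):
    """Open world: a fact that is not recorded is unknown, not false."""
    return True if fact in facts else None


query = ("bob", "APPROVED", "contract_42")
closed_answer = holds_closed_world(FACTS, query)
open_answer = holds_open_world(FACTS, query)
```

An automated control that asks "did anyone other than Alice approve this?" can act on the closed-world answer, whereas the open-world answer leaves the question undecided.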
Quick Check: Which Data Model?
Three enterprise questions. Which data model, or which style of query, is the right tool for each?
- "Find all products whose tier-1 suppliers are in the same country as an active regulatory investigation, where the warehouse stocking those products has capacity below 10,000 units."
- "Find the five compliance memos most semantically similar to this new regulation text."
- "Show the complete chain of approvals — with timestamps and approver names — that authorized this $2M contract."
Click to reveal:
Answer:
- (1) Labeled Property Graph: this is a three-hop structural traversal (supplier → product → warehouse → regulatory flag) with property filters at each node. A relational JOIN chain would be slow at scale and the pattern would be brittle to maintain.
- (2) Vector store: this is a semantic similarity query over document embeddings. Knowledge graphs encode structural relationships, not semantic proximity between unstructured text.
- (3) LPG path query: the answer is not a set of matching nodes but the path itself, meaning the ordered sequence of approval edges with their properties. Cypher's variable-length path syntax returns this naturally.
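To show what "the path itself" means in the third answer, here is a minimal Python sketch that returns the ordered approval edges between two nodes using breadth-first search. Everything in it is hypothetical: the stage names, approvers, timestamps, and the `find_path` helper are invented for illustration, and a graph database would express the same idea declaratively rather than with hand-written search code.

```python
from collections import deque

# Hypothetical approval graph: node -> list of (edge_type, properties, target).
APPROVALS = {
    "draft": [("APPROVED", {"approver": "R. Diaz", "at": "2024-03-01"}, "legal_review")],
    "legal_review": [("APPROVED", {"approver": "K. Osei", "at": "2024-03-04"}, "finance_review")],
    "finance_review": [("APPROVED", {"approver": "M. Chen", "at": "2024-03-07"}, "signed")],
}


def find_path(adjacency, start, goal):
    """Return the edges on a shortest path from start to goal, in order.

    Each entry is (source, edge_type, properties, target); returns None
    when the goal cannot be reached.
    """
    came_from = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for edge_type, props, target in adjacency.get(node, []):
            if target not in came_from:
                came_from[target] = (node, edge_type, props)
                queue.append(target)
    if goal not in came_from:
        return None

    # Walk the recorded predecessors backwards, then reverse.
    path = []
    node = goal
    while came_from[node] is not None:
        previous, edge_type, props = came_from[node]
        path.append((previous, edge_type, props, node))
        node = previous
    path.reverse()
    return path


approval_chain = find_path(APPROVALS, "draft", "signed")
approvers = [props["approver"] for _, _, props, _ in approval_chain]
```

The result keeps each edge's properties alongside its endpoints, which is exactly the audit trail the question asks for.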
Chapter 1 Complete!
You now hold the full LPG vocabulary: nodes, edges, labels, types, properties, schema, Cypher, GQL, traversal, path queries, indexes, and algorithms. You've seen why native graph storage beats relational databases at multi-hop traversal, why RDF's open world assumption is a liability in closed enterprise systems, and why vector stores and knowledge graphs answer fundamentally different questions. That is a rigorous foundation. Chapter 2 scales this model up to the enterprise level — multiple source systems, entity resolution across heterogeneous data, and federated graph architectures. Every node tells a story. Let's trace the why!