
Performance, Metrics, and Benchmarking

Summary

This chapter explores the performance characteristics that make graph databases excel at relationship-heavy workloads. You'll learn about index-free adjacency's constant-time neighbor access, understand key performance metrics like hop count and node degree, and master indexing strategies including vector indexes and composite indexes. The chapter covers industry-standard benchmarks like LDBC SNB and Graph 500, teaching you how to measure query latency, throughput, and scalability while comparing traversal costs against traditional join operations.

Concepts Covered

This chapter covers the following 20 concepts from the learning graph:

  1. Hop Count
  2. Degree of Node
  3. Indegree
  4. Outdegree
  5. Edge-to-Node Ratio
  6. Graph Indexes
  7. Vector Indexes
  8. Full-Text Search
  9. Composite Indexes
  10. Graph Metrics
  11. Performance Benchmarking
  12. Synthetic Benchmarks
  13. Single-Node Benchmarks
  14. Multi-Node Benchmarks
  15. LDBC SNB Benchmark
  16. Graph 500
  17. Query Cost Analysis
  18. Join Operations
  19. Traversal Cost
  20. Scalability

Prerequisites

This chapter builds on concepts from:


Show Me the Numbers: Why Skepticism Makes You a Better Engineer

Let's get something straight: If your colleagues are skeptical when you propose using a graph database, that's a good sign. It means you work with real engineers who don't just chase the latest trend or trust vendor marketing. Skepticism is the foundation of good engineering.

You've probably heard the claims: "Graph databases are 1000× faster for connected data!" "Traversals that take minutes in SQL run in milliseconds!" "This will revolutionize our data layer!"

And you should absolutely, positively, 100% question those claims.

Here's the thing though: good engineers don't just stay skeptical—they test. They measure. They benchmark. They gather data. And then they make informed decisions based on evidence, not hype.

This chapter is about becoming that engineer. The one who walks into a meeting with actual benchmark results. The one who can say, "I tested this on our data model, and here's what I found." The one whose opinion carries weight because it's backed by numbers, not vendor promises.

When you propose a graph database (or any technology) and someone says, "Prove it," you'll be able to say, "I did. Here are the benchmarks." That's how you build a reputation as someone who evaluates objectively, measures rigorously, and makes data-driven decisions.

Your skeptical colleagues will respect that. And honestly? You should be skeptical too. Trust, but verify. Or better yet: verify, then trust.

Ready to learn how to benchmark graph databases properly? Let's dive into the metrics that matter, the tests that prove (or disprove) the claims, and how to build your reputation as the engineer who brings receipts.

Graph Metrics: The Numbers That Tell the Story

Before we can benchmark anything, we need to understand what to measure. Graph databases have some unique performance characteristics tied to graph structure. Let's break down the key metrics.

Hop Count: Measuring Relationship Distance

Hop count is the number of edges you traverse to get from one node to another. It's one of the most important performance metrics for graph databases.

Why it matters: In graph databases, query performance is heavily influenced by how many hops you need to traverse. Remember the performance cliff from Chapter 1? That's hop count in action.

Examples:

- 1-hop: Alice's direct friends → (alice)-[:FRIEND_OF]->(friend)
- 2-hop: Friends of friends → (alice)-[:FRIEND_OF]->()-[:FRIEND_OF]->(fof)
- 3-hop: Friends of friends of friends → (alice)-[:FRIEND_OF*3]-(connection)

Performance relationship: - Graph databases: Linear or near-linear scaling with hop count - RDBMS with JOINs: Exponential degradation with hop count

Measuring hop count in queries:

// Find connections up to 3 hops away and count path lengths
MATCH path = (alice:Person {name: "Alice"})-[:FRIEND_OF*1..3]-(connected)
RETURN length(path) AS hops, count(connected) AS connections
ORDER BY hops;

Benchmark insight: When you benchmark, always measure performance across different hop counts (1, 2, 3, 4, 5). This shows whether the database maintains linear performance or hits the exponential cliff.
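
To generate that curve yourself, a minimal Python sketch like the one below (it assumes a local Neo4j instance, the Person/FRIEND_OF model used in this chapter, and the official neo4j driver) times the same query at increasing hop counts:

# A minimal hop-count sweep against a local Neo4j instance.
from neo4j import GraphDatabase
import time

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def time_hops(name, max_hops=5, runs=20):
    with driver.session() as session:
        for hops in range(1, max_hops + 1):
            # Variable-length bounds can't be parameterized, so build the pattern per hop
            query = (
                f"MATCH (p:Person {{name: $name}})-[:FRIEND_OF*1..{hops}]-(c) "
                "RETURN count(DISTINCT c)"
            )
            session.run(query, name=name).consume()  # warm-up run
            start = time.time()
            for _ in range(runs):
                session.run(query, name=name).consume()
            avg_ms = (time.time() - start) * 1000 / runs
            print(f"{hops} hops: {avg_ms:.1f} ms")

time_hops("Alice")
driver.close()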

Degree of Node: Measuring Connectivity

Degree is the number of relationships connected to a node. High-degree nodes (hubs) can significantly impact query performance.

Why it matters: Traversing from a high-degree node can be expensive if you're following all relationships. A celebrity with 10 million followers is a very different query than a regular user with 200 friends.

Types of degree:

Degree (total):

// Count all relationships (incoming + outgoing)
MATCH (alice:Person {name: "Alice"})-[r]-()
RETURN count(r) AS degree;

Use case: Understanding overall connectivity, finding hub nodes, identifying influencers

Indegree: Incoming Relationships

Indegree is the number of incoming relationships (edges pointing to the node).

// Count incoming relationships
MATCH (alice:Person {name: "Alice"})<-[r]-()
RETURN count(r) AS indegree;

Examples: - Social media: Follower count (indegree) vs following count (outdegree) - Citation graphs: How many papers cite this paper (indegree) - Supply chain: How many suppliers feed into this factory (indegree)

Benchmark insight: High-indegree nodes can slow down certain queries. When benchmarking, test queries that traverse from high-indegree nodes to see if performance degrades.

Outdegree: Outgoing Relationships

Outdegree is the number of outgoing relationships (edges pointing from the node).

// Count outgoing relationships
MATCH (alice:Person {name: "Alice"})-[r]->()
RETURN count(r) AS outdegree;

Examples: - Social media: Following count - Citation graphs: How many papers this paper cites - Supply chain: How many customers this supplier serves

Performance consideration: Expanding from a high-outdegree node (e.g., a user who follows 50,000 people) can be slow if you need to process all relationships.

Degree Distribution: The Shape of Your Graph

Real-world graphs often follow a power-law distribution: most nodes have low degree, a few nodes have very high degree (the "hubs").

Why this matters for benchmarking:

- Queries that hit low-degree nodes perform differently than queries hitting hubs
- Benchmarks should test both average-degree and high-degree scenarios
- Your synthetic test data should match the degree distribution of your real data

Finding high-degree nodes:

// Find top 10 most connected people
MATCH (p:Person)-[r]-()
RETURN p.name, count(r) AS degree
ORDER BY degree DESC
LIMIT 10;

Benchmark best practice: When generating synthetic data, make sure degree distribution matches production. If your production graph has power-law distribution, your benchmark should too.
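
A hedged way to produce such a distribution is to sample degrees from a Zipf (power-law) distribution; the exponent and degree cap below are placeholders you would fit to your production graph:

# Generate a power-law degree sequence for synthetic test data.
# The exponent (2.5) and cap (50,000) are illustrative, not fitted values.
import numpy as np

rng = np.random.default_rng(42)

def power_law_degrees(num_nodes, exponent=2.5, max_degree=50_000):
    # Zipf-distributed degrees: many small values, a few huge hubs
    degrees = rng.zipf(exponent, size=num_nodes)
    return np.clip(degrees, 1, max_degree)

degrees = power_law_degrees(1_000_000)
print("mean degree:", degrees.mean())
print("max degree (biggest hub):", degrees.max())
print("share of nodes with degree <= 5:", (degrees <= 5).mean())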

Edge-to-Node Ratio: Graph Density

Edge-to-node ratio measures how densely connected your graph is.

Formula:

Edge-to-Node Ratio = Total Edges / Total Nodes

Examples: - Sparse graph: E/N = 2 (average node has 2 connections) - Medium density: E/N = 10 (average node has 10 connections) - Dense graph: E/N = 100 (average node has 100 connections)

Why it matters: Dense graphs behave differently than sparse graphs. Variable-length path queries on dense graphs can explode combinatorially.

Measuring your graph:

// Calculate edge-to-node ratio
MATCH (n)
WITH count(n) AS nodeCount
MATCH ()-[r]->()
RETURN count(r) AS edgeCount,
       nodeCount,
       count(r) * 1.0 / nodeCount AS ratio;

Benchmark insight: Always report edge-to-node ratio when publishing benchmark results. A benchmark on a sparse graph (E/N = 2) vs dense graph (E/N = 50) tells very different stories.

Graph Indexes: Making Queries Fast

Indexes are critical for graph database performance. You can't benchmark properly without understanding how indexes work.

Graph Indexes: Finding Starting Points

Graph indexes help you find nodes quickly to start your traversal. Without indexes, you're scanning every node—slow and painful.

Why they matter: Most Cypher queries start with a MATCH that finds specific nodes:

MATCH (alice:Person {name: "Alice"})-[:FRIEND_OF]->(friend)
RETURN friend;

That {name: "Alice"} lookup needs an index, or the database scans all Person nodes.

Creating indexes in Neo4j:

// Index on Person.name
CREATE INDEX person_name FOR (p:Person) ON (p.name);

// Index on Person.email
CREATE INDEX person_email FOR (p:Person) ON (p.email);

// Index on Product.sku
CREATE INDEX product_sku FOR (p:Product) ON (p.sku);

Benchmark impact: - Without index: Query scans all nodes → O(n) time - With index: Query uses index seek → O(log n) time

On a 10-million-node graph: - Without index: ~5 seconds - With index: ~5 milliseconds

That's a 1000× difference!

Benchmark best practice: When comparing graph databases to RDBMS, make sure both have appropriate indexes. An unfair comparison (indexed RDBMS vs non-indexed graph DB) proves nothing.
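
To see the index effect on your own data, a hedged sketch like this (it assumes Neo4j 4.4+ syntax for IF EXISTS / IF NOT EXISTS and reuses the person_name index from above; don't drop indexes on a shared instance) times the same lookup before and after the index exists:

# Time the same lookup with and without an index (test instance only!).
from neo4j import GraphDatabase
import time

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
LOOKUP = "MATCH (p:Person {name: $name}) RETURN p"

def time_lookup(session, runs=20):
    start = time.time()
    for _ in range(runs):
        session.run(LOOKUP, name="Alice").consume()
    return (time.time() - start) * 1000 / runs

with driver.session() as session:
    session.run("DROP INDEX person_name IF EXISTS").consume()
    print(f"no index:   {time_lookup(session):.1f} ms")   # full label scan

    session.run("CREATE INDEX person_name IF NOT EXISTS "
                "FOR (p:Person) ON (p.name)").consume()
    session.run("CALL db.awaitIndexes()").consume()        # wait until it's online
    print(f"with index: {time_lookup(session):.1f} ms")    # index seek
driver.close()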

Vector Indexes: Similarity Search at Scale

Vector indexes enable similarity search and nearest-neighbor queries, crucial for recommendation systems and AI applications.

What they do: Store high-dimensional vectors (embeddings) and enable fast similarity search.

Example use case:

// Find products similar to what Alice bought (using vector embeddings)
MATCH (alice:Person {name: "Alice"})-[:PURCHASED]->(product:Product)
CALL db.index.vector.queryNodes('productEmbeddings', 5, product.embedding)
YIELD node AS similar, score
RETURN similar.name, score
LIMIT 10;

Why this matters for graph databases: Modern applications combine graph relationships with vector similarity. You might: - Find friends who like similar content (graph + vectors) - Recommend products based on purchase patterns AND product similarity - Detect fraud rings using network structure AND behavioral embeddings

Benchmark consideration: If your use case involves similarity search, benchmark vector index performance alongside graph traversal. Many graph databases now support vector indexes natively (Neo4j, Amazon Neptune, TigerGraph).

Full-Text Search: Finding Nodes by Content

Full-text search indexes allow searching node properties using text matching, wildcards, and fuzzy search.

Creating full-text index:

// Create full-text index on Person name and bio
CREATE FULLTEXT INDEX personSearch FOR (p:Person) ON EACH [p.name, p.bio];

Using full-text search:

// Find people whose name or bio contains "graph database"
CALL db.index.fulltext.queryNodes('personSearch', 'graph database')
YIELD node, score
RETURN node.name, node.bio, score
ORDER BY score DESC;

Benchmark scenario: If your application needs text search (e.g., "find all customers who mentioned 'defect' in support tickets"), include full-text queries in your benchmark suite.

Composite Indexes: Multi-Property Lookups

Composite indexes index multiple properties together, enabling fast lookups on property combinations.

Example:

// Composite index on city and age
CREATE INDEX person_city_age FOR (p:Person) ON (p.city, p.age);

// Fast lookup: people in Seattle aged 30-40
MATCH (p:Person)
WHERE p.city = "Seattle" AND p.age >= 30 AND p.age <= 40
RETURN p.name, p.age;

Why composite indexes matter: Single-property indexes only help when filtering on that property. If you filter on city AND age, a composite index can use both properties for the lookup.

Benchmark insight: When benchmarking, test realistic query patterns. If your application frequently filters on multiple properties, composite indexes can dramatically improve performance.

Performance Benchmarking: Measuring What Matters

Now we get to the heart of this chapter: actually measuring performance. This is where you stop trusting vendor claims and start generating your own data.

Why Benchmark?

Good reasons to benchmark:

1. Verify vendor claims: Is it really 1000× faster? Let's find out.
2. Compare alternatives: RDBMS vs graph DB vs document DB—which fits your data?
3. Capacity planning: How many QPS can this system handle?
4. Optimize queries: Which query pattern is fastest?
5. Build credibility: Walk into meetings with data, not opinions.

Bad reasons to benchmark:

1. ❌ "I want graphs to win" (confirmation bias)
2. ❌ "I need to justify a decision I already made" (motivated reasoning)
3. ❌ "The vendor showed me a benchmark" (trusting without verifying)

The mindset: A good benchmark should be designed to find the truth, even if that truth is "graph databases aren't right for this use case."

What to Measure

Key performance metrics:

  1. Query latency: How long does one query take?
     - Median (p50): the typical case
     - 95th percentile (p95): what the slowest 5% of requests experience
     - 99th percentile (p99): tail latency, the worst case most users will ever see
     - Max: absolute worst case

  2. Query throughput: How many queries per second (QPS) can the system handle?
     - Under light load (10 concurrent users)
     - Under realistic load (100 concurrent users)
     - Under stress (1000+ concurrent users)

  3. Scalability: How does performance change as data grows?
     - 1 million nodes
     - 10 million nodes
     - 100 million nodes
     - 1 billion nodes

  4. Resource usage:
     - CPU utilization
     - Memory consumption
     - Disk I/O
     - Network bandwidth (for distributed systems)

Measuring latency in Cypher:

// Use PROFILE to see actual execution time
PROFILE
MATCH (alice:Person {name: "Alice"})-[:FRIEND_OF*2]-(fof)
RETURN count(DISTINCT fof);

External benchmarking: For accurate measurements, use external tools that measure end-to-end latency including network overhead: - Apache JMeter - Gatling - Custom scripts with time measurements
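
If you'd rather script it yourself, a rough sketch along these lines (the connection details, query, and 30-second window are placeholders; each worker thread opens its own session because neo4j sessions are not thread-safe) measures sustained QPS at different concurrency levels:

# Custom throughput script: N worker threads run the same query for a fixed
# window, and we report total queries per second.
from neo4j import GraphDatabase
from concurrent.futures import ThreadPoolExecutor
import time

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
QUERY = "MATCH (p:Person {name: $name})-[:FRIEND_OF]->(f) RETURN count(f)"

def worker(duration_s):
    done = 0
    deadline = time.time() + duration_s
    with driver.session() as session:           # one session per thread
        while time.time() < deadline:
            session.run(QUERY, name="Alice").consume()
            done += 1
    return done

def measure_qps(concurrency, duration_s=30):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(worker, [duration_s] * concurrency))
    return sum(counts) / duration_s

for level in (10, 100):                          # light vs realistic load
    print(f"{level} workers: {measure_qps(level):.0f} QPS")
driver.close()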

Synthetic Benchmarks: Controlled Testing

Synthetic benchmarks use generated data and predefined queries to test specific performance characteristics.

Advantages:

- Controlled: you control data size, structure, and query patterns
- Reproducible: others can replicate your results
- Scalable: easy to generate 1M, 10M, 100M node datasets
- Comparable: industry-standard benchmarks let you compare across systems

Disadvantages:

- Artificial: may not reflect your real workload
- Optimizable: vendors can tune for specific benchmarks
- Misleading: high synthetic performance doesn't guarantee real-world performance

When to use synthetic benchmarks:

- Comparing different graph databases
- Testing specific features (shortest path, variable-length paths)
- Capacity planning (how does it scale?)
- Proving or disproving specific performance claims

Example synthetic workload:

Dataset: 10 million Person nodes, 100 million FRIEND_OF edges
Queries:
  Q1: Find person by name (index lookup)
  Q2: Find person's friends (1-hop traversal)
  Q3: Find friends of friends (2-hop traversal)
  Q4: Find shortest path between two people
  Q5: Count mutual friends between two people
  Q6: Find people within 3 hops

Single-Node Benchmarks: Testing One Server

Single-node benchmarks test performance on a single database server (not distributed).

What they measure: - Raw query performance - Index efficiency - Memory management - Single-machine limits

Example setup: - Hardware: 64GB RAM, 16 CPU cores, SSD storage - Database: Neo4j Community Edition, single instance - Dataset: LDBC SNB SF10 (10GB scale factor) - Workload: Interactive queries (1-7 hops)

Why single-node benchmarks matter: Most applications start with a single database server. Understanding single-node performance helps you know when you need to scale out.

Benchmark best practice: Always specify hardware specs when publishing results. "Graph DB is faster" means nothing without context. "Graph DB on 64GB machine vs PostgreSQL on 64GB machine" is actionable data.

Multi-Node Benchmarks: Testing Distributed Systems

Multi-node benchmarks test performance on distributed database clusters.

What they measure: - Horizontal scalability (add more servers = more performance?) - Network overhead - Distributed query coordination - Fault tolerance

Example setup: - Cluster: 10 nodes × 64GB RAM each - Database: TigerGraph distributed cluster - Dataset: Graph500 scale 30 (1 billion edges) - Workload: BFS, PageRank, Connected Components

Key metrics:

- Speedup: how much faster is 10 nodes vs 1 node?
  - Linear speedup: 10× faster (ideal)
  - Sub-linear: 5× faster (realistic)
  - Poor scaling: 2× faster
- Efficiency: speedup / number of nodes
  - 100% efficiency: 10 nodes = 10× faster
  - 50% efficiency: 10 nodes = 5× faster
  - 20% efficiency: 10 nodes = 2× faster (not good)
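
Speedup and efficiency are simple ratios; a small helper (the throughput numbers in the example are illustrative) keeps the arithmetic honest:

# Speedup and parallel efficiency from two throughput measurements.
def speedup(single_node_qps, cluster_qps):
    return cluster_qps / single_node_qps

def efficiency(single_node_qps, cluster_qps, num_nodes):
    return speedup(single_node_qps, cluster_qps) / num_nodes

# Example: a 10-node cluster that is 5x faster than one node
print(speedup(1_000, 5_000))          # 5.0x speedup
print(efficiency(1_000, 5_000, 10))   # 0.5 -> 50% efficiency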

Why multi-node benchmarks matter: If you're considering a distributed graph database for billion-node graphs, you need to know if it actually scales or just adds operational complexity.

Industry-Standard Benchmarks: The Tests Everyone Uses

Instead of creating benchmarks from scratch, you can use industry-standard benchmarks that have been carefully designed and widely adopted.

LDBC SNB: The Social Network Benchmark

LDBC Social Network Benchmark (SNB) is the gold standard for graph database benchmarking. It simulates a social network like Facebook or LinkedIn.

What it includes:

- Data generator: creates synthetic social network data at various scale factors
- Workloads:
  - Interactive: short, frequent queries (OLTP-style)
  - Business Intelligence (BI): complex analytical queries (OLAP-style)
  - Graph Analytics: algorithm benchmarks

Scale factors (the number roughly matches the size of the generated dataset, with person counts growing proportionally):

- SF1: ~1GB of data
- SF10: ~10GB of data
- SF100: ~100GB of data
- SF1000: ~1TB of data

Example queries: - IC1: Find friends and recent posts (1-2 hops) - IC2: Find recent messages from friends (1-hop + filtering) - IC3: Find friends who know specific people in specific locations (2-3 hops) - IC13: Find shortest path between two people

Why LDBC SNB is valuable:

1. Realistic: based on actual social network patterns
2. Comprehensive: tests many query types
3. Scalable: multiple scale factors
4. Comparable: published results from Neo4j, TigerGraph, Amazon Neptune, etc.
5. Audited: results can be officially audited for fairness

Running LDBC SNB:

# Generate SF10 dataset
./ldbc_snb_datagen --scale-factor 10

# Load into Neo4j
./ldbc_snb_loader --database neo4j --data-dir ./sf10

# Run interactive workload
./ldbc_snb_driver --database neo4j --workload interactive

Benchmark credibility: If you publish LDBC SNB results, people will take them seriously. It's the industry standard.

Graph 500: The Supercomputer Benchmark

Graph 500 is a benchmark for very large-scale graph processing, originally designed for supercomputers and HPC systems.

What it measures:

- Breadth-First Search (BFS): starting from a source vertex, visit all reachable vertices
- Single-Source Shortest Paths (SSSP): find shortest paths from one vertex to all others
- Performance metric: Traversed Edges Per Second (TEPS)

Scale (vertex counts are 2^scale, with roughly 16 edges per vertex):

- Scale 20: ~1 million vertices, ~17 million edges
- Scale 30: ~1 billion vertices, ~17 billion edges
- Scale 40: ~1 trillion vertices, ~17 trillion edges

Example problem:

Given: Graph with 1 billion vertices, 17 billion edges
Task: Perform BFS from a random starting vertex
Metric: How many edges per second can you traverse?

Top supercomputers: tens to hundreds of trillions of TEPS
Typical graph DB: ~1-10 million TEPS (very different scale!)
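
TEPS itself is simple arithmetic: edges traversed divided by elapsed seconds. The toy sketch below (an in-memory BFS, not the official Graph 500 kernel or its validation rules) shows the measurement:

# TEPS = edges traversed / elapsed seconds, here on a tiny adjacency list.
from collections import deque
import time

def bfs_teps(adj, source):
    visited = {source}
    queue = deque([source])
    edges_traversed = 0
    start = time.time()
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            edges_traversed += 1            # every examined edge counts
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    elapsed = time.time() - start
    return edges_traversed / max(elapsed, 1e-9)

# Toy graph; Graph 500 runs the same measurement on billions of edges
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(f"{bfs_teps(adj, 0):,.0f} TEPS on a toy graph")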

Why Graph 500 matters: It shows the absolute limits of graph processing. Supercomputers achieve trillions of TEPS. Graph databases achieve millions of TEPS but with much more query flexibility.

When to use Graph 500: - Testing distributed graph databases - Comparing graph algorithms (BFS, PageRank, etc.) - Understanding theoretical limits - Not useful for typical CRUD operations

Benchmark insight: Graph 500 shows that specialized HPC systems are faster for pure graph algorithms, but graph databases offer rich query languages, ACID transactions, and operational features that HPC systems don't.

Query Cost Analysis: Understanding Performance Trade-offs

Let's get quantitative about why graph databases outperform RDBMS for relationship queries.

Join Operations: The RDBMS Approach

Join operations are how RDBMS systems traverse relationships. Each hop requires a JOIN.

Example: 3-hop friend query in SQL

-- Find friends of friends of friends
SELECT DISTINCT p4.name
FROM persons p1
JOIN friendships f1 ON p1.id = f1.person1_id
JOIN persons p2 ON f1.person2_id = p2.id
JOIN friendships f2 ON p2.id = f2.person1_id
JOIN persons p3 ON f2.person2_id = p3.id
JOIN friendships f3 ON p3.id = f3.person1_id
JOIN persons p4 ON f3.person2_id = p4.id
WHERE p1.name = 'Alice';

Cost analysis:

Per JOIN operation:

1. Find rows in the first table: O(log n) with an index, O(n) without
2. For each row, look up matching rows in the second table: O(m log k), where m = rows from the first table and k = rows in the second table
3. Build the result set: O(m × r), where r = average matches per row

For 3 hops:

- First JOIN: O(n₁ log n₂)
- Second JOIN: O(intermediate₁ log n₃)
- Third JOIN: O(intermediate₂ log n₄)

Problem: Intermediate result sets grow exponentially with hops. On a social network:

- 1 hop: 200 friends (200 rows)
- 2 hops: 200 × 200 = 40,000 rows (intermediate set)
- 3 hops: 40,000 × 200 = 8,000,000 rows (intermediate set)

The database must:

- Build intermediate result sets in memory (or spill to disk—even slower)
- Perform nested loop joins or hash joins (both expensive)
- Filter duplicates at the end

Why it's slow: The work grows exponentially with hops, even with indexes.
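
You can see how fast this blows up with a back-of-the-envelope model (it assumes a uniform branching factor of 200 and ignores de-duplication, so it overstates the final counts somewhat):

# Model of intermediate result growth per hop with a fixed branching factor.
def intermediate_rows(branching_factor, hops):
    rows = 1
    sizes = []
    for _ in range(hops):
        rows *= branching_factor
        sizes.append(rows)
    return sizes

for hops, rows in enumerate(intermediate_rows(200, 5), start=1):
    print(f"{hops} hops: {rows:,} intermediate rows")
# 1 hop: 200 ... 5 hops: 320,000,000,000 rows (before de-duplication)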

Traversal Cost: The Graph Database Approach

Traversal in graph databases follows pointers directly from node to node.

Same 3-hop query in Cypher:

MATCH (alice:Person {name: "Alice"})-[:FRIEND_OF*3]-(connection)
RETURN DISTINCT connection.name;

Cost analysis:

Per hop:

1. Find the starting node: O(log n) with an index
2. Read relationship pointers: O(1) per relationship (index-free adjacency)
3. Follow the pointer to the next node: O(1) per hop
4. Check filters: O(1) per node

For 3 hops with 200 friends per person:

- Find Alice: O(log n) ≈ 20 operations (with an index on 1M nodes)
- Traverse 1st hop: 200 pointer lookups = 200 operations
- Traverse 2nd hop: 200 × 200 = 40,000 operations
- Traverse 3rd hop: 40,000 × 200 = 8,000,000 operations

Wait, that's the same number! Why is it faster?

The difference:

1. No intermediate result sets: graph DBs don't build 40,000-row tables in memory
2. Pointer lookups vs index lookups: O(1) pointer dereference vs O(log n) index seek
3. Locality of reference: related graph data is stored together, giving better cache performance
4. No JOIN overhead: no hash tables, no nested loops, no sort-merge

In practice:

- RDBMS 3-hop: 3-10 seconds (with indexes)
- Graph DB 3-hop: 10-50 milliseconds

That's 100-1000× faster.

Why? Constant factors matter. O(1) pointer lookup vs O(log n) index lookup, multiplied by millions of operations, creates huge differences.

Comparative Benchmark: RDBMS vs Graph DB

Let's look at real benchmark numbers (these are representative of published results):

Dataset: Social network, 10 million people, 100 million friendships

Query: Find friends up to N hops away

| Hops | RDBMS (PostgreSQL) | Graph DB (Neo4j) | Speedup  |
|------|--------------------|------------------|----------|
| 1    | 12 ms              | 5 ms             | 2.4×     |
| 2    | 185 ms             | 7 ms             | 26×      |
| 3    | 3,400 ms           | 11 ms            | 309×     |
| 4    | 58,000 ms          | 14 ms            | 4,142×   |
| 5    | timeout (>10 min)  | 18 ms            | >33,000× |

Observations:

1. 1-hop: graph DB only ~2× faster (both are fast)
2. 2-3 hops: graph DB 10-300× faster (noticeable difference)
3. 4+ hops: graph DB thousands of times faster (the RDBMS becomes unusable)

The performance cliff: Around 2-3 hops, RDBMS performance falls off a cliff. Graph databases maintain near-constant performance.

Benchmark credibility: These numbers match published LDBC results and vendor benchmarks. You can replicate them yourself.

Scalability: How Systems Grow

Scalability measures how performance changes as data size or load increases.

Types of Scalability

Vertical scaling (scale up):

- Add more resources to one server (CPU, RAM, SSD)
- Limits: single-machine ceilings (~1TB RAM, ~100 CPU cores)
- Cost: expensive (high-end servers cost $$$$)

Horizontal scaling (scale out):

- Add more servers to a cluster
- Limits: coordination overhead, network bandwidth
- Cost: cheaper per node, but more operational complexity

Graph databases support both:

- Single-node: Neo4j, Amazon Neptune (serverless), Memgraph
- Distributed: TigerGraph, Neo4j Aura Enterprise, JanusGraph, Dgraph

Measuring Scalability

Ideal scaling:

- 2× data → same query time (constant performance)
- 2× servers → 2× throughput (linear horizontal scaling)

Real-world scaling:

- 2× data → 1.5× query time (sub-linear degradation, acceptable)
- 2× servers → 1.6× throughput (80% efficiency, good)

Scalability benchmark:

Test: Measure query latency as data grows

Dataset sizes: 1M, 10M, 100M, 1B nodes
Query: Find friends within 3 hops

Results:
  1M nodes: 5 ms
  10M nodes: 8 ms (1.6× slower for 10× data—excellent)
  100M nodes: 15 ms (3× slower for 100× data—good)
  1B nodes: 45 ms (9× slower for 1000× data—acceptable)

Log-scale growth: Query time grows logarithmically with data size (due to index lookups). This is excellent scalability.
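
One way to quantify that claim is to fit a growth exponent to your own measurements; the sketch below uses the example numbers from this section (a slope near 0 means roughly constant, 1 means linear, above 1 means super-linear):

# Fit a growth exponent to latency-vs-size measurements on a log-log scale.
import numpy as np

sizes = np.array([1e6, 1e7, 1e8, 1e9])
latencies_ms = np.array([5, 8, 15, 45])

slope, _ = np.polyfit(np.log10(sizes), np.log10(latencies_ms), 1)
print(f"growth exponent ~= {slope:.2f}")   # ~0.31: far below linear growth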

Poor scalability example:

RDBMS 3-hop query as data grows:

  1M nodes: 200 ms
  10M nodes: 3,000 ms (15× slower)
  100M nodes: 60,000 ms (300× slower)
  1B nodes: timeout

Super-linear growth: Query time grows faster than the data itself. This doesn't scale.

Benchmark Best Practices: Doing It Right

Now that you understand the metrics and benchmarks, let's talk about how to run fair, credible benchmarks.

1. Match the hardware:
   - Same CPU, RAM, and SSD across all systems tested
   - Document the specs: "64GB RAM, 16 cores, 1TB NVMe SSD"

2. Use appropriate indexes:
   - Don't compare an indexed graph DB to a non-indexed RDBMS
   - Create indexes that match the query patterns
   - Report which indexes you created

3. Warm up the cache:
   - Run queries multiple times before measuring
   - Report cold-cache and warm-cache performance separately

4. Measure realistic workloads:
   - Don't just test best-case queries
   - Include complex queries that stress the system
   - Test at realistic concurrency levels

5. Report distributions, not just averages:
   - Median (p50), 95th percentile (p95), 99th percentile (p99)
   - Max latency matters for user experience

6. Test at realistic data sizes:
   - If production will have 100M nodes, test at 100M nodes
   - Synthetic benchmarks at 1M nodes don't predict 100M-node performance

7. Document everything:
   - Database versions
   - Configuration settings (cache size, query timeout, etc.)
   - Data model (schema, indexes)
   - Query text
   - Hardware specs
   - Methodology (how you generated load, measured latency, etc.)

8. Share your results:
   - Publish to GitHub
   - Write a blog post
   - Present at meetups
   - Let others replicate and verify

Building Your Reputation: The Data-Driven Engineer

Here's why all this benchmarking knowledge matters for your career:

Scenario 1: The Skeptical Tech Lead

You propose using a graph database for the new recommendation engine.

Tech Lead: "Graph databases are just hype. SQL works fine."

You (without benchmarks): "But... the vendor said it's 1000× faster..."

Result: Dismissed as naive.


You (with benchmarks): "I benchmarked our data model—10M users, 100M relationships. Here's what I found:"

| Query                 | PostgreSQL | Neo4j | Speedup |
|-----------------------|------------|-------|---------|
| 2-hop friends         | 180 ms     | 7 ms  | 25×     |
| 3-hop recommendations | 3.2 sec    | 11 ms | 290×    |

"I ran LDBC SNB at SF10 scale. Graph DB stays under 50ms for all queries. PostgreSQL timeouts at 3 hops. Here's the GitHub repo with my test scripts."

Result: Tech lead respects you. You get to run a proof-of-concept.


Scenario 2: The Budget Discussion

CFO: "Why do we need to spend $50k/year on a graph database?"

You (without data): "It's faster...?"

Result: Budget denied.


You (with data): "Our current recommendation engine runs overnight batch jobs—12-hour runtime. Users see stale recommendations. I benchmarked a graph database: same recommendations in 50 milliseconds, real-time. We can retire 20 batch processing servers ($80k/year cost). ROI is positive year one."

Result: Budget approved. You're a hero.


Scenario 3: The Conference Talk

You: "I compared graph databases to relational databases for social network queries. Here are my LDBC SNB results, reproduced on identical hardware. You can replicate my benchmarks using this GitHub repo."

Result: Your talk gets accepted. Your blog post gets shared. Recruiters message you. Your reputation as a rigorous, data-driven engineer grows.


The pattern: Benchmarking isn't just about measuring databases. It's about building credibility. Engineers who bring data to discussions are respected. Engineers who trust vendor marketing are not.

Your reputation is built on moments like these. When you can say, "I tested it, here are the numbers," you become someone whose opinion matters.

Running Your Own Benchmarks: A Practical Guide

Ready to actually run some benchmarks? Here's how to get started.

Step 1: Define Your Questions

Don't benchmark randomly. Start with specific questions:

Good questions:

- "Can a graph database handle our 3-hop recommendation query in under 100ms?"
- "How does Neo4j compare to PostgreSQL for our specific data model?"
- "Will query performance degrade as we grow from 10M to 100M nodes?"

Bad questions:

- "Which database is faster?" (too vague)
- "Is Neo4j good?" (not measurable)

Step 2: Model Your Data

Create a representative data model:

// Example: E-commerce graph
CREATE (:Customer {id: 1, name: "Alice", city: "Seattle"})
CREATE (:Product {id: 101, name: "Laptop", price: 899})
CREATE (:Category {id: 1, name: "Electronics"});

// Connect the existing nodes (MATCH first, so CREATE doesn't make duplicates)
MATCH (c:Customer {id: 1}), (p:Product {id: 101}), (cat:Category {id: 1})
CREATE (c)-[:PURCHASED {date: "2024-01-15", price: 899}]->(p),
       (p)-[:IN_CATEGORY]->(cat);

Scale it: Use LDBC data generator or write scripts to generate millions of nodes.
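
If you don't use the LDBC generator, a simple script can emit CSV files for bulk loading (for example with LOAD CSV or neo4j-admin import); the sizes and power-law exponent below are placeholders to tune against your production data:

# Generate synthetic customers and purchases as CSV for bulk import.
import csv
import numpy as np

rng = np.random.default_rng(7)
NUM_CUSTOMERS, NUM_PRODUCTS = 1_000_000, 100_000

with open("customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    for i in range(NUM_CUSTOMERS):
        writer.writerow([i, f"customer_{i}"])

with open("purchases.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "product_id"])
    for i in range(NUM_CUSTOMERS):
        # Power-law purchase counts: most customers buy little, a few buy a lot
        n = min(int(rng.zipf(2.0)), 500)
        for p in rng.integers(0, NUM_PRODUCTS, size=n):
            writer.writerow([i, int(p)])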

Step 3: Write Benchmark Queries

// Q1: Find customer by ID (index lookup)
MATCH (c:Customer {id: $customerId})
RETURN c;

// Q2: Find customer's purchases (1-hop)
MATCH (c:Customer {id: $customerId})-[:PURCHASED]->(p:Product)
RETURN p;

// Q3: Find similar customers (2-hop)
MATCH (c:Customer {id: $customerId})-[:PURCHASED]->(p:Product)<-[:PURCHASED]-(similar)
RETURN similar, count(p) AS commonProducts
ORDER BY commonProducts DESC
LIMIT 10;

// Q4: Product recommendations (3-hop)
MATCH (c:Customer {id: $customerId})-[:PURCHASED]->(:Product)<-[:PURCHASED]-(similar)
     -[:PURCHASED]->(rec:Product)
WHERE NOT (c)-[:PURCHASED]->(rec)
RETURN rec, count(similar) AS score
ORDER BY score DESC
LIMIT 10;

Step 4: Measure Performance

Use Python with the Neo4j driver:

from neo4j import GraphDatabase
import time
import statistics

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def benchmark_query(query, params, iterations=100):
    latencies = []

    with driver.session() as session:
        # Warm up caches and the query plan
        for _ in range(10):
            session.run(query, params).consume()

        # Measure: consume() drains the result, so we time the full query,
        # not just the time to send it
        for _ in range(iterations):
            start = time.time()
            session.run(query, params).consume()
            latencies.append((time.time() - start) * 1000)  # ms

    return {
        "p50": statistics.median(latencies),
        "p95": statistics.quantiles(latencies, n=20)[18],  # 95th percentile
        "p99": statistics.quantiles(latencies, n=100)[98],  # 99th percentile
        "mean": statistics.mean(latencies)
    }

# Run benchmark
params = {"customerId": 12345}
results = benchmark_query("MATCH (c:Customer {id: $customerId})-[:PURCHASED]->(p) RETURN p", params)
print(f"P50: {results['p50']:.2f}ms, P95: {results['p95']:.2f}ms, P99: {results['p99']:.2f}ms")

Step 5: Compare Alternatives

Run the same queries on PostgreSQL:

import psycopg2
import statistics
import time

conn = psycopg2.connect("dbname=ecommerce user=postgres")
cursor = conn.cursor()

def benchmark_sql(query, params, iterations=100):
    latencies = []

    # Warm up
    for _ in range(10):
        cursor.execute(query, params)
        cursor.fetchall()

    # Measure
    for _ in range(iterations):
        start = time.time()
        cursor.execute(query, params)
        cursor.fetchall()
        latencies.append((time.time() - start) * 1000)

    return {
        "p50": statistics.median(latencies),
        "p95": statistics.quantiles(latencies, n=20)[18],
        "mean": statistics.mean(latencies)
    }

# Same query in SQL
sql = """
    SELECT p.* FROM customers c
    JOIN purchases pur ON c.id = pur.customer_id
    JOIN products p ON pur.product_id = p.id
    WHERE c.id = %s
"""
results = benchmark_sql(sql, (12345,))
print(f"PostgreSQL P50: {results['p50']:.2f}ms, P95: {results['p95']:.2f}ms")

Step 6: Analyze and Report

Create a comparison table:

| Query                      | PostgreSQL P50 | Neo4j P50 | Speedup |
|----------------------------|----------------|-----------|---------|
| Find customer              | 2.1 ms         | 1.8 ms    | 1.2×    |
| Customer purchases (1-hop) | 8.5 ms         | 3.2 ms    | 2.7×    |
| Similar customers (2-hop)  | 145 ms         | 6.8 ms    | 21×     |
| Recommendations (3-hop)    | 2,800 ms       | 12 ms     | 233×    |

Visualize the results: - Graph showing latency vs hop count - Bar chart comparing databases - Line graph showing scalability (latency vs data size)
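
A minimal matplotlib sketch for the first of those plots, using the numbers from the comparison table earlier in this chapter:

# Plot latency vs hop count; a log y-axis makes the RDBMS cliff obvious.
import matplotlib.pyplot as plt

hops = [1, 2, 3, 4]
postgres_ms = [12, 185, 3400, 58000]
neo4j_ms = [5, 7, 11, 14]

plt.plot(hops, postgres_ms, marker="o", label="PostgreSQL")
plt.plot(hops, neo4j_ms, marker="o", label="Neo4j")
plt.yscale("log")
plt.xlabel("Hops")
plt.ylabel("Latency (ms, log scale)")
plt.title("Traversal latency vs hop count")
plt.legend()
plt.savefig("latency_vs_hops.png", dpi=150)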

Step 7: Share Your Findings

Write it up: - Blog post - GitHub repo with scripts - Internal tech doc

Include: - Methodology (how you tested) - Data model (schema) - Hardware specs - Database versions and configuration - Full results (not just cherry-picked wins) - Limitations (what you didn't test)

Be honest: If RDBMS won on some queries, say so. Credibility comes from honesty, not cherry-picking.

Key Takeaways

  1. Skepticism is good engineering: Question claims, test assumptions, measure reality
  2. Hop count predicts performance: Graph DBs scale linearly, RDBMS hits exponential cliff at 2-3 hops
  3. Degree matters: High-degree nodes (hubs) can slow queries—test both average and extreme cases
  4. Indexes are critical: Don't benchmark without appropriate indexes on both systems
  5. Use standard benchmarks: LDBC SNB and Graph 500 provide credible, comparable results
  6. Join operations are expensive: O(n log n) per hop with growing intermediate sets
  7. Traversal is constant-time: O(1) pointer lookups via index-free adjacency
  8. Measure distributions, not averages: P50, P95, P99 tell the real story
  9. Test at realistic scale: 1M node benchmarks don't predict 100M node performance
  10. Document everything: Reproducibility = credibility
  11. Build your reputation: Data-driven engineers earn respect
  12. Share your results: Transparency builds trust

The bottom line: Don't trust vendor claims. Don't trust this textbook. Run your own benchmarks. Measure your data, your queries, your workload. Make decisions based on evidence, not marketing.

When you walk into a meeting with benchmark results—your own data, reproducible methodology, honest reporting—people listen. That's how you build a reputation as an engineer who makes smart, data-driven decisions.

And that's worth more than any technology choice.


Remember: The best engineers aren't the ones who know the right answer. They're the ones who know how to find the right answer through rigorous testing and measurement. Go benchmark something.