Query Languages for Graph Databases
Summary
This chapter provides comprehensive coverage of graph query languages including OpenCypher, GSQL, and the emerging GQL standard. You'll master Cypher syntax for pattern matching, learn how to construct complex graph queries with match, where, and return clauses, and explore GSQL's map-reduce pattern for distributed query processing. The chapter emphasizes both declarative and imperative query approaches, query optimization techniques, and performance considerations for production graph applications.
Concepts Covered
This chapter covers the following 26 concepts from the learning graph:
- OpenCypher
- GSQL
- Statistical Query Tuning
- GQL
- Cypher Syntax
- Match Clause
- Where Clause
- Return Clause
- Create Statement
- Merge Statement
- Delete Statement
- Set Clause
- Graph Patterns
- Variable Length Paths
- Shortest Path
- All Paths
- Map-Reduce Pattern
- Accumulators
- Query Optimization
- Query Performance
- Query Latency
- Query Throughput
- Declarative Queries
- Imperative Queries
- Query Plans
- Shortest Path Algorithms
Prerequisites
This chapter builds on concepts from:
The Elephant in the Room (Or Should We Say, the AI in the Cloud?)
Let's address something right up front: By the time you're reading this, AI capabilities are doubling roughly every seven months. There's a decent chance that by the time you're actually working with graph databases professionally, you'll just describe what you want in plain English, and an AI will write the query for you. "Hey AI, find all customers who bought products similar to Alice's purchases in the last month." Done.
So why are we about to spend an entire chapter learning Cypher syntax, GSQL patterns, and query optimization techniques?
Here's the thing (and we're giving you a knowing wink here): Understanding what the code does is valuable even if you never write it yourself. When the AI generates a query that returns 10 million nodes instead of the 10 you expected, you'll want to know why. When a query that should take milliseconds is taking minutes, you'll need to spot the problem. When you're reviewing what the AI suggested and something looks... off... you'll want the knowledge to catch it.
Think of it like learning to drive even though self-driving cars exist. Sure, the car might handle 99% of the driving, but you still want to know what's happening when you press the brake, right?
So yes, AI might write most of your graph queries in the future. But this chapter will teach you to read them, understand them, debug them, and—when necessary—write them yourself. Consider it "AI literacy for graph databases."
Ready? Let's dive into graph query languages. And remember: every time you think "I'll never write this manually," imagine your future self saying "Thank goodness I learned this" when the AI suggests a query that would accidentally delete your entire production database. (We're joking. Mostly.)
Query Languages: The Big Three (and the Future)
Before we dive into syntax, let's survey the landscape. There are three query languages you should know about: two in wide production use today, and a third that's emerging as a standard.
OpenCypher: The People's Champion
OpenCypher is the most popular graph query language, originally developed by Neo4j and then open-sourced. It's declarative (you describe what you want, not how to get it), highly readable, and looks a bit like ASCII art of graphs.
Why it's popular:
- Visual syntax: (alice:Person)-[:FRIEND_OF]->(bob) literally looks like a graph
- Declarative: Focus on the pattern you want, not the algorithm to find it
- Wide adoption: Neo4j, Amazon Neptune, Memgraph, RedisGraph, and many others
- Active community: Lots of resources, tutorials, and Stack Overflow answers
Example:
Even if you've never seen Cypher before, you can probably guess what this does: Find Alice, find her friends, return their names and ages sorted by age.
GSQL: The Distributed Powerhouse
GSQL (Graph SQL) is TigerGraph's query language, designed for massive-scale distributed graph processing. While Cypher is declarative, GSQL blends declarative and imperative styles, giving you fine-grained control over execution.
Why it matters:
- Imperative control: You can specify exactly how to process the graph
- Map-reduce pattern: Built for distributed computation across clusters
- Accumulators: Powerful constructs for aggregating data during traversal
- Performance tuning: Fine-grained control for optimizing complex queries
Example:
GSQL looks more like traditional programming—you define variables, specify control flow, and manage execution explicitly.
GQL: The Emerging Standard
GQL (Graph Query Language) is the ISO standard for graph queries, published as ISO/IEC 39075 in April 2024 by the same committee that maintains SQL. Think of it as "SQL for graphs."
Why you should care (eventually):
- ISO standard: Official international standard, like SQL
- Industry consensus: Major vendors collaborating on unified syntax
- Future-proof: Learning GQL means learning the future lingua franca of graphs
- SQL familiarity: Designed to feel familiar to SQL developers
Current status: Still emerging (as of 2025). Neo4j, Oracle, and other vendors are implementing support, but it's not yet as mature as Cypher or GSQL.
What it looks like:
Familiar, right? It's intentionally similar to Cypher but with SQL-like syntax elements.
Which One Should You Learn?
Short answer: Start with Cypher. It's the most widely used, has the best tutorials, and concepts transfer easily to other languages.
Long answer: Cypher will teach you graph thinking. GSQL will teach you performance optimization. GQL will prepare you for the future. Ideally, know enough Cypher to read and write basic queries, understand GSQL concepts for distributed systems, and keep an eye on GQL for standardization.
And remember: The AI will probably write queries in whatever language your database supports. Your job is to understand what it wrote. 😉
Cypher Syntax: The ASCII Art of Graph Queries
Let's dive deep into Cypher, the most popular graph query language. We'll cover enough that when an AI (or a colleague, or your future self) writes a Cypher query, you'll know exactly what's happening.
The Core Philosophy: Drawing Graphs with Text
Cypher's genius is visual syntax. Compare these:
What you're thinking:
What you write:
See the similarity? Nodes in parentheses (), relationships in brackets [], arrows showing direction ->. It's ASCII art that happens to be executable code.
Nodes: The Parentheses Pattern
Nodes are always wrapped in parentheses:
Breaking down the anatomy:
- (variable:Label {property: value})
- Variable (optional): Lets you refer to the node later in the query
- Label (optional): The type/category of node
- Properties (optional): Key-value pairs to match or filter
Relationships: The Bracket and Arrow Pattern
Relationships use brackets and arrows:
Direction matters (usually):
- (alice)-[:PURCHASED]->(product) - Alice purchased product ✅
- (product)-[:PURCHASED]->(alice) - Product purchased Alice? ❌ (semantically weird)
But you can traverse backwards:
- (alice)<-[:PURCHASED]-(product) - Products purchased by Alice (same data, viewed backwards)
The Five Essential Clauses
Cypher queries are built from clauses, like SQL. Here are the five you'll use constantly:
- MATCH - Find patterns in the graph
- WHERE - Filter results
- RETURN - Specify what to output
- CREATE - Add new data
- DELETE - Remove data
Let's explore each in detail.
MATCH Clause: Finding Patterns
The MATCH clause is the heart of Cypher queries. It describes a pattern you want to find in the graph.
Simple matching:
Multi-hop matching:
Multiple patterns:
Optional patterns:
The abstract concept: MATCH is declarative pattern matching. You describe the shape you want, and the query engine finds all instances of that shape in your graph.
The practical reality: When you tell an AI "find Alice's friends," it writes a MATCH clause. When the query takes too long, you'll look at the MATCH clause to see if it's searching too broadly.
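To make "describe the shape, let the engine find it" a bit more concrete, here is a minimal Python sketch of pattern matching over a toy graph. It is not Cypher and not how a real query engine works internally (real engines use indexes and planners); the node data, the `relationships` list, and the `match_pattern` helper are all made up for illustration.

```python
# Toy graph: nodes keyed by id, relationships as (start, type, end) triples.
# All names and data here are hypothetical.
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
    4: {"label": "Product", "name": "Laptop"},
}
relationships = [
    (1, "FRIEND_OF", 2),
    (1, "FRIEND_OF", 3),
    (1, "PURCHASED", 4),
    (2, "FRIEND_OF", 3),
]


def match_pattern(start_props, rel_type, end_label):
    """Return (start name, end name) pairs fitting (start)-[:rel_type]->(end:end_label)."""
    results = []
    for start_id, r_type, end_id in relationships:
        start, end = nodes[start_id], nodes[end_id]
        if r_type != rel_type or end["label"] != end_label:
            continue
        # Every property given for the start node must match exactly.
        if all(start.get(key) == value for key, value in start_props.items()):
            results.append((start["name"], end["name"]))
    return results


# Roughly the shape (a:Person {name: 'Alice'})-[:FRIEND_OF]->(b:Person)
alice_friends = match_pattern({"label": "Person", "name": "Alice"}, "FRIEND_OF", "Person")
print(alice_friends)  # [('Alice', 'Bob'), ('Alice', 'Carol')]
```

Notice that the caller only states the pattern. The loop that checks every relationship is the "how," and in a real database that part belongs to the query engine.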
WHERE Clause: Filtering Results
The WHERE clause filters matches based on conditions, just like SQL.
Property filtering:
Relationship filtering:
Null checking:
List operations:
Pattern-based filtering:
Pro tip: You can often put properties directly in MATCH, as in (p:Person {age: 30}), instead of writing WHERE p.age = 30. They're equivalent, but WHERE is more flexible for complex conditions.
RETURN Clause: Shaping Output
The RETURN clause specifies what data you want back from the query.
Basic returns:
Returning relationships:
Computed values:
Aggregations:
Sorting and limiting:
DISTINCT results:
CREATE Statement: Adding Data
The CREATE statement adds new nodes and relationships to the graph.
Creating nodes:
Creating relationships:
Creating patterns:
Returning created data:
Important warning: CREATE always creates new nodes/relationships, even if they already exist. If you run the same CREATE statement twice, you'll get duplicates. That's where MERGE comes in...
MERGE Statement: Create or Match
The MERGE statement is like "create if it doesn't exist, otherwise match." It's idempotent—running it multiple times has the same effect as running it once.
Basic MERGE:
MERGE with ON CREATE:
MERGE with ON MATCH:
MERGE with both:
MERGE relationships:
Why MERGE matters: When loading data from external sources, you often don't know if nodes already exist. MERGE handles this gracefully—no duplicates, no errors.
When the AI uses MERGE: If you ask an AI to "make sure Alice is friends with Bob," it should use MERGE, not CREATE. If it uses CREATE, you might end up with 50 duplicate FRIEND_OF relationships. Now you know to spot that!
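The difference between "always add" and "add only if missing" can be sketched in a few lines of Python. This is only a model of the behavior described above, with a plain list standing in for the database; `create_rel` and `merge_rel` are hypothetical helper names, not part of any driver or API.

```python
# A relationship store modeled as a plain list of (start, type, end) tuples.
friendships = []


def create_rel(store, start, rel_type, end):
    """CREATE-like behavior: always append, even if an identical entry exists."""
    store.append((start, rel_type, end))


def merge_rel(store, start, rel_type, end):
    """MERGE-like behavior: append only when no identical entry exists."""
    entry = (start, rel_type, end)
    if entry not in store:
        store.append(entry)


# Run the CREATE-like call three times: three copies pile up.
for _ in range(3):
    create_rel(friendships, "Alice", "FRIEND_OF", "Bob")
created_count = len(friendships)

# Start over and run the MERGE-like call three times: still one relationship.
friendships.clear()
for _ in range(3):
    merge_rel(friendships, "Alice", "FRIEND_OF", "Bob")
merged_count = len(friendships)

print(created_count, merged_count)  # 3 1
```

That second loop is what "idempotent" means in practice: repeating the operation leaves the data in the same state as running it once.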
SET Clause: Updating Properties
The SET clause modifies properties on existing nodes and relationships.
Setting properties:
Adding labels:
Copying properties:
Conditional updates:
Updating relationship properties:
DELETE Statement: Removing Data
The DELETE statement removes nodes and relationships from the graph.
Deleting relationships:
Deleting nodes:
Conditional deletion:
Deleting all data (use with extreme caution!):
Why DETACH DELETE exists: In graph databases, you can't have relationships pointing to non-existent nodes. If you try to DELETE a node that has relationships, the database will throw an error. DETACH DELETE removes the relationships first, then the node.
When the AI might get this wrong: If the AI tries to DELETE a node without DETACH, the query will fail. Now you'll know to add DETACH. (See? Understanding syntax helps even when AI writes code!)
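Here is a small Python model of the rule described above: a plain delete refuses to leave dangling relationships, while a detach-style delete removes them first. `ToyGraph` and its methods are invented for this sketch and only mimic the behavior; they are not a real database API.

```python
class ToyGraph:
    """In-memory stand-in for a graph store (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = set()
        self.rels = set()  # (start, type, end) triples

    def delete(self, node):
        """DELETE-like: refuse if any relationship still touches the node."""
        if any(node in (start, end) for start, _type, end in self.rels):
            raise ValueError(f"{node} still has relationships")
        self.nodes.discard(node)

    def detach_delete(self, node):
        """DETACH DELETE-like: drop the node's relationships, then the node."""
        self.rels = {
            (start, rel_type, end)
            for start, rel_type, end in self.rels
            if node not in (start, end)
        }
        self.nodes.discard(node)


g = ToyGraph()
g.nodes.update({"Alice", "Bob"})
g.rels.add(("Alice", "FRIEND_OF", "Bob"))

try:
    g.delete("Alice")
    plain_delete_failed = False
except ValueError:
    plain_delete_failed = True  # Alice still has a FRIEND_OF relationship

g.detach_delete("Alice")  # succeeds: the relationship goes first, then the node
```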
Graph Patterns: The Power of Structure
Graph patterns are the core of Cypher queries—they describe shapes you want to find in your graph.
Basic patterns:
Pattern with multiple relationship types:
Patterns with constraints:
Anti-patterns (patterns that should NOT exist):
Why patterns matter: Patterns let you express complex graph queries concisely. Finding triangles (3-way relationships) in SQL would require multiple self-joins. In Cypher, it's one visual pattern.
Variable-Length Paths: Following the Rabbit Hole
Variable-length paths let you traverse relationships without knowing how many hops you need.
Basic syntax:
Real examples:
Why variable-length paths are powerful: They let you explore network effects, influence propagation, recommendation chains, and supply chain dependencies without knowing the exact number of hops beforehand.
Performance warning: Variable-length paths can be expensive. [:FRIEND_OF*] with no upper limit might traverse millions of relationships. Always set an upper bound (*1..5) unless you have a very good reason not to.
When the AI might mess this up: If the AI writes [:FRIEND_OF*] without a limit on a large graph, the query might run forever. Understanding this helps you spot the issue.
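To see why the upper bound matters, here is a rough Python sketch of a hop-limited traversal. One simplification to flag: a Cypher variable-length pattern returns paths, whereas this sketch only tracks which distinct nodes are reachable and in how many hops. The `reachable_within` function and the `chain` data are hypothetical.

```python
from collections import deque


def reachable_within(adjacency, start, max_hops):
    """Return {node: hop_count} for nodes reachable in 1..max_hops hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # the upper bound stops the expansion here
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # the start node itself is not a result
    return seen


chain = {"Alice": ["Bob"], "Bob": ["Carol"], "Carol": ["Dave"], "Dave": ["Eve"]}

print(reachable_within(chain, "Alice", 2))   # {'Bob': 1, 'Carol': 2}
print(reachable_within(chain, "Alice", 10))  # reaches Dave and Eve as well
```

With a small bound the traversal stops after two layers. Without one, the only thing that ends the loop is running out of graph.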
Shortest Path: Finding the Quickest Route
Shortest path finds the minimal-hop route between two nodes.
Basic shortest path:
All shortest paths:
Shortest path with relationship type filter:
Shortest path with max length:
Real-world use cases:
- Social networks: Six degrees of separation, connection suggestions
- Supply chains: Find fastest route from manufacturer to customer
- Network routing: Shortest path between network nodes
- Knowledge graphs: How are two concepts related?
The abstract concept: Shortest path algorithms (Dijkstra's, BFS) find minimal-cost routes through graphs. Cypher abstracts this complexity into simple syntax.
All Paths: When You Need Every Route
All paths returns every possible route between two nodes (use carefully—can be huge!).
Basic syntax:
Filtered paths:
Why you'd want all paths:
- Redundancy analysis: How many ways can information flow?
- Risk assessment: If one connection fails, what are the alternatives?
- Network analysis: Understanding structural properties of graphs
Why all paths is dangerous: On a densely connected graph, the number of paths can grow exponentially. ALWAYS use LIMIT and max-length constraints.
When the AI might overuse this: If you ask "how is Alice related to Bob," a naive AI might use all paths, returning millions of results. Shortest path is usually better.
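The growth in path counts is easy to underestimate, so here is a small Python experiment. It enumerates every path that repeats no node between two nodes of a fully connected graph. That is a simplification (Cypher's rule is that a path may not reuse a relationship), and `all_simple_paths` and `complete_graph` are helper names invented for this sketch.

```python
def all_simple_paths(adjacency, source, target, max_len):
    """Yield each path from source to target with no repeated node and at most max_len hops."""

    def walk(path):
        node = path[-1]
        if node == target:
            yield list(path)
            return
        if len(path) - 1 == max_len:
            return  # hop budget used up
        for neighbor in adjacency[node]:
            if neighbor not in path:
                path.append(neighbor)
                yield from walk(path)
                path.pop()

    yield from walk([source])


def complete_graph(n):
    """Every node connected to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


counts = {}
for n in (4, 6, 8):
    graph = complete_graph(n)
    counts[n] = sum(1 for _ in all_simple_paths(graph, 0, n - 1, max_len=n))

print(counts)  # {4: 5, 6: 65, 8: 1957}

# A max-length constraint keeps the result small: only the direct hop
# and the two-hop detours survive.
bounded = sum(1 for _ in all_simple_paths(complete_graph(8), 0, 7, max_len=2))
print(bounded)  # 7
```

Going from 4 to 8 nodes takes the count from 5 to 1,957, and real graphs have far more than 8 nodes. The bounded run shows what a max-length constraint buys you.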
Declarative vs. Imperative Queries
One of the key concepts in query languages is the difference between declarative and imperative approaches.
Declarative queries (Cypher's style):
- You describe what you want, not how to find it
- The database figures out the optimal execution plan
- Easier to write, harder to optimize manually
Example:
You didn't specify:
- Which index to use
- Which node to start from
- What traversal algorithm to use
- How to filter results
The query planner handles all that.
Imperative queries (GSQL's style):
- You specify how to execute the query step-by-step
- More control over execution, more code complexity
- Useful for optimizing complex queries on massive graphs
Example (GSQL):
Here you explicitly:
- Define accumulators (@@count)
- Specify traversal start (Start = {alice})
- Control execution flow
Which is better? Neither. They're different tools for different jobs:
- Declarative for most queries, rapid development, standard use cases
- Imperative for performance-critical queries, complex aggregations, distributed processing
AI implication: Most AI systems will generate declarative queries (Cypher) because they're simpler and more portable. If you need imperative control (GSQL), you might need to guide the AI more specifically.
GSQL and the Map-Reduce Pattern
Let's talk about GSQL and why TigerGraph designed a different approach.
Why GSQL Exists
Cypher is great for small-to-medium graphs (millions of nodes). But when you hit billions of nodes and trillions of relationships across distributed clusters, declarative queries can struggle with optimization. GSQL was designed for this scale.
GSQL's map-reduce pattern processes graphs in stages:
- Map: Transform each vertex/edge
- Reduce: Aggregate results
- Repeat: Iterate until convergence
This mirrors big data processing frameworks (Hadoop MapReduce, Spark), but optimized for graphs.
Accumulators: GSQL's Secret Weapon
Accumulators are variables that collect data during graph traversal.
Types of accumulators:
- SumAccum<INT> - Sum integers
- AvgAccum - Calculate averages
- MaxAccum / MinAccum - Track max/min values
- ListAccum<STRING> - Collect lists
- SetAccum<VERTEX> - Collect unique vertices
Example:
Why accumulators matter: They let you aggregate data during traversal, not after. This is much faster on distributed systems because you're not shuffling data across the network.
Cypher equivalent (less efficient at scale):
Both produce the same result, but on a 10-billion-edge graph, GSQL's accumulator approach can be orders of magnitude faster.
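The idea of aggregating during traversal can be modeled in plain Python. This is only a conceptual sketch, not GSQL and not TigerGraph's implementation: the `SumAccum` class below is a toy that borrows the name, and the two "partitions" stand in for edges stored on two different machines.

```python
class SumAccum:
    """Toy stand-in for a sum accumulator (hypothetical, for illustration)."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount

    def merge(self, other):
        self.value += other.value


# Hypothetical edge partitions, one per worker: (customer, product, price)
partitions = [
    [("Alice", "Laptop", 1200), ("Alice", "Mouse", 25)],
    [("Bob", "Keyboard", 80), ("Alice", "Monitor", 300)],
]


def total_spent(customer):
    # "Map" step: each worker accumulates locally while it walks its own edges.
    partials = []
    for edges in partitions:
        local = SumAccum()
        for buyer, _product, price in edges:
            if buyer == customer:
                local.add(price)
        partials.append(local)

    # "Reduce" step: only one small partial result per worker is combined.
    total = SumAccum()
    for partial in partials:
        total.merge(partial)
    return total.value


print(total_spent("Alice"))  # 1525
```

The point of the design is in the reduce step: each worker ships a single number instead of every matching edge.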
When you'd use GSQL:
- Billion+ node graphs
- Distributed processing across clusters
- Complex multi-hop aggregations
- Graph algorithms (PageRank, community detection, centrality)
- Real-time fraud detection at scale
When you'd stick with Cypher:
- Graphs under 100 million nodes
- Standard CRUD operations
- Rapid development
- Team familiarity with declarative SQL-like syntax
Query Optimization: Making Queries Fast
Understanding query optimization helps you read query plans and spot performance issues.
How Query Planners Work
When you write a Cypher query, the database doesn't execute it literally. It:
- Parses the query into an abstract syntax tree
- Optimizes by rewriting into equivalent but faster forms
- Generates a query plan specifying execution order
- Executes the plan
View the query plan:
This shows you what the database will do without actually running the query.
Analyze actual execution:
PROFILE runs the query and shows actual row counts, execution time per step.
Common Optimizations
1. Index usage:
2. Filter early:
3. Avoid Cartesian products:
4. Use LIMIT when exploring:
Query Performance Metrics
Query latency: How long does one query take?
- Good: < 100ms
- Acceptable: 100ms - 1s
- Slow: > 1s
- Fix it: > 10s
Query throughput: How many queries per second can the system handle?
- Measure with: Queries per second (QPS)
- Affected by: Concurrency, caching, index quality, hardware
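If you have a list of measured query times, both metrics fall out of a few lines of arithmetic. The sketch below assumes you already collected the timings somehow (the sample numbers are invented) and uses a simple nearest-rank percentile; `summarize` is a hypothetical helper, not part of any database tooling.

```python
import math


def summarize(latencies_ms, wall_clock_seconds):
    """Summarize per-query latencies and overall throughput for one test run."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "max_ms": ordered[-1],
        "qps": len(ordered) / wall_clock_seconds,
    }


# Ten made-up query timings (milliseconds) collected over a 2-second window.
sample = [12, 15, 11, 240, 14, 13, 16, 18, 12, 1500]
report = summarize(sample, wall_clock_seconds=2.0)
print(report)  # {'p50_ms': 14, 'p95_ms': 1500, 'max_ms': 1500, 'qps': 5.0}
```

The median looks healthy here while the tail sits in "Slow" territory, which is why it helps to look at high percentiles and not just an average.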
Statistical Query Tuning: Use PROFILE to identify bottlenecks:
Look for:
- High db hits: Operations scanning too many nodes/relationships
- Large row counts: Intermediate results that should be filtered earlier
- Missing index usage: Scans instead of index seeks
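The "large row counts" symptom is easier to recognize once you have seen it in miniature. This Python sketch builds the same result two ways and counts the intermediate rows each approach produces. The data is synthetic and the two functions are hypothetical; a real planner often reorders filters for you, so treat this as a picture of what the row counts in a profile mean.

```python
# 100 synthetic people, one in ten living in Seattle, each with 5 friends.
people = [
    {"name": f"P{i}", "city": "Seattle" if i % 10 == 0 else "Other"}
    for i in range(100)
]
friends_of = {p["name"]: [f"F{j}" for j in range(5)] for p in people}


def filter_late():
    """Expand every person's friends first, then filter by city."""
    rows = [(p, f) for p in people for f in friends_of[p["name"]]]
    kept = [(p, f) for p, f in rows if p["city"] == "Seattle"]
    return len(rows), len(kept)


def filter_early():
    """Filter by city first, then expand only the survivors."""
    seattle = [p for p in people if p["city"] == "Seattle"]
    rows = [(p, f) for p in seattle for f in friends_of[p["name"]]]
    return len(rows), len(rows)


print(filter_late())   # (500, 50)  -> 500 intermediate rows for 50 results
print(filter_early())  # (50, 50)   -> 50 intermediate rows for 50 results
```

Same answer, ten times less intermediate work. In a profile, that difference shows up as a big row count on an early step that shrinks later.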
Shortest Path Algorithms: Under the Hood
You've used shortestPath() in Cypher, but what's actually happening?
Breadth-First Search (BFS)
How it works:
1. Start at source node
2. Explore all neighbors (1-hop away)
3. Explore all neighbors' neighbors (2-hops away)
4. Continue until target found
Why it finds shortest paths: BFS explores layer by layer, so the first time it reaches the target, it has found a shortest path (for unweighted graphs).
Cypher uses BFS for:
Time complexity: O(V + E) where V = vertices, E = edges
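The layer-by-layer idea fits in about twenty lines of Python. This is a textbook BFS over an adjacency dictionary, offered as a sketch of the algorithm rather than a description of any database's internals; the `friends` data is made up.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return one fewest-hops path from source to target, or None if unreachable."""
    if source == target:
        return [source]
    parent = {source: None}  # doubles as the "visited" set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == target:
                # Walk parent links back to the source, then reverse.
                path = [target]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice", "Dave"],
    "Dave": ["Bob", "Carol", "Eve"],
    "Eve": ["Dave"],
}

print(bfs_shortest_path(friends, "Alice", "Eve"))  # ['Alice', 'Bob', 'Dave', 'Eve']
```

Each node enters the queue at most once and each edge is examined a bounded number of times, which is where the O(V + E) bound comes from.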
Dijkstra's Algorithm
How it works:
1. Assign tentative distances to all nodes (infinity, except source = 0)
2. Visit unvisited node with smallest distance
3. Update distances to neighbors
4. Repeat until target reached
When you'd use it: Weighted graphs where relationships have costs.
Example:
Time complexity: O((V + E) log V) with priority queue
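Here is a compact Python version of those steps using the standard library's `heapq` as the priority queue. It is a generic sketch of the algorithm, assuming non-negative costs; the `roads` graph and its costs are invented for the example.

```python
import heapq


def dijkstra(graph, source, target):
    """Return the lowest total cost from source to target, or None if unreachable.

    graph maps node -> list of (neighbor, cost) pairs; costs must be non-negative.
    """
    dist = {source: 0}       # best known cost to each node (missing = infinity)
    heap = [(0, source)]     # priority queue ordered by tentative cost
    visited = set()
    while heap:
        cost_so_far, node = heapq.heappop(heap)
        if node in visited:
            continue         # stale queue entry; a cheaper one was already handled
        visited.add(node)
        if node == target:
            return cost_so_far
        for neighbor, cost in graph.get(node, []):
            new_cost = cost_so_far + cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None


roads = {
    "Warehouse": [("HubA", 4), ("HubB", 1)],
    "HubB": [("HubA", 2), ("Store", 7)],
    "HubA": [("Store", 3)],
}

print(dijkstra(roads, "Warehouse", "Store"))  # 6  (Warehouse -> HubB -> HubA -> Store)
```

The direct-looking route through HubA costs 7, but the detour through HubB costs 6. BFS would not notice, because it counts hops rather than costs.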
A* Algorithm
How it works: Like Dijkstra, but uses a heuristic (estimated cost to goal) to explore promising paths first.
When you'd use it: Spatial graphs (geographic networks, routing) where you have coordinate data to estimate distances.
Example use case: Finding shortest driving route on road network graph.
Time complexity: Depends on heuristic quality, often much faster than Dijkstra in practice
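A small Python sketch shows how the heuristic slots into the same priority-queue loop. It assumes every node has (x, y) coordinates and that no edge costs less than the straight-line distance between its endpoints, so the straight-line estimate never overestimates. The `a_star` function, the coordinates, and the costs are all made up for illustration.

```python
import heapq


def a_star(neighbors, coords, source, target):
    """Return the lowest path cost from source to target, or None if unreachable."""

    def estimate(node):
        # Straight-line distance to the goal: the heuristic.
        (x1, y1), (x2, y2) = coords[node], coords[target]
        return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

    best = {source: 0.0}                 # cheapest known cost to reach each node
    heap = [(estimate(source), source)]  # ordered by cost so far + estimate
    while heap:
        _priority, node = heapq.heappop(heap)
        if node == target:
            return best[node]
        for nxt, cost in neighbors.get(node, []):
            candidate = best[node] + cost
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate + estimate(nxt), nxt))
    return None


coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
road_costs = {
    "A": [("B", 1.0), ("C", 1.5)],
    "B": [("D", 2.0)],
    "C": [("D", 1.0)],
}

print(a_star(road_costs, coords, "A", "D"))  # 2.5  (A -> C -> D)
```

The only change from Dijkstra is the priority: cost so far plus the estimate, which pulls the search toward the goal.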
Why You Care
When the AI generates:
You now know it's running BFS. If the query is slow, you know:
- BFS is O(V + E), so it might be traversing millions of relationships
- You could limit max hops: shortestPath((a)-[:FRIEND_OF*..6]-(b))
- Or you could check if an index on Person.name exists to find a and b quickly
See? Understanding algorithms helps you debug AI-generated code!
Query Plans: Reading the Execution Blueprint
Query plans show exactly how the database will execute your query.
Get the plan without running:
Typical plan operations:
| Operation | What It Does | Performance |
|---|---|---|
| NodeByLabelScan | Scan all nodes with a label | Slow (O(n)) |
| NodeIndexSeek | Use index to find nodes | Fast (O(log n)) |
| Expand(All) | Follow all relationships | O(degree) per node |
| Filter | Apply WHERE conditions | Depends on selectivity |
| Sort | Order results | O(n log n) |
| Limit | Take first N results | Fast |
| Distinct | Remove duplicates | O(n) |
Reading a plan:
Interpretation:
1. NodeIndexSeek (bottom): Found 1 Alice node using index (2 db hits)
2. Expand(All): Followed FRIEND_OF edges, found 45 friends (90 db hits)
3. Filter: Checked age > 30, kept 12 results (45 db hits)
4. Sort: Sorted 12 results by name (24 db hits)
5. ProduceResults (top): Returned 12 rows
Red flags to look for:
- ❌ NodeByLabelScan when you expected NodeIndexSeek (missing index!)
- ❌ Huge row counts early in plan that filter down later (filter earlier!)
- ❌ CartesianProduct (accidental cross join)
- ❌ High DB hits relative to rows returned (inefficient access pattern)
When the AI's query is slow: Look at the plan. You might see it's doing a label scan instead of an index seek, meaning you need to create an index. Or it's expanding too broadly before filtering. Understanding plans = debugging superpowers.
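The gap between a label scan and an index seek is easy to feel with a toy example. In this Python sketch a list walk plays the role of the scan and a dictionary lookup plays the role of the index. Two caveats: a Python dict is a hash table, not the tree-style index a database typically uses, and the "hits" counter is only a loose analogy for db hits. All names and data are hypothetical.

```python
# 10,000 synthetic Person records; one of them is Alice.
people = [{"id": i, "name": f"person-{i}"} for i in range(10_000)]
people[7_321]["name"] = "Alice"


def label_scan(name):
    """Check every record until a match turns up (like NodeByLabelScan + Filter)."""
    hits = 0
    for person in people:
        hits += 1
        if person["name"] == name:
            return person, hits
    return None, hits


# Building the index costs one pass up front, much like creating a database index.
name_index = {person["name"]: person for person in people}


def index_seek(name):
    """Jump straight to the record through the index (like NodeIndexSeek)."""
    return name_index.get(name), 1


print(label_scan("Alice")[1])  # 7322 records touched
print(index_seek("Alice")[1])  # 1 lookup
```

Both functions return the same record. Only the amount of work differs, and that difference is what the plan is showing you.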
Bringing It All Together: A Realistic Example
Let's build a complete query using everything we've learned. Imagine you're building a social network feature: "People you may know."
Requirements:
- Find people who are friends with your friends (2-hop)
- But exclude people you're already friends with
- Prioritize people in the same city
- Show people with mutual friends count
- Limit to top 10 suggestions
The query:
Breaking it down:
1. MATCH: Find 2-hop friends
2. WHERE: Filter out existing friends and self
3. WITH: Aggregate mutual friends count, prepare for sorting
4. ORDER BY: Same city first, then by mutual friend count
5. RETURN: Format output nicely
6. LIMIT: Top 10 suggestions
What an AI might get wrong:
- Forgetting suggestion <> me (suggesting yourself)
- Not using DISTINCT in count (counting same mutual friend multiple times if multiple paths exist)
- Inefficient pattern (could use variable-length path with max 2 hops for clarity)
Optimized version:
Performance considerations:
- Index on Person.name makes finding Alice fast
- Filtering NOT (me)-[:FRIEND_OF]-(suggestion) happens during traversal (efficient)
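- LIMIT 10 caps the output at ten rows; because the query aggregates and sorts before limiting, every candidate is still examined, though the planner can typically keep a running top 10 instead of sorting the full result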
- DISTINCT in count prevents duplicate counting
Now you can read this query, understand it, optimize it, and explain it—even if an AI wrote it.
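If it helps to see the same logic outside of query syntax, here is the "people you may know" recipe as a short Python sketch over a toy friendship map. It follows the requirements listed earlier (2-hop, exclude existing friends and yourself, same city first, then mutual friend count). The data and the `suggest` function are invented, and the name-based tie-break is an extra assumption added to keep the output stable.

```python
# Toy undirected friendships and home cities (hypothetical data).
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave", "Eve"},
    "Carol": {"Alice", "Dave"},
    "Dave": {"Bob", "Carol"},
    "Eve": {"Bob"},
}
city = {
    "Alice": "Seattle",
    "Bob": "Seattle",
    "Carol": "Portland",
    "Dave": "Portland",
    "Eve": "Seattle",
}


def suggest(me, limit=10):
    """Return (name, mutual_friend_count, same_city) tuples, best suggestions first."""
    mutual = {}
    for friend in friends[me]:
        for candidate in friends[friend]:
            # Skip yourself and people you already know.
            if candidate == me or candidate in friends[me]:
                continue
            # A set counts each mutual friend once, like count(DISTINCT ...).
            mutual.setdefault(candidate, set()).add(friend)

    ranked = sorted(
        mutual,
        key=lambda c: (city[c] != city[me], -len(mutual[c]), c),
    )
    return [(c, len(mutual[c]), city[c] == city[me]) for c in ranked[:limit]]


print(suggest("Alice"))  # [('Eve', 1, True), ('Dave', 2, False)]
```

Eve ranks above Dave despite having fewer mutual friends because she shares Alice's city. Whether that ordering is what the product actually wants is exactly the kind of question worth asking when reviewing a generated query.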
The Bottom Line: Why Learning This Matters
We opened this chapter with a wink: "AI will probably write your queries." And that's likely true. But here's what we've learned:
- Reading code is a superpower: When the AI generates a query, you can understand what it does
- Debugging is essential: When queries fail or are slow, you can spot the issue
- Optimization requires knowledge: You can't tune what you don't understand
- Communication improves: "Use MERGE, not CREATE" is faster than explaining duplicates
- Trust but verify: You can review AI-generated code for correctness and efficiency
Think of this chapter as query language literacy. You might not write Cypher from scratch every day, but you'll read it, review it, debug it, and optimize it. And when the AI suggests something that looks wrong, you'll have the knowledge to catch it.
Final thought: AI capabilities are doubling roughly every seven months, yes. But so is the amount of data we're storing and querying. The problems are growing as fast as the solutions. Understanding graph query languages isn't about whether AI can write them; it's about understanding what needs to be written, why it works (or doesn't), and how to make it better.
Plus, honestly? There's something deeply satisfying about reading a complex Cypher query and thinking, "Yeah, I know exactly what that does." That's worth learning, AI or no AI. 😊
Key Takeaways
- OpenCypher is the people's champion: Most popular, visual syntax, widely adopted
- GSQL is for scale: Map-reduce patterns and accumulators for billion-node graphs
- GQL is the future: ISO standard emerging, SQL-like, future-proof
- MATCH finds patterns: Declarative pattern matching is Cypher's superpower
- WHERE filters, RETURN shapes: Basic clauses you'll see everywhere
- CREATE adds, MERGE upserts: CREATE makes duplicates, MERGE doesn't
- Variable-length paths are powerful but dangerous: Always set max length
- Shortest path uses BFS: Understanding algorithms helps debug performance
- Query plans reveal execution: EXPLAIN and PROFILE are your debugging friends
- Declarative vs imperative: Different tools for different scales
- Accumulators enable distributed aggregation: GSQL's secret sauce for massive graphs
- Optimization matters: Indexes, early filtering, avoiding Cartesian products
- AI will write queries, but you need to read them: Literacy > authorship
Now go forth and read graph queries with confidence! And when the AI inevitably tries to use CREATE where it should use MERGE, you'll catch it. 😉
Remember: The best code is code you understand—whether you wrote it or an AI did. This chapter gave you the tools to understand graph query languages. Use them wisely (and when the AI messes up, you know we told you so!)