Graph Database Query Optimization: Patterns That Actually Matter
Graph databases promise to make relationship queries that are slow in relational databases fast and natural. This is true, but only when you write queries that align with how graph databases actually work. Bad graph queries can be slower than equivalent relational queries, and the patterns that kill performance aren’t always obvious.
The Cartesian Product Problem
The most common performance killer is accidentally creating Cartesian products in your traversals. This happens when you match multiple patterns without connecting them through shared nodes or relationships.
Consider finding products and customers for analysis:
MATCH (p:Product)
MATCH (c:Customer)
WHERE p.category = 'electronics'
AND c.region = 'APAC'
RETURN p, c
This looks reasonable but is disastrous. You’re matching every electronics product against every APAC customer without any relationship between them. If you have 10,000 products and 50,000 customers, you’re generating 500 million rows.
The fix is ensuring your patterns connect:
MATCH (p:Product)<-[:PURCHASED]-(c:Customer)
WHERE p.category = 'electronics'
AND c.region = 'APAC'
RETURN p, c
Now you’re only returning products actually purchased by APAC customers, which is probably what you meant in the first place.
Variable Length Path Explosions
Variable length paths are powerful but dangerous. They let you find paths of any length between nodes, which sounds great until you realize how many possible paths exist in a dense graph.
Finding all paths of any length between two nodes:
MATCH path = (a:Person {name: 'Alice'})-[*]-(b:Person {name: 'Bob'})
RETURN path
In a social network where people average 200 connections, this query might evaluate trillions of possible paths. The database will struggle or time out.
Limiting path length helps:
MATCH path = (a:Person {name: 'Alice'})-[*..4]-(b:Person {name: 'Bob'})
RETURN path
Limiting to paths of four hops or fewer dramatically reduces the search space. For many use cases, paths beyond a certain length aren’t useful anyway.
Being more specific about relationship types helps further:
MATCH path = (a:Person {name: 'Alice'})-[:KNOWS|:WORKS_WITH*..4]-(b:Person {name: 'Bob'})
RETURN path
Now you’re only traversing specific relationship types, eliminating large portions of the graph from consideration.
Missing or Ineffective Indexes
Graph databases still benefit enormously from indexes on frequently queried properties. Starting a query by scanning all nodes of a type without an index is slow.
If you frequently query by email:
MATCH (u:User {email: '[email protected]'})
RETURN u
You need an index on User.email:
CREATE INDEX user_email FOR (u:User) ON (u.email)
Composite indexes help when you filter on multiple properties:
CREATE INDEX user_email_status FOR (u:User) ON (u.email, u.status)
Full-text indexes are appropriate for text search needs that don’t match exact values. Regular indexes require exact matches; full-text indexes support tokenization and text search operators.
Starting Points Matter Enormously
Graph traversals should start from specific nodes when possible, not scan entire node labels. Starting from specific nodes and traversing outward is fast. Scanning all nodes of a type to find starting points is slow.
Bad approach:
MATCH (c:Customer)-[:PURCHASED]->(p:Product)
WHERE c.region = 'Europe'
AND p.category = 'electronics'
RETURN c, p
This scans all customers, filters to Europe, then traverses to products and filters again.
Better approach with index on region:
MATCH (c:Customer {region: 'Europe'})-[:PURCHASED]->(p:Product {category: 'electronics'})
RETURN c, p
If you have indexes on both properties, the database can use them to find specific starting points and only traverse relationships between indexed nodes.
Best approach if you know specific customers:
MATCH (c:Customer)
WHERE c.id IN $customerIds
MATCH (c)-[:PURCHASED]->(p:Product {category: 'electronics'})
RETURN c, p
Starting from specific known nodes is always fastest when possible.
Aggregation Placement
Where you place aggregations in your query significantly impacts performance. Aggregating early reduces the amount of data processed in subsequent operations.
Inefficient aggregation:
MATCH (c:Customer)-[:PURCHASED]->(p:Product)
WITH c, p
MATCH (p)-[:IN_CATEGORY]->(cat:Category)
RETURN cat.name, count(DISTINCT c) AS customers
This carries individual customer-product-category rows through the entire query before aggregating.
More efficient:
MATCH (c:Customer)-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(cat:Category)
RETURN cat.name, count(DISTINCT c) AS customers
Doing all matching before aggregating allows the database to optimize better.
Sometimes aggregating mid-query is beneficial:
MATCH (c:Customer)-[:PURCHASED]->(p:Product)
WITH c, count(p) AS purchases
WHERE purchases > 10
MATCH (c)-[:LIVES_IN]->(region:Region)
RETURN region.name, count(c)
Aggregating purchase counts before the second match reduces the number of customers carried forward.
Using OPTIONAL MATCH Carefully
OPTIONAL MATCH is like SQL’s LEFT JOIN—it returns null when the pattern doesn’t match. But like LEFT JOINs, it can significantly impact performance if used carelessly.
Multiple OPTIONAL MATCHes can create explosions:
MATCH (p:Product)
OPTIONAL MATCH (p)<-[:PURCHASED]-(c:Customer)
OPTIONAL MATCH (p)<-[:REVIEWED]-(r:Review)
RETURN p, collect(c), collect(r)
Each OPTIONAL MATCH potentially multiplies the number of rows. If a product has 100 purchasers and 50 reviews, you get 5,000 rows before aggregation.
Separating the optional patterns helps:
MATCH (p:Product)
WITH p
OPTIONAL MATCH (p)<-[:PURCHASED]-(c:Customer)
WITH p, collect(c) AS customers
OPTIONAL MATCH (p)<-[:REVIEWED]-(r:Review)
RETURN p, customers, collect(r) AS reviews
Aggregating after each OPTIONAL MATCH prevents row explosion.
Direction Matters
Relationships have direction, even if your data model treats them as bidirectional. Query performance can vary significantly based on whether you traverse with or against relationship direction.
Generally, traversing in the direction relationships were created is faster. If you create (:Person)-[:KNOWS]->(:Person) relationships, querying with that direction is typically faster than querying against it.
If you need to query both directions regularly, consider creating relationships in both directions or using undirected patterns:
MATCH (a:Person {name: 'Alice'})-[:KNOWS]-(b:Person)
RETURN b
The pattern without arrow direction matches relationships in either direction. This is convenient but potentially slower than directed traversal if indexes and database internals optimize for specific directions.
Limiting and Pagination
Always limit results during development to avoid accidentally returning massive result sets. Add LIMIT even to queries you think will return small results—you might be wrong, and limiting prevents disasters.
For pagination, use SKIP and LIMIT:
MATCH (p:Product)
RETURN p
ORDER BY p.name
SKIP 100
LIMIT 10
This gets the 11th page of 10 results. Be aware that large SKIP values can be slow because the database still has to process the skipped rows.
For better pagination on large datasets, use cursor-based pagination with indexed properties:
MATCH (p:Product)
WHERE p.name > $lastSeenName
RETURN p
ORDER BY p.name
LIMIT 10
This is more efficient for deep pagination than SKIP because it doesn’t require processing all skipped rows.
Profiling Queries
Use EXPLAIN and PROFILE to understand query execution:
EXPLAIN
MATCH (p:Product)<-[:PURCHASED]-(c:Customer)
WHERE p.category = 'electronics'
RETURN c.name, count(p)
EXPLAIN shows the execution plan without running the query. PROFILE actually executes and shows row counts at each step.
Look for:
- Scans without index usage (NodeByLabelScan instead of NodeIndexSeek)
- High db hits relative to rows returned
- Cartesian products (CartesianProduct operator)
- Large row counts early in the plan that get filtered later
These indicate optimization opportunities.
Write Query Performance
Write performance has different considerations. Batching writes is crucial—inserting one row at a time is slow.
Bad:
for row in data:
session.run("CREATE (p:Product {name: $name})", name=row['name'])
Better:
session.run("UNWIND $rows AS row CREATE (p:Product {name: row.name})", rows=data)
Processing multiple rows in a single query reduces transaction overhead and allows better optimization.
MERGE operations are slower than CREATE because they must check for existence. If you know data doesn’t exist, use CREATE. If you’re unsure, use MERGE, but be aware of the performance cost.
Creating indexes before bulk loading speeds up the load because the database doesn’t need to maintain indexes during insertion. Drop indexes, load data, recreate indexes.
When to Use Relational Instead
Sometimes relational databases are better choices. If your queries don’t involve traversing relationships—you’re just filtering and aggregating on properties—a relational database might perform better.
Graph databases excel at:
- Traversing relationships of unpredictable depth
- Finding paths between nodes
- Queries where relationship existence matters more than properties
- Exploring connected data interactively
Relational databases often excel at:
- Aggregating large amounts of data based on property values
- Simple filtering without relationship traversal
- Highly structured data with fixed schema
- SQL-based reporting and BI tools
Real Performance Tuning
Most graph database performance issues resolve through:
- Adding appropriate indexes
- Fixing Cartesian products in queries
- Limiting variable-length path traversals
- Starting queries from specific nodes rather than label scans
- Aggregating earlier in queries
These five patterns address 80% of performance problems. Focus here before exotic optimization techniques.
Actual query execution time should guide optimization. Profile queries with real data volumes. What performs well on test datasets might perform terribly at production scale, and vice versa.
Graph databases are powerful tools that enable queries impossible or impractical in relational databases. But that power requires understanding their execution model and writing queries that align with it. The patterns that make relational queries fast don’t always translate to graph queries, and that’s okay—you just need to learn the graph-specific patterns instead.