Quiz: Advanced Topics and Distributed Systems
Test your understanding of distributed graph databases, partitioning strategies, replication models, consistency trade-offs, graph visualization, real-time versus batch analytics, and capstone project design.
1. What is the fundamental challenge of distributed graph databases?
- A. Distributed systems are always slower
- B. Graphs are inherently interconnected, but distributed systems require splitting data across machines, making traversals that cross machine boundaries expensive
- C. Distributed databases cannot store graphs
- D. There is no challenge
Show Answer
The correct answer is B. The core challenge is that graphs consist of nodes pointing to other nodes, but in a distributed system, those connected nodes might be on different physical machines. Following an edge that crosses a partition boundary requires a network call (milliseconds) instead of a memory pointer lookup (microseconds). This fundamental tension between graph connectivity and physical distribution is why partitioning strategies, data locality, and query optimization are critical in distributed graph systems.
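To make the cost gap concrete, here is a minimal Python sketch; the toy graph, partition map, and cost constants are all invented for illustration:

```python
# Toy cost model: real systems measure these numbers, they don't hardcode them.
graph = {               # adjacency list: node -> outgoing neighbors
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}
partition_of = {"a": 0, "b": 0, "c": 1, "d": 1}   # which machine holds each node

LOCAL_COST_US = 1       # rough in-memory hop (microseconds)
REMOTE_COST_US = 1000   # rough cross-machine network call (microseconds)

def traversal_cost(start):
    """Estimate the cost of visiting everything reachable from `start`."""
    cost, seen, frontier = 0, {start}, [start]
    while frontier:
        node = frontier.pop()
        for nbr in graph[node]:
            same_machine = partition_of[nbr] == partition_of[node]
            cost += LOCAL_COST_US if same_machine else REMOTE_COST_US
            if nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return cost

print(traversal_cost("a"))  # 2002: the two cross-partition edges dominate the total
```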
Concept Tested: Distributed Graph Databases
2. What is graph partitioning and what makes it challenging?
- A. Deleting parts of a graph
- B. Dividing a graph into subgraphs for distribution, challenging because minimizing edge cuts conflicts with maintaining balanced partition sizes
- C. Partitioning is trivial
- D. Graphs cannot be partitioned
Show Answer
The correct answer is B. Graph partitioning divides a large graph into smaller subgraphs that can be distributed across machines. The challenge is achieving multiple conflicting goals: minimize edge cuts (edges crossing partition boundaries, which require network calls), maintain balanced partitions (equal sizes), and preserve community structure (keep related nodes together). Achieving perfect balance often requires cutting through important communities, while minimizing cuts may create wildly imbalanced partitions. This trade-off is fundamental to distributed graph systems.
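The tension can be scored directly. In this sketch, the toy graph has one tight four-node community (a, b, c, d) plus a pendant node e, and `evaluate` is a hypothetical helper invented for the example:

```python
# Score a placement on the two competing goals: edge cut and balance.
from collections import Counter

edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

def evaluate(assignment):
    """Return (edge_cut, imbalance) for a node -> partition assignment."""
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    sizes = Counter(assignment.values())
    imbalance = max(sizes.values()) / min(sizes.values())  # 1.0 = perfect balance
    return cut, imbalance

# Keep the community whole: minimal cut, badly skewed partition sizes.
print(evaluate({"a": 0, "b": 0, "c": 0, "d": 0, "e": 1}))  # (1, 4.0)
# Force near-equal sizes: balanced, but four edges now cross machines.
print(evaluate({"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}))  # (4, 1.5)
```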
Concept Tested: Graph Partitioning
See: Graph Partitioning
3. Which sharding strategy is best for workloads dominated by multi-hop graph traversals?
- A. Random hash-based sharding
- B. Graph-aware partitioning that minimizes edge cuts between partitions
- C. Alphabetical sharding
- D. No sharding strategy helps traversals
Show Answer
The correct answer is B. For traversal-heavy workloads, graph-aware partitioning (using algorithms like modularity-based community detection) minimizes the number of edges crossing partition boundaries. This keeps frequently co-traversed nodes on the same physical machine, dramatically reducing network calls during multi-hop queries. While more complex to compute and maintain than hash-based sharding, it can improve traversal performance by 10-100x for connected queries. Hash-based sharding (A) scatters connected nodes randomly, maximizing cross-partition edges—terrible for traversals.
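A rough illustration of the difference, on an invented graph with two dense communities and a single bridge edge; the placement functions are assumptions for the example:

```python
# Hash placement scatters nodes; a community-aware placement cuts only the bridge.
import zlib

communities = [["u1", "u2", "u3", "u4"], ["v1", "v2", "v3", "v4"]]
edges = [(a, b) for grp in communities for a in grp for b in grp if a < b]
edges.append(("u1", "v1"))   # the one bridge edge between communities

def edge_cut(place):
    """Count edges whose endpoints land on different partitions."""
    return sum(1 for a, b in edges if place(a) != place(b))

def hash_place(node):        # scatters nodes with no regard for structure
    return zlib.crc32(node.encode()) % 2

def aware_place(node):       # keeps each community on one machine
    return 0 if node.startswith("u") else 1

print("hash-based cut: ", edge_cut(hash_place))   # usually cuts several in-community edges
print("graph-aware cut:", edge_cut(aware_place))  # cuts only the bridge edge: 1
```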
Concept Tested: Sharding Strategies, Graph Partitioning
See: Sharding Strategies
4. What is the CAP theorem and why must distributed systems choose between consistency and availability?
- A. CAP stands for Cost, Availability, Performance
- B. CAP theorem states you can have at most two of Consistency, Availability, and Partition tolerance; since network partitions are inevitable, systems must choose C or A
- C. All distributed systems are CAP-compliant
- D. CAP theorem only applies to relational databases
Show Answer
The correct answer is B. The CAP theorem proves that distributed systems can provide at most two of three guarantees: Consistency (all nodes see the same data), Availability (every request gets a response), and Partition tolerance (system works despite network failures). Since network partitions are inevitable in real distributed systems, you must have P, reducing the choice to CP (sacrifice availability for consistency) or AP (sacrifice consistency for availability). This fundamental trade-off shapes all distributed graph database architectures.
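A toy sketch of the forced choice during a partition (the `Replica` class is purely illustrative, not any real database's API): a replica cut off from its peer must either reject the write or accept local divergence:

```python
# One replica has lost contact with its peer and receives a write.
class Replica:
    def __init__(self, mode):
        self.mode = mode               # "CP" or "AP"
        self.value = "v0"
        self.peer_reachable = False    # a network partition has cut us off

    def write(self, value):
        if self.mode == "CP" and not self.peer_reachable:
            # Consistent but unavailable: refuse rather than risk divergence.
            return "ERROR: cannot reach quorum, rejecting write"
        # Available but inconsistent: accept locally, replicas now disagree.
        self.value = value
        return "OK (replicas may diverge until the partition heals)"

print(Replica("CP").write("v1"))
print(Replica("AP").write("v1"))
```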
Concept Tested: Consistency Models
5. How does replication improve distributed graph database resilience and performance?
- A. Replication slows systems down
- B. Maintaining multiple copies of data across machines provides fault tolerance (survives machine failures) and enables serving reads from the nearest replica
- C. Replication only wastes storage
- D. Distributed systems don't use replication
Show Answer
The correct answer is B. Replication keeps multiple copies of each data partition on different machines (typically a replication factor of RF=3). This provides fault tolerance: if one machine fails, the data still exists on the other replicas. It also improves read performance by letting queries be served from the geographically nearest or least-loaded replica. The trade-off is increased storage cost (3x for RF=3) and the complexity of keeping replicas consistent when writes occur.
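A minimal sketch of replica selection under failure, with an invented partition layout and latency table: reads fall back to the nearest surviving copy:

```python
# One partition with RF=3 replicas; us-east has just failed.
replicas = {"partition-7": ["us-east", "us-west", "eu-west"]}   # RF = 3
latency_ms = {"us-east": 5, "us-west": 70, "eu-west": 90}
alive = {"us-east": False, "us-west": True, "eu-west": True}

def pick_replica(partition):
    """Serve the read from the lowest-latency replica that is still up."""
    candidates = [r for r in replicas[partition] if alive[r]]
    if not candidates:
        raise RuntimeError("all replicas down: data unavailable")
    return min(candidates, key=lambda r: latency_ms[r])

print(pick_replica("partition-7"))  # "us-west": the read survives the failure
```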
Concept Tested: Replication
6. What distinguishes real-time graph analytics from batch processing?
- A. They are the same thing
- B. Real-time analytics answer queries in milliseconds while users wait (traversals, pattern matching), while batch processing computes over the entire graph offline (PageRank, community detection)
- C. Batch is always better
- D. Real-time cannot use graphs
Show Answer
The correct answer is B. Real-time analytics serve interactive queries with sub-second response times, typically involving traversals of small subgraphs (shortest paths, neighborhood exploration, pattern matching). Batch processing runs algorithms that must touch every node and edge, taking minutes or hours (global PageRank, connected components, community detection across the entire graph). Real-time queries require careful partitioning to minimize cross-partition hops; batch processing accepts network shuffle overhead during iterative computations. Both are valuable for different use cases.
Concept Tested: Real-Time Analytics, Batch Processing
7. Given a distributed graph database with replication factor RF=3, write quorum W=2, and read quorum R=2, what consistency guarantee exists?
- A. No consistency guarantee
- B. Strong consistency because R + W > RF, ensuring reads always see latest writes
- C. Eventual consistency only
- D. Writes will always fail
Show Answer
The correct answer is B. With R + W > RF (2 + 2 = 4 > 3), quorum reads are guaranteed to see the most recent write. Here's why: a write must succeed on W=2 replicas before the client is acknowledged, and a read from R=2 replicas must include at least one replica that participated in the most recent write (since W + R = 4 but only RF = 3 replicas exist, the two sets must overlap). This tunable consistency is more flexible than fixed schemes such as single-leader replication (strong only when reads go through the leader) or multi-leader replication (typically eventual).
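The overlap argument can be verified by brute force. This small check (an illustration of the arithmetic, not a real quorum client) enumerates every possible write set and read set for RF=3, W=2, R=2 and confirms they always intersect:

```python
# Every write set of size W must share a replica with every read set of size R.
from itertools import combinations

RF, W, R = 3, 2, 2
replicas = range(RF)

always_overlap = all(
    set(write_set) & set(read_set)
    for write_set in combinations(replicas, W)
    for read_set in combinations(replicas, R)
)
print(always_overlap)  # True: every quorum read includes a freshly written replica
```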
Concept Tested: Consistency Models, Replication
8. What is the primary performance bottleneck in distributed batch graph processing?
- A. CPU speed
- B. Network shuffle phase where vertex state updates are exchanged across machines between iterations
- C. Disk storage
- D. User interface
Show Answer
The correct answer is B. In iterative batch graph algorithms (PageRank, community detection), each iteration requires workers to exchange updated vertex values across the network, the so-called shuffle phase. For large graphs, each iteration can move gigabytes of vertex updates across the cluster. Algorithms that minimize shuffle by using vertex-local computations or clever aggregations perform far better than those requiring frequent state exchange. This is why frameworks like Pregel emphasize message-passing efficiency and why network topology matters in cluster design.
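A single-process toy version of a Pregel-style PageRank loop makes the two phases visible: the message exchange below is exactly what becomes network shuffle once vertices are spread across workers. The graph, damping factor, and iteration count are illustrative:

```python
# Pregel-flavored PageRank: message phase (shuffle) then vertex-local compute.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
N = len(out_edges)
DAMPING = 0.85
rank = {v: 1 / N for v in out_edges}

for superstep in range(20):
    # Message phase: every vertex sends its rank share along its out-edges.
    inbox = {v: 0.0 for v in out_edges}
    for v, nbrs in out_edges.items():
        share = rank[v] / len(nbrs)
        for u in nbrs:
            inbox[u] += share          # cross-worker messages = shuffle traffic
    # Compute phase: vertex-local update, no network required.
    rank = {v: (1 - DAMPING) / N + DAMPING * inbox[v] for v in out_edges}

print({v: round(r, 3) for v, r in rank.items()})
```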
Concept Tested: Batch Processing
9. What techniques make graph visualization effective for large distributed graphs?
- A. Displaying all nodes simultaneously
- B. Sampling subgraphs, aggregating communities into super-nodes, and progressive disclosure (expand-on-click) to manage visual complexity
- C. Random node placement
- D. Text-only display
Show Answer
The correct answer is B. Visualizing billion-node graphs requires managing complexity through: sampling (show important nodes or random samples, not everything), aggregation (roll up communities into single "super-nodes"), and progressive disclosure (start high-level, let users drill down by clicking). Layout algorithms (force-directed, hierarchical) position nodes to reveal structure. Filters let users hide/show by type or property. These techniques transform overwhelming hairballs into comprehensible, interactive visualizations that reveal graph insights.
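A sketch of the super-node idea, assuming community labels have already been computed by some detection pass (the graph and labels here are invented): collapse each community to one sized node and keep a single summary edge per community pair:

```python
# Roll up nodes by community; draw member counts instead of raw nodes.
from collections import Counter

community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]

super_nodes = Counter(community.values())   # community id -> member count
super_edges = {tuple(sorted((community[u], community[v])))
               for u, v in edges if community[u] != community[v]}

print(super_nodes)   # Counter({0: 3, 1: 2, 2: 1}) -> three sized bubbles
print(super_edges)   # {(0, 1), (1, 2)} -> two summary edges instead of five
```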
Concept Tested: Graph Visualization
10. What criteria should guide choosing between single-server and distributed graph databases?
- A. Always choose distributed
- B. Consider graph size, availability requirements, query locality, and operational maturity—distribute when data exceeds single-server capacity or high availability is critical
- C. Single-server is always better
- D. Cost is the only factor
Show Answer
The correct answer is B. The decision to distribute should be based on: (1) Data size—if your graph fits in a large server's RAM (<100GB), single-server may suffice; (2) Availability requirements—if downtime is unacceptable, distributed replication provides resilience; (3) Query patterns—if most queries touch small subgraphs, distribution works well, but whole-graph queries fight the architecture; (4) Operational capability—running distributed systems requires monitoring, failover planning, and expertise. Many applications should start single-server and distribute only when hitting concrete limits.
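One way to internalize the checklist is to write it as code. The function below is a hypothetical decision helper; the 100 GB threshold and the warnings are illustrative defaults, not authoritative rules:

```python
def should_distribute(graph_gb, downtime_acceptable,
                      queries_touch_small_subgraphs, ops_team_ready):
    """Return True when concrete limits argue for a distributed deployment."""
    over_capacity = graph_gb > 100           # rough "fits in one server's RAM" line
    needs_high_availability = not downtime_acceptable
    if not (over_capacity or needs_high_availability):
        return False                         # start single-server; revisit later
    if not queries_touch_small_subgraphs:
        print("caution: whole-graph queries will fight a distributed design")
    if not ops_team_ready:
        print("caution: budget for monitoring, failover drills, and expertise")
    return True

print(should_distribute(graph_gb=40, downtime_acceptable=True,
                        queries_touch_small_subgraphs=True, ops_team_ready=False))
# -> False: nothing here justifies the operational cost of distribution
```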
Concept Tested: Distributed Graph Databases, Capstone Project Design
Quiz Complete!
Questions: 10
Cognitive Levels: Remember (2), Understand (3), Apply (3), Analyze (2)
Concepts Covered: Distributed Graph Databases, Graph Partitioning, Sharding Strategies, Replication, Consistency Models, Graph Visualization, Interactive Queries, Real-Time Analytics, Batch Processing, Capstone Project Design
Next Steps:
- Review the chapter content for distributed systems concepts
- Design your capstone project synthesizing all course concepts
- Apply graph database principles to a real-world domain
Congratulations on completing all 12 chapter quizzes!