Graph Algorithms and Analytics
Summary
This chapter teaches fundamental and advanced graph algorithms for healthcare analytics. You will learn shortest path algorithms, centrality measures (PageRank, betweenness, degree), clustering coefficients, connected components, cycle detection, graph pattern recognition, similarity measures, link prediction, graph embeddings, node embeddings, and graph neural networks. These algorithms provide powerful tools for discovering insights, predicting outcomes, and understanding complex relationships in healthcare data.
Concepts Covered
This chapter covers the following 15 concepts from the learning graph:
- Graph Algorithm
- Shortest Path Algorithm
- Centrality Measure
- Pagerank Algorithm
- Betweenness Centrality
- Degree Centrality
- Clustering Coefficient
- Connected Components
- Cycle Detection
- Graph Pattern Recognition
- Similarity Measure
- Link Prediction
- Graph Embedding
- Node Embedding
- Graph Neural Network
Prerequisites
This chapter builds on concepts from:
- Chapter 01: Graph Theory and Database Foundations
- Chapter 03: Graph Query Languages and Technologies
Introduction
Graph algorithms represent the computational methods that traverse, analyze, and extract insights from graph-structured data, transforming raw connections into actionable intelligence. In healthcare, where relationships between patients, providers, medications, diagnoses, and procedures create intricate networks of causality and correlation, graph algorithms enable discovery of patterns invisible to traditional analytics approaches. Unlike statistical methods that treat observations as independent data points, graph algorithms explicitly model and leverage the relationships between entities, recognizing that understanding the connections is often more valuable than understanding isolated attributes.
The power of graph algorithms in healthcare analytics stems from their ability to answer questions that are difficult or impossible with relational databases or traditional business intelligence tools. How does information flow through a referral network? Which providers are most influential in determining care patterns? What is the most efficient care pathway for a specific condition? Which patients are most similar based on their complete treatment history? These questions require algorithms that can traverse relationships, measure centrality, detect communities, identify patterns, and predict future connections. This chapter explores both classical graph algorithms (shortest path, centrality measures, clustering) and modern graph data science techniques (graph embeddings, graph neural networks) with specific applications to healthcare data and decision-making.
Understanding Graph Algorithms
Before examining specific algorithms, we must understand what makes graph algorithms fundamentally different from traditional data processing approaches and why they are particularly well-suited to healthcare analytics. Graph algorithms are computational procedures that operate on graph-structured data, typically involving traversal of nodes and edges to compute metrics, identify patterns, or make predictions. Unlike algorithms that process tables row-by-row or arrays element-by-element, graph algorithms navigate relationships as first-class entities, with performance characteristics that depend on network topology rather than simple data volume.
Graph algorithms can be categorized into several major families based on their computational approach and application purpose:
Pathfinding algorithms find routes through networks by traversing edges between nodes, answering questions about connectivity, distance, and optimal routes. Applications include care pathway optimization, supply chain analysis, and patient journey mapping.
Centrality algorithms measure the importance or influence of nodes within a network based on their position, connections, or role in information flow. Healthcare applications include identifying key opinion leaders, critical care coordinators, and influential providers.
Community detection algorithms partition networks into groups of densely connected nodes that are sparsely connected to other groups, revealing natural clustering based on relationship patterns. Use cases include patient segmentation, provider network analysis, and treatment protocol identification.
Similarity algorithms measure how alike two nodes are based on their attributes, connections, or neighborhood structure, enabling recommendation systems, duplicate detection, and cohort identification.
Link prediction algorithms estimate the probability of future relationships forming between nodes based on current network structure and node attributes, supporting referral prediction, readmission risk assessment, and care coordination optimization.
The following table compares key characteristics of major graph algorithm families:
| Algorithm Family | Computational Complexity | Typical Healthcare Use Cases | Key Metrics Produced | Example Algorithms |
|---|---|---|---|---|
| Pathfinding | O(V + E) to O(V²) | Care pathways, referral routes, supply chains | Distance, path cost, route | Dijkstra, A*, Breadth-first search |
| Centrality | O(V×E) to O(V³) | Provider influence, care coordinators, key facilities | Centrality scores, rankings | PageRank, Betweenness, Degree |
| Community Detection | O(V×E) to O(V²×log V) | Patient cohorts, provider networks, service clusters | Community membership, modularity | Louvain, Label Propagation, Connected Components |
| Similarity | O(V²) to O(V³) | Patient matching, duplicate detection, recommendations | Similarity scores, rankings | Jaccard, Cosine, Node2Vec |
| Link Prediction | O(V×E) to O(V²) | Referral prediction, readmission risk, care gaps | Probability scores, predictions | Adamic-Adar, Common Neighbors, GraphSAGE |
| Pattern Matching | O(V×E) to exponential | Fraud detection, clinical pathways, quality patterns | Pattern instances, counts | Subgraph isomorphism, Motif finding |
Shortest Path Algorithms and Care Pathways
Shortest path algorithms find the minimum-cost route between two nodes in a graph, where cost might represent physical distance, time, expense, or any other metric assigned to edges. In healthcare, shortest path algorithms enable optimization of patient journeys through complex care processes, identification of efficient referral routes, and discovery of treatment pathways that minimize time to diagnosis or therapeutic intervention.
The most widely used shortest path algorithm is Dijkstra's algorithm, which finds the shortest path from a single source node to all other nodes in a graph with non-negative edge weights. The algorithm maintains a set of nodes with known shortest distances and iteratively expands this set by selecting the unvisited node with minimum distance, updating the distances to its neighbors. Dijkstra's algorithm has time complexity O((V+E) log V) with a priority queue implementation, making it practical for large healthcare networks.
Healthcare applications of shortest path algorithms include:
Care pathway optimization: Finding the fastest route from symptom presentation to definitive diagnosis by traversing a graph of diagnostic tests, specialist consultations, and imaging studies, with edge weights representing time to next step or probability of diagnostic yield.
Referral network navigation: Identifying the most efficient referral path from primary care to specialized treatment by analyzing provider networks, with edge weights representing wait times, geographic distance, or referral acceptance rates.
Supply chain logistics: Optimizing medical supply delivery routes in hospital networks or pharmaceutical distribution networks, with edge weights representing transportation cost, time, or reliability.
Emergency response: Calculating optimal ambulance routing and hospital selection based on real-time capacity, specialty availability, and transport time, with edge weights dynamically updated based on current conditions.
Clinical trial matching: Finding the shortest path from patient diagnosis to appropriate clinical trial enrollment by traversing eligibility criteria, trial availability, and referral processes.
The following Cypher query demonstrates finding the shortest path between a patient's current provider and a specialist with the required expertise:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | |
Beyond single-source shortest path, all-pairs shortest path algorithms compute the shortest paths between every pair of nodes in the graph, enabling comprehensive network analysis. The Floyd-Warshall algorithm solves all-pairs shortest path in O(V³) time, which is practical for moderate-sized networks (thousands of nodes) but becomes prohibitive for very large graphs. Healthcare applications include analyzing the overall efficiency of referral networks, identifying bottlenecks in care delivery, and computing comprehensive distance matrices for patient-provider matching.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 | |
Centrality Measures: Identifying Important Nodes
Centrality measures quantify the importance, influence, or prominence of nodes within a network based on their structural position and connections. Different centrality algorithms capture different notions of importance: how connected a node is, how central it is to network flow, or how influential it is in the broader network structure. In healthcare networks, centrality measures identify key providers who coordinate care, influential facilities that serve as regional hubs, critical supply chain nodes, and patients who bridge different care communities.
Degree centrality is the simplest centrality measure, counting the number of edges connected to a node (or distinguishing between incoming and outgoing edges in directed graphs). In healthcare provider networks, high degree centrality indicates providers who see many patients, have many referral relationships, or are connected to many other providers. While simple, degree centrality can be misleading because it doesn't distinguish between connections to highly connected nodes versus isolated nodes.
Betweenness centrality measures how often a node appears on shortest paths between other pairs of nodes, identifying nodes that act as bridges or brokers in network flow. In healthcare, high betweenness centrality indicates providers who serve as critical connectors between different parts of the care network, patients who link disparate provider communities, or facilities that are essential waypoints in patient journeys. Betweenness centrality has O(V×E) computational complexity for unweighted graphs, making it expensive for very large networks but feasible for typical healthcare networks of thousands to tens of thousands of nodes.
PageRank is an algorithm originally developed by Google to rank web pages, measuring importance based on both the quantity and quality of incoming links. A node receives high PageRank if it is linked to by many nodes, or by nodes that themselves have high PageRank. In healthcare provider networks, PageRank identifies providers who are trusted by other influential providers, creating a recursive measure of reputation. Unlike degree centrality, PageRank accounts for the quality of connections, not just quantity.
The PageRank algorithm works by iteratively distributing rank scores across the network following this formula:
1 | |
Where: - PR(A) = PageRank of node A - d = damping factor (typically 0.85) - T_i = nodes that link to A - C(T_i) = number of outgoing links from T_i
Healthcare applications of centrality measures include:
- Care coordinator identification: Finding providers with high betweenness centrality who serve as critical coordinators connecting multiple specialties
- Referral network optimization: Using PageRank to identify the most trusted referral targets, helping payers design narrow networks around high-quality providers
- Supply chain resilience: Identifying critical nodes in medical supply networks whose disruption would have cascading effects
- Patient navigation: Finding patients with high betweenness who could serve as peer navigators bridging different communities
- Outbreak tracing: Using centrality to prioritize contact tracing and vaccination of highly connected individuals in disease transmission networks
The following list summarizes key centrality measures and their interpretations:
- Degree Centrality: Number of connections; indicates popularity or activity level
- Betweenness Centrality: Frequency on shortest paths; indicates brokerage or bridge role
- Closeness Centrality: Average distance to all other nodes; indicates accessibility or reach
- Eigenvector Centrality: Connections to well-connected nodes; indicates association with influential nodes
- PageRank: Quality-weighted incoming connections; indicates authority or trustworthiness
- Harmonic Centrality: Sum of inverse distances; variant of closeness that handles disconnected graphs
- Katz Centrality: Walks of all lengths weighted by attenuation factor; generalizes eigenvector centrality
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 | |
Clustering Coefficient and Community Structure
The clustering coefficient measures the degree to which nodes in a network tend to cluster together, forming tightly connected groups. At the local level, a node's clustering coefficient is the fraction of its neighbors that are also connected to each other, ranging from 0 (none of its neighbors are connected) to 1 (all neighbors are fully connected). The global clustering coefficient averages local clustering coefficients across all nodes, providing a network-level measure of community structure.
In healthcare networks, high clustering coefficients indicate strong community formation, which can be interpreted differently depending on context. Among providers, clustering might indicate coordinated care teams working closely together, or it might reveal insular practice patterns that limit patient access to diverse expertise. Among patients, clustering might reveal shared risk factors, geographic communities, or social networks that influence health behaviors.
Connected components are maximal subgraphs where every node is reachable from every other node within the component but not reachable from nodes outside the component. Finding connected components partitions a graph into disjoint groups, revealing the fundamental structure of connectivity. In healthcare applications, connected components identify separate care networks, isolated patient populations, or disconnected supply chains.
The following compares local clustering with global network structure:
Local clustering (node-level): - Measures: Fraction of a node's neighbor pairs that are connected - Range: 0.0 to 1.0 - Interpretation: How tightly grouped are this node's connections? - Healthcare example: A patient's providers all coordinate with each other (high clustering) vs. seeing unconnected specialists (low clustering)
Global clustering (network-level): - Measures: Average of all local clustering coefficients - Range: 0.0 to 1.0 - Interpretation: Overall tendency toward community formation - Healthcare example: Provider network overall has strong care teams (high clustering) vs. fragmented care delivery (low clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 | |
Cycle Detection and Pattern Recognition
Cycle detection identifies closed loops in a graph where a path exists from a node back to itself, which can indicate various phenomena depending on the healthcare context. In financial transactions, cycles might indicate circular money flows suggesting fraud or kickback schemes. In care delivery, cycles might represent patients bouncing between providers without resolution, or appropriate care coordination where patients return to their primary provider after specialist consultation.
Graph pattern recognition (also called subgraph matching or motif detection) finds specific structural patterns within larger graphs, enabling discovery of recurring organizational structures, workflow patterns, or relationship configurations. Patterns can be as simple as triangles (three mutually connected nodes) or as complex as multi-node configurations representing specific clinical pathways or care team structures.
Common graph patterns in healthcare and their interpretations:
Triangle (3-cycle): Three nodes mutually connected - Provider context: Care team collaboration (PCP, specialist, facility all coordinate) - Patient context: Shared care management (patient sees three coordinating providers) - Fraud context: Circular referral pattern (potential kickback indicator)
Star pattern: Central node connected to many peripheral nodes with no connections among peripherals - Provider context: Hub specialist receiving many independent referrals - Patient context: Patient seeing many specialists without coordination - DME context: Fraudulent supplier receiving referrals from many unconnected physicians
Chain pattern: Linear sequence of connections - Care pathway: Patient journey through sequence of providers - Supply chain: Product flow from manufacturer to patient - Information flow: How knowledge propagates through network
Clique: Fully connected subgraph where every node connects to every other node - Care team: Tightly integrated multidisciplinary team - Patient cohort: Patients who share all the same providers and conditions - Research network: Collaborating investigators
The following Cypher query finds circular referral patterns that might indicate kickback schemes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | |
Similarity Measures and Link Prediction
Similarity measures quantify how alike two nodes are based on their attributes, connections, or position in the network. Different similarity algorithms capture different notions of likeness: shared neighbors, overlapping attributes, or structural equivalence. In healthcare, similarity measures enable patient matching for cohort studies, duplicate record detection, treatment recommendation based on similar patients, and provider peer comparison.
Common graph-based similarity measures include:
Jaccard similarity measures the overlap between two nodes' neighborhoods as the size of the intersection divided by the size of the union:
1 | |
Where N(A) is the set of neighbors of node A. In healthcare, Jaccard similarity can identify patients with similar provider networks, providers serving similar patient populations, or conditions with overlapping medication regimens.
Cosine similarity measures the cosine of the angle between two vectors representing node attributes or connections, ranging from 0 (orthogonal, completely dissimilar) to 1 (parallel, identical).
Adamic-Adar weights shared neighbors by the inverse logarithm of their degree, giving more importance to rare shared connections:
1 | |
Link prediction estimates the likelihood of future relationships forming between currently unconnected nodes based on network structure and node attributes. Link prediction algorithms analyze existing network patterns to predict missing relationships, future connections, or relationships not yet observed in the data.
Healthcare applications of similarity and link prediction include:
- Readmission prediction: Predicting which patients will return to hospital based on similarity to previously readmitted patients
- Referral prediction: Predicting which specialists a primary care provider will refer to based on referral patterns of similar providers
- Drug interaction discovery: Predicting likely drug-drug interactions based on similarity of molecular structures and known interactions
- Clinical trial matching: Finding patients similar to those who benefited from specific treatments
- Missing diagnosis detection: Identifying likely diagnoses that weren't coded based on similarity to other patients with same symptoms and test results
Graph Embeddings and Representation Learning
Graph embeddings are low-dimensional vector representations of nodes that preserve graph structure, enabling machine learning algorithms to operate on graph data. By transforming nodes into vectors (typically 64-512 dimensions), graph embedding techniques make graph data compatible with standard machine learning tools like classification, regression, and clustering algorithms while capturing complex network topology in numeric features.
Node embeddings specifically represent individual nodes as vectors, where nodes with similar network positions or properties receive similar embedding vectors. The goal is to learn a mapping function that transforms each node into a vector such that the distance between vectors reflects network proximity or structural similarity.
Popular node embedding techniques include:
Node2Vec uses random walks to sample the network neighborhood of each node, then applies Word2Vec-style learning to generate embeddings where nodes appearing in similar walk contexts receive similar embeddings. Node2Vec allows tuning between breadth-first (local structure) and depth-first (global structure) exploration through parameters p and q.
GraphSAGE (Graph Sample and Aggregate) learns embeddings by sampling and aggregating features from a node's local neighborhood using neural networks. Unlike Node2Vec which requires retraining for new nodes, GraphSAGE can generate embeddings for previously unseen nodes (inductive learning).
DeepWalk performs random walks starting from each node and treats the walks as sentences, applying Word2Vec to learn embeddings. DeepWalk is simpler than Node2Vec but doesn't allow tuning exploration strategy.
Healthcare applications of graph embeddings include:
- Patient representation learning: Creating patient embedding vectors that capture complete medical history, provider relationships, and condition patterns for use in predictive models
- Medication recommendation: Embedding medications based on co-prescription networks to recommend alternatives or predict drug interactions
- Disease progression modeling: Learning disease embeddings that capture progression patterns and comorbidity relationships
- Provider performance prediction: Creating provider embeddings that capture practice patterns, patient populations, and outcomes for quality prediction
- Medical concept organization: Embedding clinical concepts (diagnoses, procedures, symptoms) based on co-occurrence to create semantic similarity measures
Graph Neural Networks (GNNs) extend deep learning to graph-structured data by defining neural network layers that operate on graphs, aggregating information from neighboring nodes through message passing. GNNs learn node, edge, or graph-level representations end-to-end for specific prediction tasks, achieving state-of-the-art results on many graph analytics problems.
Key GNN architectures include:
- Graph Convolutional Networks (GCN): Apply convolution-like operations on graphs, aggregating neighbor features
- Graph Attention Networks (GAT): Use attention mechanisms to weight neighbor importance dynamically
- GraphSAGE: Samples and aggregates neighborhood features with learned aggregation functions
- Graph Isomorphism Networks (GIN): Maximally expressive GNN that can distinguish different graph structures
Healthcare applications of GNNs:
- Polypharmacy side effect prediction: Predicting adverse drug combinations using GNNs on drug-drug interaction networks
- Disease diagnosis: Classifying patient diagnoses based on symptom networks and patient similarity graphs
- Molecule property prediction: Predicting drug efficacy or toxicity from molecular graph structure
- Hospital readmission prediction: Using GNNs on patient-provider-diagnosis networks to predict readmission risk
- Treatment outcome prediction: Predicting treatment success based on patient similarity networks and historical outcomes
Summary and Key Takeaways
Graph algorithms provide the computational foundation for extracting insights from healthcare's complex relational data, enabling analysis that would be impossible with traditional approaches. By operating directly on network structure rather than flattened tabular representations, graph algorithms preserve and leverage the rich connectivity inherent in healthcare data—from patient journeys through care systems to provider collaboration networks to disease comorbidity patterns.
Key concepts covered in this chapter include:
- Graph algorithm fundamentals: Understanding how graph algorithms differ from traditional data processing and why they are essential for healthcare analytics
- Shortest path algorithms: Finding optimal routes through care networks, diagnostic pathways, and supply chains using Dijkstra's algorithm and variants
- Centrality measures: Identifying important nodes through degree centrality, betweenness centrality, and PageRank to find key providers, critical facilities, and influential nodes
- Clustering and community detection: Measuring local clustering coefficients and identifying communities through Louvain algorithm to reveal patient cohorts and provider networks
- Cycle detection and pattern recognition: Finding circular patterns that might indicate coordination or fraud, and recognizing recurring structural motifs
- Similarity measures: Quantifying node similarity through Jaccard, cosine, and Adamic-Adar measures for patient matching and recommendation systems
- Link prediction: Predicting future relationships for readmission risk, referral patterns, and missing data imputation
- Graph embeddings: Learning vector representations of nodes through Node2Vec, GraphSAGE, and other techniques for machine learning integration
- Graph Neural Networks: Applying deep learning to graph data for prediction, classification, and representation learning on healthcare networks
As healthcare organizations accumulate ever-larger datasets capturing detailed patient journeys, provider interactions, and treatment outcomes, the ability to apply sophisticated graph algorithms becomes increasingly critical for extracting actionable insights. These algorithms transform raw connectivity into strategic intelligence, supporting clinical decision-making, operational optimization, population health management, and biomedical research.
In the next chapter, we will explore how graph databases integrate with artificial intelligence and machine learning systems, combining graph analytics with predictive models to enable precision medicine, clinical decision support, and intelligent healthcare applications.
References
-
Neo4j Graph Data Science Library - 2024 - Neo4j Documentation - Comprehensive documentation for Neo4j's graph algorithms library covering pathfinding, centrality, community detection, and similarity algorithms essential for healthcare network analysis.
-
Network Science by Albert-László Barabási - 2016 - Cambridge University Press - Foundational textbook on network science explaining graph algorithms, scale-free networks, and community structures with applications relevant to healthcare social networks and disease spread modeling.
-
Graph Algorithms: Practical Examples in Apache Spark and Neo4j - 2019 - O'Reilly Media - Practical guide to implementing graph algorithms including PageRank, community detection, and pathfinding with healthcare use cases and performance optimization strategies.