Graph Theory Fundamentals

Summary

Builds a foundation in graph theory covering node and edge types, directed and weighted graphs, bipartite and labeled property graphs, centrality measures, traversal algorithms, and network models such as scale-free and small-world networks.

Concepts Covered

This chapter covers the following 40 concepts from the learning graph:

Graph Theory
Nodes and Edges
Directed Graphs
Undirected Graphs
Weighted Graphs
Bipartite Graphs
Labeled Property Graph
Multigraph
Hypergraph
Subgraph
Graph Properties
Degree Distribution
In-Degree
Out-Degree
Clustering Coefficient
Centrality Measures
Degree Centrality
Betweenness Centrality
Closeness Centrality
Eigenvector Centrality
PageRank
Connected Components
Strongly Connected Comp
Graph Traversal
Breadth-First Search
Depth-First Search
Shortest Path Algorithms
Dijkstra Algorithm
Graph Density
Graph Diameter
Scale-Free Networks
Small-World Networks
Power-Law Distribution
Random Graph Models
Erdos-Renyi Model
Barabasi-Albert Model
Network Motifs
Graph Isomorphism
Adjacency Matrix
Edge List Representation

Prerequisites

This chapter assumes only the prerequisites listed in the course description.

Welcome, Explorers! Let's Connect the Dots!

Graphs are everywhere in biology — from protein interaction networks to metabolic pathways, from phylogenetic trees to gene regulatory circuits. In this chapter, we build the mathematical vocabulary and algorithmic toolkit that will power every biological network analysis in the rest of this course. Let's connect the dots!

What Is Graph Theory?

Graph theory is the branch of mathematics that studies structures made of objects and the connections between them. A graph \(G\) is formally defined as an ordered pair \(G = (V, E)\), where \(V\) is a set of nodes (also called vertices) and \(E\) is a set of edges (also called links or arcs) that connect pairs of nodes. The field originated with Leonhard Euler's 1736 solution to the Konigsberg bridge problem — arguably the first theorem of topology — and has since become one of the most widely applied areas of discrete mathematics.

In bioinformatics, graph theory provides the natural mathematical framework for modeling biological relationships. Proteins interact with other proteins, genes regulate other genes, metabolites flow through enzymatic reactions, and diseases share genetic risk factors. All of these relationships map directly to nodes and edges, which is why graph-based approaches have become central to modern computational biology.

Biological System	Nodes	Edges
Protein-protein interaction (PPI) network	Proteins	Physical binding or co-complex membership
Gene regulatory network	Genes or transcription factors	Regulatory activation or repression
Metabolic network	Metabolites and enzymes	Biochemical reactions
Phylogenetic tree	Species or sequences	Evolutionary descent
Drug-target network	Drugs and proteins	Binding affinity

Nodes and Edges: The Building Blocks

Every graph is built from just two primitives: nodes and edges. A node represents an entity — a protein, a gene, a metabolite, a patient — and an edge represents a relationship between two entities. The simplest possible graph contains a single node and no edges; the simplest non-trivial graph contains two nodes joined by one edge.

Edges can carry different semantics depending on the biological context. An edge in a PPI network might mean "these two proteins co-immunoprecipitated in a pull-down assay," while an edge in a knowledge graph might mean "this drug treats this disease." The meaning is determined by the domain, not by the mathematics — graph theory provides the structural analysis tools, and the biologist provides the interpretation.

Formally, each edge \(e \in E\) connects a pair of nodes. In an undirected graph, an edge is an unordered pair \(\{u, v\}\), meaning the relationship is symmetric — if protein A interacts with protein B, then protein B also interacts with protein A. In a directed graph (or digraph), an edge is an ordered pair \((u, v)\), indicating a directional relationship — gene A activates gene B, but that does not imply gene B activates gene A.

Types of Graphs

Directed and Undirected Graphs

The distinction between directed graphs and undirected graphs is fundamental. Gene regulatory networks are inherently directed: a transcription factor binds to a promoter and activates (or represses) a target gene. The regulatory signal flows in one direction. Protein-protein interaction networks, by contrast, are typically modeled as undirected because physical binding is mutual.

In a directed graph, each node has two distinct degree counts. The in-degree of a node \(v\) is the number of edges pointing toward \(v\), and the out-degree is the number of edges pointing away from \(v\). A node with high in-degree in a gene regulatory network is a gene that receives regulatory input from many transcription factors — a potential integration hub. A node with high out-degree is a master regulator that controls many downstream targets.

\[\text{deg}^{-}(v) = |\{(u, v) \in E\}| \qquad \text{deg}^{+}(v) = |\{(v, w) \in E\}|\]

Weighted Graphs

A weighted graph assigns a numerical value \(w(e)\) to each edge \(e\). In bioinformatics, edge weights can represent interaction confidence scores (e.g., STRING database scores from 0 to 1000), binding affinities (dissociation constants), expression correlation coefficients, or evolutionary distances. Formally, a weighted graph is a triple \(G = (V, E, w)\) where \(w: E \rightarrow \mathbb{R}\).

Weights transform the topology of a graph into a quantitative landscape. Two proteins might both interact with a hub protein, but if one interaction has a confidence score of 0.95 and the other has a score of 0.3, their biological significance differs dramatically. Many graph algorithms — shortest path, minimum spanning tree, community detection — can exploit weight information to produce more biologically meaningful results.

Bipartite Graphs

A bipartite graph is a graph whose nodes can be partitioned into two disjoint sets \(U\) and \(W\) such that every edge connects a node in \(U\) to a node in \(W\) — no edges exist within the same set. Metabolic networks provide a classic example: one partition contains metabolites and the other contains enzymes, with edges connecting each enzyme to the substrates it consumes and the products it generates.

Other bioinformatics applications of bipartite graphs include:

Drug-target networks — drugs in one partition, protein targets in the other
Gene-disease association networks — genes in one partition, disease phenotypes in the other
Transcription factor-gene networks — TFs in one partition, target genes in the other

Diagram: Bipartite Graph of a Metabolic Sub-network

View Bipartite Metabolic Network MicroSim Fullscreen

About this MicroSim

- sim-id: bipartite-metabolic - Library: p5.js - Status: Specified This interactive diagram shows a bipartite layout of a simplified glycolysis sub-network with enzyme nodes on one side and metabolite nodes on the other. Hover over nodes to highlight connected edges and partners.

What's the Link?

Notice how bipartite graphs force us to separate node types. This constraint mirrors real biology — an enzyme is not a metabolite, and keeping them in distinct partitions prevents us from accidentally creating nonsensical enzyme-enzyme "reaction" edges. Structure enforces semantics!

Labeled Property Graphs

A labeled property graph extends the basic graph model by attaching labels (type tags) and properties (key-value pairs) to both nodes and edges. This is the data model used by graph databases such as Neo4j and is particularly powerful for knowledge graphs in bioinformatics. A node might carry the label Protein with properties {name: "TP53", organism: "Homo sapiens", length: 393}, while an edge might carry the label INTERACTS_WITH with properties {method: "yeast two-hybrid", confidence: 0.87}.

The labeled property graph model is more expressive than simple mathematical graphs because it can represent heterogeneous biological data — proteins, genes, diseases, drugs, pathways, and publications — in a single unified structure. This flexibility makes it the backbone of modern biomedical knowledge graphs such as Hetionet and the Monarch Initiative.

Multigraphs and Hypergraphs

A multigraph allows multiple edges between the same pair of nodes. This is useful when two proteins interact through multiple distinct mechanisms — for example, one physical binding interaction and one phosphorylation event. Each edge represents a different type of relationship, and keeping them separate preserves biological detail that would be lost if we collapsed them into a single edge.

A hypergraph generalizes the concept of an edge to connect any number of nodes simultaneously, not just pairs. A hyperedge \(e \subseteq V\) can contain two, three, or more nodes. Protein complexes are a natural application: the ribosome is not a pairwise interaction — it is a multi-protein assembly that functions as a unit. Hypergraphs can represent such higher-order relationships directly, whereas standard graphs must approximate them using cliques (fully connected subgraphs).

Subgraphs

A subgraph \(G' = (V', E')\) of a graph \(G = (V, E)\) is a graph where \(V' \subseteq V\) and \(E' \subseteq E\), with every edge in \(E'\) connecting nodes that are both in \(V'\). Subgraph extraction is one of the most common operations in biological network analysis — for example, extracting the subnetwork of all proteins in a particular signaling pathway from a genome-wide PPI network, or isolating the metabolic subnetwork active in a specific tissue.

Graph Representations

How we store a graph in computer memory profoundly affects algorithm performance. Two standard representations dominate.

Adjacency Matrix

An adjacency matrix \(A\) for a graph with \(n\) nodes is an \(n \times n\) matrix where entry \(A_{ij} = 1\) if there is an edge from node \(i\) to node \(j\), and \(A_{ij} = 0\) otherwise. For weighted graphs, \(A_{ij}\) stores the edge weight instead of 1. The adjacency matrix for an undirected graph is always symmetric (\(A_{ij} = A_{ji}\)).

\[A_{ij} = \begin{cases} w(i,j) & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}\]

Adjacency matrices support constant-time edge lookup — checking whether two nodes are connected takes \(O(1)\) time. However, they require \(O(n^2)\) space regardless of how many edges exist, which is wasteful for sparse biological networks. A human PPI network has roughly 20,000 protein nodes but only about 300,000 known interactions — the adjacency matrix would contain 400 million entries, of which 99.85% are zeros.

Edge List Representation

An edge list stores the graph as a list of \((u, v)\) pairs (or \((u, v, w)\) triples for weighted graphs). This representation uses \(O(|E|)\) space, making it efficient for sparse graphs. Edge lists are also the most common format in bioinformatics data files — the STRING database, BioGRID, and IntAct all distribute their interaction data as tab-separated edge lists.

Representation	Space	Edge Lookup	Iterate Neighbors	Best For
Adjacency matrix	\(O(n^2)\)	\(O(1)\)	\(O(n)\)	Dense graphs, matrix algorithms
Edge list	\(O(m)\)	\(O(m)\)	\(O(m)\)	Sparse graphs, file storage
Adjacency list	\(O(n + m)\)	\(O(\text{deg}(v))\)	\(O(\text{deg}(v))\)	General-purpose, most algorithms

Follow the Edges!

When working with biological networks, always check the sparsity first. Most biological networks are sparse (density below 1%), so adjacency lists or edge lists will save you orders of magnitude in memory compared to adjacency matrices. Reserve matrix representations for algorithms that specifically require them, such as spectral clustering or eigenvalue computations.

Graph Properties

Several global properties characterize a graph's overall structure and are routinely reported in biological network studies.

Degree Distribution

The degree distribution \(P(k)\) gives the probability that a randomly chosen node has degree \(k\). It is one of the most informative summaries of network topology. To compute it, count the number of nodes with each degree value and normalize by the total number of nodes.

\[P(k) = \frac{|\{v \in V : \text{deg}(v) = k\}|}{|V|}\]

Plotting the degree distribution on a log-log scale reveals whether the network follows a power-law distribution — a hallmark of scale-free networks that we will discuss later in this chapter.

Graph Density

Graph density measures how close a graph is to being complete (having all possible edges). For an undirected graph:

\[D = \frac{2|E|}{|V|(|V|-1)}\]

For a directed graph, the factor of 2 is removed because edges \((u,v)\) and \((v,u)\) are distinct. Most biological networks have densities well below 0.01, meaning less than 1% of all possible edges are present.

Graph Diameter

The graph diameter is the longest shortest path between any pair of nodes in a connected graph:

\[\text{diam}(G) = \max_{u,v \in V} d(u, v)\]

where \(d(u,v)\) is the length of the shortest path from \(u\) to \(v\). Biological networks typically have small diameters relative to their size — a property known as the "small-world" effect. The human PPI network, with roughly 20,000 nodes, has an estimated diameter of only 10-12, meaning any two human proteins are separated by at most about a dozen interaction steps.

Clustering Coefficient

The clustering coefficient of a node \(v\) measures the fraction of \(v\)'s neighbors that are also connected to each other. It captures the tendency for networks to form tightly-knit groups or "cliques."

\[C(v) = \frac{2 \cdot |\{(u,w) \in E : u, w \in N(v)\}|}{\text{deg}(v) \cdot (\text{deg}(v) - 1)}\]

where \(N(v)\) is the set of neighbors of \(v\). The global clustering coefficient is the average over all nodes. Biological networks typically have clustering coefficients much higher than random graphs of the same size and density — evidence that biology is organized into functional modules such as protein complexes and metabolic pathways.

Connected Components

A connected component of an undirected graph is a maximal set of nodes such that there exists a path between every pair of nodes in the set. Most biological networks have one large connected component (the "giant component") containing the majority of nodes, plus several small isolated components. In a directed graph, strongly connected components are maximal subsets where every node is reachable from every other node following edge directions. Strongly connected components in gene regulatory networks can indicate feedback loops and oscillatory circuits.

Centrality Measures

Centrality measures quantify the importance or influence of individual nodes within a graph. Different centrality definitions capture different notions of "importance," and choosing the right measure depends on the biological question.

Degree Centrality

Degree centrality is the simplest measure — it is simply the degree of a node normalized by the maximum possible degree:

\[C_D(v) = \frac{\text{deg}(v)}{|V| - 1}\]

In a PPI network, high-degree proteins (called "hubs") tend to be essential genes. Deletion of hub proteins in yeast is more likely to be lethal than deletion of peripheral proteins, a finding known as the "centrality-lethality" hypothesis.

Betweenness Centrality

Betweenness centrality measures how often a node lies on the shortest path between other pairs of nodes:

\[C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}\]

where \(\sigma_{st}\) is the total number of shortest paths from \(s\) to \(t\), and \(\sigma_{st}(v)\) is the number of those paths that pass through \(v\). Nodes with high betweenness centrality serve as bridges or bottlenecks in the network. In a metabolic network, a metabolite with high betweenness might be a critical intermediate that connects otherwise separate pathways — a potential drug target because disrupting it would fragment metabolic flow.

Closeness Centrality

Closeness centrality measures how close a node is to all other nodes, defined as the reciprocal of the average shortest-path distance:

\[C_C(v) = \frac{|V| - 1}{\sum_{u \neq v} d(v, u)}\]

Nodes with high closeness centrality can reach other nodes quickly. In a signaling network, a protein with high closeness can propagate a signal to the rest of the network in fewer steps.

Eigenvector Centrality

Eigenvector centrality extends degree centrality by accounting for the importance of a node's neighbors. A node is important if it is connected to other important nodes. Formally, the eigenvector centrality \(x_v\) of node \(v\) is proportional to the sum of centralities of its neighbors:

\[x_v = \frac{1}{\lambda} \sum_{u \in N(v)} x_u\]

where \(\lambda\) is the largest eigenvalue of the adjacency matrix \(A\). In matrix notation, the centrality vector \(\mathbf{x}\) satisfies \(A\mathbf{x} = \lambda \mathbf{x}\). This connects graph centrality to linear algebra — the centrality scores are the components of the dominant eigenvector of the adjacency matrix.

PageRank

PageRank is a variant of eigenvector centrality originally developed by Larry Page and Sergey Brin for ranking web pages. It works on directed graphs and includes a "damping factor" \(d\) (typically 0.85) that models random jumps:

\[PR(v) = \frac{1-d}{|V|} + d \sum_{u \in N^{-}(v)} \frac{PR(u)}{\text{deg}^{+}(u)}\]

In bioinformatics, PageRank has been applied to prioritize disease genes in gene-disease association networks, rank proteins by functional importance in directed signaling networks, and identify key metabolites in flux-balance models.

What's the Link?

Each centrality measure answers a different biological question. Degree centrality asks "who has the most partners?" Betweenness asks "who controls the flow?" Closeness asks "who can spread signals fastest?" Eigenvector centrality asks "who knows the most important people?" Always choose the measure that matches your biological hypothesis.

Centrality Measure	What It Captures	Biological Example
Degree	Number of direct connections	Hub proteins in PPI networks
Betweenness	Control over information flow	Metabolic bottleneck enzymes
Closeness	Efficiency of signal propagation	Central signaling proteins
Eigenvector	Influence through high-status neighbors	Core members of essential complexes
PageRank	Directional influence propagation	Disease gene prioritization

Graph Traversal Algorithms

Graph traversal refers to the systematic process of visiting every node in a graph exactly once. The two fundamental traversal strategies are breadth-first search and depth-first search, and they underpin many more sophisticated graph algorithms.

Breadth-First Search (BFS)

Breadth-first search explores all neighbors of the current node before moving to the next level. Starting from a source node \(s\), BFS visits all nodes at distance 1, then all nodes at distance 2, and so on. It uses a queue (first-in, first-out) to maintain the exploration frontier.

BFS is the standard algorithm for computing shortest paths in unweighted graphs. In bioinformatics, BFS can determine the minimum number of interaction steps between two proteins, identify all proteins within a given radius of a disease-associated protein, or enumerate connected components.

The time complexity of BFS is \(O(|V| + |E|)\), which is optimal since every node and edge must be examined at least once.

Depth-First Search (DFS)

Depth-first search explores as far as possible along each branch before backtracking. Starting from a source node, DFS follows a path until it reaches a dead end, then backtracks to the most recent unexplored branch. It uses a stack (last-in, first-out) — either an explicit stack data structure or the implicit call stack of a recursive implementation.

DFS is used for topological sorting of directed acyclic graphs, detecting cycles in regulatory networks, and finding strongly connected components (via Tarjan's or Kosaraju's algorithm). Its time complexity is also \(O(|V| + |E|)\).

Diagram: BFS vs DFS Traversal Comparison

View BFS vs DFS Traversal MicroSim Fullscreen

About this MicroSim

- sim-id: bfs-vs-dfs - Library: p5.js - Status: Specified This side-by-side animation demonstrates BFS (left) and DFS (right) on the same graph. Nodes light up in traversal order so students can see how BFS expands layer by layer while DFS dives deep before backtracking. Controls allow stepping through one node at a time or running the animation continuously.

Shortest Path Algorithms

Finding the shortest path between two nodes is one of the most fundamental graph problems. In unweighted graphs, BFS solves this in \(O(|V| + |E|)\) time. For weighted graphs, specialized algorithms are required.

Dijkstra's Algorithm

The Dijkstra algorithm finds the shortest path from a single source node to all other nodes in a weighted graph with non-negative edge weights. It maintains a priority queue of nodes sorted by their tentative distance from the source, and at each step it extracts the node with the smallest tentative distance and relaxes its outgoing edges.

The time complexity with a binary heap is \(O((|V| + |E|) \log |V|)\). In bioinformatics, Dijkstra's algorithm is used to find the most reliable interaction path between two proteins (where edge weights represent interaction confidence), the shortest metabolic conversion route between two compounds, or the minimum evolutionary distance between species in a phylogenetic network.

Watch Out for Negative Weights!

Dijkstra's algorithm does not work correctly with negative edge weights. If your biological network uses negative weights — for example, log-transformed probabilities where higher confidence yields more negative values — you must use the Bellman-Ford algorithm instead, which handles negative weights at the cost of higher time complexity: \(O(|V| \cdot |E|)\).

Network Models

Real biological networks share characteristic structural properties that distinguish them from random graphs. Understanding these properties — and the models that generate them — is essential for interpreting biological network data and for generating realistic null models in statistical tests.

Random Graph Models and the Erdos-Renyi Model

The Erdos-Renyi model \(G(n, p)\) generates a random graph by connecting each pair of \(n\) nodes independently with probability \(p\). This model produces graphs with a binomial degree distribution that approximates a Poisson distribution for large \(n\):

\[P(k) \approx \frac{e^{-\langle k \rangle} \langle k \rangle^k}{k!}\]

where \(\langle k \rangle = p(n-1)\) is the expected degree. The Erdos-Renyi model serves as the simplest null model in network biology. When researchers discover a structural property in a biological network (such as high clustering), they compare it to the expected value in an Erdos-Renyi random graph of the same size and density to determine whether the property is statistically significant.

Key properties of Erdos-Renyi random graphs include:

A giant connected component emerges when \(p > 1/n\)
The degree distribution is homogeneous (most nodes have similar degree)
The clustering coefficient equals \(p\), which is typically low
The diameter scales as \(O(\log n)\)

Scale-Free Networks and Power-Law Distributions

Many biological networks exhibit a power-law degree distribution:

\[P(k) \sim k^{-\gamma}\]

where \(\gamma\) is typically between 2 and 3. This means most nodes have very few connections while a small number of "hubs" have extremely many. Networks with this property are called scale-free networks because the degree distribution has no characteristic scale — it looks the same at every level of magnification on a log-log plot.

Scale-free networks have important biological implications. They are robust to random node failure (removing a random protein rarely disconnects the network) but vulnerable to targeted attack on hubs (removing hub proteins can fragment the network). This "robust yet fragile" property has consequences for drug targeting strategies — attacking hub proteins in a pathogen's PPI network may be an effective way to disrupt its biology.

The Barabasi-Albert Model

The Barabasi-Albert model explains how scale-free networks arise through two mechanisms: growth (new nodes are continuously added) and preferential attachment (new nodes are more likely to connect to existing nodes that already have many connections). The model starts with a small seed graph of \(m_0\) nodes, and at each time step a new node is added with \(m\) edges. The probability that a new edge connects to existing node \(v\) is proportional to \(v\)'s current degree:

\[P(v) = \frac{\text{deg}(v)}{\sum_{u \in V} \text{deg}(u)}\]

This "rich get richer" mechanism produces power-law degree distributions with \(\gamma = 3\). In biological terms, preferential attachment models the observation that well-studied, highly connected proteins tend to acquire new interaction partners faster than peripheral proteins — partly because of genuine biological function (hub proteins participate in many processes) and partly because of study bias (well-known proteins are more frequently tested for interactions).

Small-World Networks

Small-world networks combine two properties: high clustering (like a regular lattice) and short average path length (like a random graph). The Watts-Strogatz model generates small-world networks by starting with a regular ring lattice and randomly rewiring a fraction of edges. Real biological networks are typically both scale-free and small-world, exhibiting power-law degree distributions, high clustering, and short diameters simultaneously.

The small-world property has biological significance. In a metabolic network with the small-world property, any metabolite can be converted to any other metabolite through a small number of enzymatic steps. In a neural network, signals can propagate from any sensory neuron to any motor neuron through a small number of synaptic connections. This efficient information transfer may be a result of evolutionary optimization.

Diagram: Network Model Comparison

View Network Models Comparison MicroSim Fullscreen

About this MicroSim

- sim-id: network-models - Library: p5.js - Status: Specified This interactive visualization generates and displays three network types side-by-side: Erdos-Renyi random graph, Barabasi-Albert scale-free network, and Watts-Strogatz small-world network. Each panel shows the graph layout along with its degree distribution plotted on a log-log scale. Sliders control the parameters of each model.

Network Motifs and Graph Isomorphism

Network Motifs

Network motifs are small subgraph patterns that occur significantly more often in a real network than in randomized versions of the same network. They are the "building blocks" of complex networks. In gene regulatory networks, the most common motifs include:

Feed-forward loop — a three-node pattern where node A regulates B and C, and B also regulates C
Bi-fan — two regulators that both control two target genes
Single-input module — one regulator controlling multiple targets

Motifs are detected by exhaustively enumerating all subgraphs of a given size (typically 3 or 4 nodes), counting their frequencies, and comparing those frequencies to a null distribution obtained from randomized networks that preserve the degree sequence.

Graph Isomorphism

Two graphs \(G_1 = (V_1, E_1)\) and \(G_2 = (V_2, E_2)\) are isomorphic if there exists a bijection \(f: V_1 \rightarrow V_2\) such that \((u, v) \in E_1\) if and only if \((f(u), f(v)) \in E_2\). In other words, the two graphs have identical structure — they differ only in how the nodes are labeled. Graph isomorphism testing is the computational core of motif detection: to count how many times a particular motif appears, the algorithm must determine whether each extracted subgraph matches the motif pattern.

The general graph isomorphism problem has no known polynomial-time algorithm, but for the small subgraphs used in motif analysis (3-5 nodes), the search space is manageable. Tools such as mfinder, FANMOD, and the igraph library implement efficient subgraph enumeration and isomorphism testing for biological network motif analysis.

What's the Link?

Network motifs bridge graph theory and systems biology. The feed-forward loop motif, for example, functions as a noise filter in gene regulation — the target gene C only turns on when both regulators A and B send a sustained signal, filtering out transient fluctuations. Structure predicts function!

Putting It All Together: A Biological Network Analysis Workflow

A typical graph-based analysis in bioinformatics follows a common pipeline. This workflow integrates many of the concepts from this chapter.

Obtain interaction data — download edges from databases such as STRING, BioGRID, or IntAct (edge list representation)
Construct the graph — load edges into an adjacency list or graph database (labeled property graph for heterogeneous data)
Compute global properties — calculate density, diameter, degree distribution, and clustering coefficient
Identify important nodes — compute centrality measures (degree, betweenness, closeness, eigenvector, PageRank)
Detect modules — find connected components, strongly connected components, and communities
Search for motifs — enumerate network motifs and test for statistical enrichment
Compare to null models — generate Erdos-Renyi or degree-preserving random graphs and compare observed properties
Extract subnetworks — focus on biologically relevant subgraphs for detailed analysis

This pipeline transforms raw interaction data into biological insights — identifying drug targets, predicting gene function, and discovering disease mechanisms.

Congratulations, Investigators!

You have built a complete graph theory toolkit — from the basic vocabulary of nodes and edges all the way to network models and motif analysis. These fundamentals will power every biological network analysis in the chapters ahead. Now go forth and connect the dots!

Key Takeaways

A graph \(G = (V, E)\) consists of nodes (vertices) and edges (links); this simple abstraction captures the structure of PPI networks, regulatory circuits, metabolic pathways, and knowledge graphs.
Graphs can be directed or undirected, weighted or unweighted, bipartite, or enriched with labels and properties. Choosing the right graph type depends on the biology being modeled.
Labeled property graphs support heterogeneous node and edge types with key-value properties, making them ideal for biomedical knowledge graphs.
Multigraphs allow multiple edges between nodes; hypergraphs allow edges connecting more than two nodes (useful for protein complexes).
The adjacency matrix supports fast edge lookup but is memory-intensive; edge lists and adjacency lists are more practical for sparse biological networks.
Degree distribution, density, diameter, and clustering coefficient are the core global properties reported in network biology studies.
Centrality measures — degree, betweenness, closeness, eigenvector, and PageRank — each capture a different dimension of node importance.
BFS and DFS are the two fundamental traversal algorithms; Dijkstra's algorithm extends shortest-path computation to weighted graphs.
Most biological networks are scale-free (power-law degree distribution) and small-world (high clustering, short diameter).
The Erdos-Renyi model generates random graphs for null hypothesis testing; the Barabasi-Albert model explains scale-free topology via preferential attachment.
Network motifs are recurring subgraph patterns that serve as functional building blocks in biological networks.

See Annotated References