Genomics, Sequencing, and Bioinformatics
Summary
This chapter covers the technologies and computational methods that power modern genomics, from sequencing platforms to bioinformatics analysis pipelines. Students learn about alignment algorithms, variant calling, file formats, genomic databases, reproducible workflows, and RNA-seq analysis. After completing this chapter, students will understand how to process, analyze, and interpret genomic data using standard bioinformatics tools.
Concepts Covered
This chapter covers the following 41 concepts from the learning graph:
- Genome Sequencing
- Sanger Sequencing
- Next-Gen Sequencing
- Illumina Sequencing
- Long-Read Sequencing
- Whole Genome Sequencing
- Whole Exome Sequencing
- Targeted Sequencing
- Sequence Alignment
- BLAST Algorithm
- Pairwise Alignment
- Multiple Sequence Alignment
- Genome Annotation
- Gene Prediction
- Functional Annotation
- Gene Ontology
- Variant Calling
- VCF File Format
- Variant Annotation
- Variant Classification
- Benign Variant
- Pathogenic Variant
- Variant of Uncertain Significance
- FASTA File Format
- FASTQ File Format
- BAM File Format
- BED File Format
- Genomic Databases
- NCBI Database
- Ensembl Database
- UCSC Genome Browser
- ClinVar Database
- dbSNP Database
- Reproducible Workflows
- Pipeline Automation
- Version Control in Genomics
- Computational Workflow
- RNA-Seq Analysis
- Differential Expression
- Pathway Enrichment
- Data Interpretation
Prerequisites
This chapter builds on concepts from:
- Chapter 1: Genetic Inference and Probabilistic Reasoning
- Chapter 5: Genetic Variation and Genome Diversity
- Chapter 11: Gene Regulation and Regulatory Networks
- Chapter 13: Experimental Genetics and Model Organisms
Welcome to the Digital Side of Genetics!
Welcome back, fellow investigators! Today we trade the lab bench for the computer screen. Modern genetics generates enormous amounts of sequence data, and making sense of it all requires computational tools. Let's look at the evidence -- one nucleotide at a time!
Introduction: The Genomic Revolution
Genome sequencing -- the process of determining the complete DNA sequence of an organism's genome -- has transformed biology. The Human Genome Project, completed in 2003, took 13 years and roughly three billion dollars. Today, a human genome can be sequenced in hours for under a thousand dollars. This dramatic reduction in cost and time has made sequencing a routine tool in research and clinical practice.
This chapter traces the path from raw DNA in a tube to biological insight on a screen. We begin with the sequencing technologies that generate data, then examine the computational methods that process and interpret that data, and finally explore the databases and workflows that make genomic analysis reproducible and shareable.
Sequencing Technologies
Sanger Sequencing: The Foundation
Sanger sequencing, developed by Frederick Sanger in 1977, was the first practical method for reading DNA sequences. The method uses dideoxynucleotides (ddNTPs) -- modified nucleotides that lack the 3' hydroxyl group needed for chain elongation. When a ddNTP is randomly incorporated during DNA synthesis, the growing chain terminates at that position.
The reaction produces fragments of every possible length, each ending at a known base. Modern automated Sanger sequencers separate these fragments by capillary electrophoresis and read the sequence from fluorescently labeled terminators. A single Sanger sequencing run produces 700-1,000 bases of high-quality sequence.
Sanger sequencing remains the gold standard for:
- Confirming specific mutations identified by other methods
- Sequencing individual PCR products
- Validating CRISPR edits and cloned constructs
- Reading short stretches with very high accuracy (99.999%)
Next-Generation Sequencing
Next-generation sequencing (NGS) technologies perform millions of sequencing reactions simultaneously, generating gigabases of data in a single run. Unlike Sanger sequencing, which reads one fragment at a time, NGS platforms massively parallelize the process.
Illumina sequencing is the most widely used NGS platform. It works through sequencing by synthesis:
- DNA is fragmented and adapters are ligated to the ends
- Fragments are amplified in clusters on a flow cell surface
- Fluorescently labeled nucleotides are added one at a time
- After each incorporation, the machine images the entire flow cell
- The color at each cluster position reveals the incorporated base
- The process repeats for 150-300 cycles, producing reads of 150-300 bases
Illumina sequencing generates extremely high throughput (hundreds of gigabases per run) with low per-base error rates (~0.1%), but read lengths are relatively short.
Long-read sequencing platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies produce reads of 10,000 to over 100,000 bases. PacBio uses single-molecule real-time (SMRT) sequencing, where a polymerase synthesizes DNA in a tiny well and fluorescent pulses from each incorporated nucleotide are detected in real time. Oxford Nanopore threads single DNA strands through a protein pore and measures changes in electrical current as each base passes through.
Long reads are essential for:
- Assembling repetitive genomic regions
- Detecting structural variants
- Phasing haplotypes
- Assembling complete genomes without gaps
| Technology | Read Length | Throughput per Run | Error Rate | Cost per Gb | Best For |
|---|---|---|---|---|---|
| Sanger | 700-1,000 bp | ~1 kb per reaction | 0.001% | Very high | Validation, single targets |
| Illumina | 150-300 bp | 100-6,000 Gb | 0.1% | Low | Variant calling, RNA-seq |
| PacBio HiFi | 10-25 kb | 30-100 Gb | 0.1% (HiFi) | Moderate | Genome assembly, structural variants |
| Oxford Nanopore | 10-100+ kb | 50-300 Gb | 1-5% | Moderate | Real-time sequencing, field work |
Sequencing Strategies
Not every experiment requires sequencing an entire genome. Different research questions call for different sequencing strategies.
Whole genome sequencing (WGS) determines the complete DNA sequence of an organism. WGS captures all variation -- coding, non-coding, regulatory, and structural -- but generates large datasets and requires substantial computational resources.
Whole exome sequencing (WES) targets only the protein-coding regions (exons), which comprise roughly 1-2% of the human genome but harbor about 85% of disease-causing mutations. WES uses hybridization capture probes to enrich exonic DNA before sequencing, reducing both cost and data volume compared to WGS.
Targeted sequencing focuses on specific genomic regions of interest, such as a panel of known disease genes. Clinical gene panels for conditions like hereditary cancer or cardiac arrhythmias typically sequence 50-500 genes at very high depth, providing high sensitivity for detecting variants in those regions.
Choosing a Sequencing Strategy
What does the data tell us? The choice between WGS, WES, and targeted sequencing depends on your question. Looking for a known disease mutation? A targeted panel is fastest and cheapest. Searching for novel non-coding regulatory variants? Only WGS will find them. Always match the sequencing strategy to the biological question.
Sequence Alignment
Once sequencing reads are generated, they must be mapped back to a reference genome or compared to other sequences. Sequence alignment is the computational process of determining where each read originated in the genome, or how two or more sequences correspond to each other.
Pairwise and Multiple Sequence Alignment
Pairwise alignment compares two sequences to find the best correspondence between them. Two classic algorithms accomplish this: the Needleman-Wunsch algorithm for global alignment (aligning sequences end-to-end) and the Smith-Waterman algorithm for local alignment (finding the best matching subsequence).
Both algorithms use dynamic programming to fill a scoring matrix. Each cell in the matrix represents the best alignment score up to that position, considering matches, mismatches, and gaps. The optimal alignment is found by tracing back through the matrix from the highest-scoring cell.
The scoring for a pairwise alignment uses a substitution matrix and gap penalties. For protein sequences, the BLOSUM62 matrix assigns positive scores to biochemically similar amino acid substitutions and negative scores to dissimilar ones. A simple scoring scheme for DNA might be \( +1 \) for a match and \( -1 \) for a mismatch, with a gap penalty of \( -2 \) per gap position.
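To make the dynamic programming concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment using the match/mismatch/gap values above. It computes only the optimal score; the traceback that recovers the alignment itself is omitted for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (minimal sketch)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap              # a[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap              # gaps aligned to b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCT"))
```

Smith-Waterman local alignment differs only in that each cell's score is floored at zero and the traceback starts from the matrix maximum rather than the corner.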
Multiple sequence alignment (MSA) extends pairwise alignment to three or more sequences simultaneously. MSA is essential for evolutionary analysis, identifying conserved functional domains, and building phylogenetic trees. Programs like Clustal Omega, MUSCLE, and MAFFT use heuristic approaches because exact multiple alignment is computationally prohibitive for more than a few sequences.
The BLAST Algorithm
The BLAST algorithm (Basic Local Alignment Search Tool) is the most widely used bioinformatics tool in the world. BLAST searches a database of sequences to find those similar to a query sequence. Unlike Smith-Waterman, BLAST uses a heuristic approach that sacrifices some sensitivity for enormous gains in speed.
BLAST works in three steps:
- Seeding: The query is broken into short "words" (typically 11 nucleotides or 3 amino acids), and exact matches are found in the database
- Extension: Each seed match is extended in both directions, keeping track of the alignment score
- Evaluation: The statistical significance of each alignment is assessed using an E-value -- the expected number of alignments with that score or better that would occur by chance in a database of that size
An E-value of \( 10^{-50} \) means the expected number of equally good chance matches in a database of that size is \( 10^{-50} \) -- effectively zero, and strong evidence of true homology. An E-value near 1 suggests the match could be coincidental.
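Much of BLAST's speed comes from the seeding step, which turns alignment into a hash-table lookup. A toy Python sketch of word indexing -- an illustration of the idea, not the actual BLAST implementation:

```python
from collections import defaultdict

def build_word_index(database_seq, w=11):
    """Map every length-w word in the database sequence to its positions."""
    index = defaultdict(list)
    for i in range(len(database_seq) - w + 1):
        index[database_seq[i:i + w]].append(i)
    return index

def find_seeds(query, index, w=11):
    """Return (query_pos, db_pos) pairs where an exact w-mer matches."""
    return [(q, d)
            for q in range(len(query) - w + 1)
            for d in index.get(query[q:q + w], [])]

index = build_word_index("ACGTACGTGGTCACGTACGTGGTCAAAGT")
print(find_seeds("TACGTGGTCAC", index))   # [(0, 3)]: seed for extension
```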
Diagram: BLAST Algorithm Workflow
BLAST Algorithm Workflow
Type: flowchart
sim-id: blast-algorithm-workflow
Library: Mermaid
Status: Specified
A flowchart showing the three phases of the BLAST algorithm: (1) Seeding phase -- query broken into k-mer words, exact matches found in database index; (2) Extension phase -- seed matches extended bidirectionally using dynamic programming, low-scoring extensions dropped; (3) Evaluation phase -- surviving alignments scored, E-values calculated, results ranked. Include example input (query sequence) and output (ranked hit list with E-values).
Genomic File Formats
Bioinformatics relies on standardized file formats to store and exchange data. Understanding these formats is essential for working with genomic data.
Sequence Formats
The FASTA file format stores nucleotide or protein sequences in a simple text format. Each entry begins with a header line starting with > followed by the sequence identifier and optional description. The sequence follows on subsequent lines:
```text
>example_gene_1 hypothetical partial coding sequence
ATGGCGTACCTTGAAAGCCTGAAGGTTCGAACCGGTAGCTGA
ACCGTTGCAATGCCCTGA
```
The FASTQ file format extends FASTA by adding quality scores for each base. Each read occupies four lines: a header line (starting with @), the sequence, a separator line (+), and the quality string. Quality scores are encoded as ASCII characters (conventionally offset by 33, the Phred+33 encoding), where each character represents the Phred quality score

\( Q = -10 \log_{10} P \)

where \( P \) is the estimated probability that the base call is wrong. A Phred score of 30 (ASCII character ?) means a 1 in 1,000 chance of error (99.9% accuracy). A score of 40 (character I) means 1 in 10,000 (99.99% accuracy).
```text
@read_001 hypothetical Illumina read
ACGTACGTTAGCCATGGCAT
+
IIIIIIIIIIIIIIII?III
```
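Reading FASTQ and decoding qualities takes only a few lines. A minimal Python sketch, assuming the standard Phred+33 encoding; the file name is hypothetical:

```python
def read_fastq(path):
    """Yield (name, sequence, quality_scores) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().strip()
            if not header:
                break                          # end of file
            seq = fh.readline().strip()
            fh.readline()                      # '+' separator line
            qual = fh.readline().strip()
            scores = [ord(c) - 33 for c in qual]   # Phred+33 decoding
            yield header[1:], seq, scores

for name, seq, scores in read_fastq("reads.fastq"):
    # Phred definition: error probability P = 10^(-Q/10)
    worst = min(scores)
    print(name, len(seq), "bp, worst base error prob:", 10 ** (-worst / 10))
```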
Alignment and Annotation Formats
The BAM file format (Binary Alignment Map) stores the results of aligning sequencing reads to a reference genome. BAM is the compressed binary version of the SAM (Sequence Alignment Map) format. Each record in a BAM file contains the read sequence, its quality scores, the reference position where it aligned, the alignment quality (mapping quality), and a CIGAR string that describes matches, mismatches, insertions, and deletions.
The BED file format (Browser Extensible Data) stores genomic coordinate ranges in a simple tab-delimited format. Each line specifies a chromosome, start position, end position, and optional additional fields like name, score, and strand. BED files are used to define genomic regions of interest, such as exons, enhancers, or ChIP-seq peaks.
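BED coordinates are 0-based and half-open (the start base is included, the end base is not), which makes interval arithmetic simple. A small Python sketch with invented intervals:

```python
def read_bed(path):
    """Parse the three required BED columns: chrom, start, end."""
    with open(path) as fh:
        for line in fh:
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            yield chrom, int(start), int(end)

def overlaps(a, b):
    """True if two (chrom, start, end) intervals overlap (half-open)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

exon = ("chr7", 117559500, 117559700)    # invented coordinates
peak = ("chr7", 117559650, 117559900)
print(overlaps(exon, peak))              # True: they share 50 bp
```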
| Format | Extension | Content | Text/Binary | Typical Size (Human WGS) |
|---|---|---|---|---|
| FASTA | .fa, .fasta | Reference sequences | Text | 3 GB (reference genome) |
| FASTQ | .fq, .fastq | Reads + quality scores | Text | 50-200 GB |
| BAM | .bam | Aligned reads | Binary | 30-100 GB |
| BED | .bed | Genomic intervals | Text | KB to MB |
| VCF | .vcf | Variant calls | Text | 100 MB - 1 GB |
File Format Tip
Think of these file formats as stages in a pipeline. Raw reads come off the sequencer as FASTQ files. Alignment maps those reads to a reference, producing BAM files. Variant calling compares the aligned reads to the reference, producing VCF files. Each format captures a different stage of the analysis.
Genome Annotation
Genome annotation is the process of identifying the functional elements in a genome -- genes, regulatory regions, non-coding RNAs, and other features. Annotation transforms a raw sequence of nucleotides into a meaningful biological map.
Gene Prediction
Gene prediction uses computational methods to identify protein-coding genes within a genome sequence. Two main approaches are used:
- Ab initio prediction uses statistical models of gene structure (splice sites, start codons, codon usage) to identify likely genes directly from the sequence. Programs like Augustus and GenScan use hidden Markov models trained on known gene structures.
- Homology-based prediction aligns known genes or proteins from related organisms to the target genome. If a region of the genome aligns well to a known gene, it likely encodes a similar gene. Programs like Exonerate and GeneWise perform spliced alignment of proteins to genomic DNA.
Modern annotation pipelines combine both approaches with evidence from RNA-seq data (which directly reveals expressed transcripts) to produce comprehensive gene models.
Functional Annotation and Gene Ontology
Functional annotation assigns biological roles to predicted genes. This involves comparing predicted protein sequences to databases of characterized proteins, identifying conserved domains, and predicting subcellular localization, enzymatic activity, and interaction partners.
The Gene Ontology (GO) provides a standardized vocabulary for describing gene function across species. GO organizes functional descriptions into three hierarchical categories:
- Molecular Function -- what the gene product does biochemically (e.g., "protein kinase activity")
- Biological Process -- the larger process the gene contributes to (e.g., "signal transduction")
- Cellular Component -- where the gene product is located (e.g., "nucleus")
Each GO term has a unique identifier (e.g., GO:0006915 for "apoptotic process") and is connected to broader and narrower terms in a directed acyclic graph. GO annotations are supported by evidence codes that indicate how the annotation was determined -- experimentally, computationally, or by curator inference.
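Because GO is a directed acyclic graph, annotating a gene with a term implicitly annotates it with every ancestor of that term. A toy Python sketch of ancestor propagation over a tiny hand-written graph fragment -- the parent edges here are illustrative, not pulled from the live ontology:

```python
# Illustrative parent edges (each term may have multiple parents in real GO).
parents = {
    "GO:0006915": ["GO:0012501"],   # apoptotic process -> programmed cell death
    "GO:0012501": ["GO:0008219"],   # programmed cell death -> cell death
    "GO:0008219": [],
}

def ancestors(term, graph):
    """All terms reachable by following parent edges upward (DAG traversal)."""
    seen, stack = set(), [term]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("GO:0006915", parents))   # {'GO:0012501', 'GO:0008219'}
```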
Variant Calling and Classification
Variant calling is the process of identifying positions where a sequenced genome differs from the reference. This is the critical step that connects sequencing data to biological and clinical meaning.
The variant calling pipeline proceeds as follows:
- Align reads to the reference genome (producing a BAM file)
- Identify positions where the aligned reads differ from the reference
- Distinguish true variants from sequencing errors using statistical models
- Filter variants based on quality metrics (read depth, mapping quality, strand bias)
- Output variants in a standardized format
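Step 3 above -- separating true variants from sequencing errors -- is at heart a probability question: if a site were truly reference, how surprising would the observed alternate reads be? A toy sketch with a simple binomial model (real callers such as GATK use far more sophisticated genotype likelihood models):

```python
from scipy.stats import binom

depth, alt_reads, error_rate = 45, 22, 0.001   # invented numbers
# P(observing >= alt_reads alternate calls from sequencing errors alone)
p_error = binom.sf(alt_reads - 1, depth, error_rate)
print(f"P(>= {alt_reads}/{depth} alt reads by error) = {p_error:.1e}")
```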
The VCF File Format
The VCF file format (Variant Call Format) is the standard for recording genomic variants. Each variant record contains the chromosome, position, reference allele, alternate allele(s), quality score, filter status, and additional annotations. The VCF header defines metadata and the INFO fields used in the file.
A simplified VCF record looks like:
```text
#CHROM  POS        ID  REF  ALT  QUAL  FILTER  INFO   FORMAT  SAMPLE1
7       117559593  .   A    T    60    PASS    DP=45  GT:AD   0/1:23,22
```
This record indicates a heterozygous A-to-T variant at position 117,559,593 on chromosome 7, supported by 45 reads, with the alternate allele present in approximately 50% of reads.
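A quick Python sketch that parses this record and recovers the allele fraction from the AD (allelic depth) values, assuming the GT:AD FORMAT shown above:

```python
line = "7\t117559593\t.\tA\tT\t60\tPASS\tDP=45\tGT:AD\t0/1:23,22"
fields = line.split("\t")
chrom, pos, _, ref, alt = fields[:5]
keys = fields[8].split(":")                     # FORMAT, e.g. ['GT', 'AD']
sample = dict(zip(keys, fields[9].split(":")))  # first sample's values
ref_depth, alt_depth = map(int, sample["AD"].split(","))
af = alt_depth / (ref_depth + alt_depth)
print(f"{chrom}:{pos} {ref}>{alt} genotype={sample['GT']} alt fraction={af:.2f}")
```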
Variant Annotation and Classification
Variant annotation adds biological context to each variant call -- determining whether it falls within a gene, what effect it has on the encoded protein, and whether it has been seen before in databases. Tools like VEP (Variant Effect Predictor), SnpEff, and ANNOVAR perform this annotation.
Variant classification is the process of assessing whether a variant is likely to cause disease. The American College of Medical Genetics and Genomics (ACMG) guidelines define five categories:
- Pathogenic -- the variant is disease-causing, supported by strong evidence
- Likely pathogenic -- strong evidence suggests disease causation but some uncertainty remains
- Variant of uncertain significance (VUS) -- insufficient evidence to classify as pathogenic or benign
- Likely benign -- strong evidence suggests the variant is not disease-causing
- Benign -- the variant does not cause disease, supported by strong evidence
A pathogenic variant typically has multiple lines of supporting evidence: it is rare in the population, it disrupts a known disease gene, functional studies show it impairs protein function, and it co-segregates with disease in families.
The variant of uncertain significance category is particularly challenging in clinical genomics. As more population data and functional studies accumulate, many VUS will eventually be reclassified as either pathogenic or benign, but until then they cannot guide clinical decisions.
The VUS Challenge
Variants of uncertain significance are one of the most difficult challenges in clinical genomics. Reporting a VUS to a patient requires careful communication -- it does not mean the variant is harmless, and it does not mean it causes disease. As databases grow and functional assays improve, many VUS will be reclassified, but this process can take years.
Genomic Databases
Genomic databases store, organize, and provide access to the vast amounts of sequence, annotation, and variation data generated by the genomics community. Knowing which database to query -- and how to interpret the results -- is a core bioinformatics skill.
Major Databases
The NCBI database (National Center for Biotechnology Information) hosts a suite of interconnected databases including GenBank (nucleotide sequences), RefSeq (curated reference sequences), PubMed (literature), and the Sequence Read Archive (raw sequencing data). NCBI's Entrez system links records across these databases.
The Ensembl database, maintained by the European Bioinformatics Institute, provides comprehensive genome annotation for vertebrate and other eukaryotic genomes. Ensembl is particularly strong for comparative genomics, regulatory annotation, and variation data.
The UCSC Genome Browser is an interactive web tool for visualizing genomic data. Researchers can view gene models, conservation tracks, regulatory elements, and custom data tracks aligned to a reference genome. The UCSC browser's track hub system allows labs to share their own data as public or private tracks.
For clinical genomics, two specialized databases are essential:
- ClinVar database collects relationships between human genetic variants and clinical phenotypes. Each submission includes the variant, the associated condition, and the submitter's classification (pathogenic, benign, VUS, etc.).
- dbSNP database catalogs short sequence variations (SNPs and small indels) across species. Each variant receives an rs identifier (e.g., rs113488022) that serves as a universal reference.
| Database | Hosted By | Primary Content | Key Use |
|---|---|---|---|
| NCBI / GenBank | NIH (USA) | All public sequences | Sequence search, BLAST |
| Ensembl | EBI (Europe) | Annotated genomes | Comparative genomics |
| UCSC Genome Browser | UC Santa Cruz | Genome visualization | Data browsing, track analysis |
| ClinVar | NIH (USA) | Clinical variant interpretations | Clinical variant lookup |
| dbSNP | NIH (USA) | Cataloged sequence variants | Variant identification |
Diagram: Genomics Database Ecosystem
Genomics Database Ecosystem
Type: network diagram
sim-id: genomics-database-ecosystem
Library: vis-network
Status: Specified
An interactive network diagram showing major genomic databases as nodes (NCBI, Ensembl, UCSC, ClinVar, dbSNP, Gene Ontology) with edges representing data flows and cross-references between them. Node size proportional to database scope. Color-code nodes by type: sequence databases (blue), annotation databases (green), clinical databases (red), ontology databases (purple). Include click interactions to show what each database stores.
Reproducible Workflows and Pipeline Automation
Modern genomics analyses involve dozens of computational steps, each with specific software, parameters, and dependencies. Reproducible workflows ensure that an analysis can be repeated exactly, by the same researcher months later or by a different lab entirely.
Why Reproducibility Matters
A typical variant calling pipeline might include:
- Quality control of raw reads (FastQC, Trimmomatic)
- Alignment to the reference genome (BWA-MEM, STAR)
- Post-alignment processing (duplicate marking, base quality recalibration)
- Variant calling (GATK HaplotypeCaller, DeepVariant)
- Variant filtering and annotation (VEP, ClinVar lookup)
Each step has parameters that affect the results. Changing the alignment software, the quality filters, or even the reference genome version can change which variants are called. Without careful documentation, results become irreproducible.
Computational Workflow Management
A computational workflow formally defines the steps of an analysis, their inputs and outputs, and their dependencies. Workflow management systems like Snakemake, Nextflow, and WDL (Workflow Description Language) allow researchers to define pipelines as code, making them:
- Reproducible -- the same workflow on the same data produces the same results
- Portable -- workflows run on laptops, clusters, or cloud platforms
- Scalable -- workflows automatically parallelize across available resources
- Documentable -- the workflow definition itself is the protocol
Pipeline automation uses these systems to run multi-step analyses without manual intervention. A well-designed pipeline checks input data quality, logs every step, handles errors gracefully, and produces standardized output.
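The core trick behind these systems -- re-run a step only when its output is missing or older than its inputs -- can be sketched in plain Python. The commands and file names below are hypothetical, and a real pipeline should use a proper workflow manager:

```python
import os
import subprocess

def run_step(cmd, inputs, output):
    """Run a pipeline step only if its output is missing or stale."""
    if os.path.exists(output) and all(
        os.path.getmtime(i) <= os.path.getmtime(output) for i in inputs
    ):
        print(f"skip: {output} is up to date")
        return
    subprocess.run(cmd, shell=True, check=True)   # fail loudly on error
    print(f"done: {output}")

# Hypothetical two-step pipeline: align reads, then call variants.
run_step("bwa mem ref.fa reads.fq > aln.sam", ["ref.fa", "reads.fq"], "aln.sam")
run_step("variant_caller aln.sam > out.vcf", ["aln.sam"], "out.vcf")
```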
Version control in genomics extends the software engineering practice of tracking changes to code (using tools like Git) to the entire analysis environment -- including software versions, reference data, and configuration files. Container technologies like Docker and Singularity capture the complete computational environment, ensuring that an analysis can be reproduced even as software evolves.
RNA-Seq Analysis
RNA-seq analysis uses sequencing to measure gene expression levels across the entire transcriptome. By counting how many sequencing reads map to each gene, researchers quantify which genes are active in a sample and at what level.
The RNA-seq workflow involves:
- Extract RNA and convert to cDNA
- Fragment, add adapters, and sequence (typically Illumina)
- Align reads to the reference genome or transcriptome
- Count reads per gene
- Normalize counts to account for gene length and sequencing depth
- Compare expression between conditions
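The normalization step (step 5 above) is worth making concrete. One widely used scheme, TPM (transcripts per million), first divides each gene's count by its length and then rescales so each sample sums to one million. A minimal Python sketch with invented counts and lengths:

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize, then depth-normalize."""
    rates = [c / (l / 1_000) for c, l in zip(counts, lengths_bp)]  # reads per kb
    total = sum(rates)
    return [r / total * 1_000_000 for r in rates]

counts = [500, 500, 100]             # raw read counts for three genes
lengths = [2_000, 10_000, 1_000]     # gene lengths in bp
print(tpm(counts, lengths))          # [625000.0, 125000.0, 250000.0]
```

Note how the 10 kb gene, despite the same raw count as the 2 kb gene, receives a five-fold lower TPM once length is accounted for.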
Differential Expression
Differential expression analysis identifies genes whose expression levels differ significantly between two or more experimental conditions. Statistical tools like DESeq2 and edgeR model the count data using a negative binomial distribution and test for significant differences while controlling for multiple testing.
A gene is typically considered differentially expressed if it meets two criteria: a fold change threshold (e.g., \(\geq 2\)-fold change between conditions) and a statistical significance threshold (e.g., adjusted p-value \( < 0.05 \) after Benjamini-Hochberg correction for multiple testing).
The fold change for a gene is the ratio of its normalized expression between the two conditions, usually reported on a log scale:

\( \log_2 \text{FC} = \log_2 \left( \dfrac{\text{mean normalized counts in condition A}}{\text{mean normalized counts in condition B}} \right) \)

so a value of \( +1 \) corresponds to a two-fold increase and \( -1 \) to a two-fold decrease.
Results are often visualized as volcano plots (log fold change vs. negative log p-value) or MA plots (average expression vs. log fold change).
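A minimal Python sketch of the two thresholds: log2 fold changes from mean normalized counts, plus Benjamini-Hochberg adjustment of raw p-values. All numbers are invented, and a real analysis should use DESeq2 or edgeR, which model count variance properly:

```python
import math

def log2_fold_change(mean_a, mean_b, pseudo=1.0):
    """log2 FC with a pseudocount to avoid division by zero."""
    return math.log2((mean_a + pseudo) / (mean_b + pseudo))

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values, in the same order as the input."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted, running_min = [0.0] * n, 1.0
    for rank in range(n - 1, -1, -1):        # walk from largest p downward
        i = order[rank]
        running_min = min(running_min, pvals[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# gene: (mean counts in A, mean counts in B, raw p-value) -- invented
genes = {"geneA": (250.0, 60.0, 0.0004),
         "geneB": (80.0, 75.0, 0.03),
         "geneC": (10.0, 300.0, 0.0001)}
padj = benjamini_hochberg([v[2] for v in genes.values()])
for (name, (a, b, _)), p in zip(genes.items(), padj):
    lfc = log2_fold_change(a, b)
    hit = abs(lfc) >= 1 and p < 0.05         # |log2 FC| >= 1 means >= 2-fold
    print(f"{name}: log2FC={lfc:+.2f} padj={p:.4f} DE={hit}")
```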
Pathway Enrichment Analysis
Pathway enrichment analysis asks whether the differentially expressed genes are enriched for particular biological functions or pathways. Rather than interpreting genes one at a time, enrichment analysis identifies coordinated changes in functional categories.
The basic approach uses the Gene Ontology or pathway databases (like KEGG or Reactome) and a statistical test (typically a hypergeometric test or Fisher's exact test) to determine whether a GO term or pathway appears more often in the differentially expressed gene list than expected by chance.
For example, if 50 out of 500 differentially expressed genes are annotated with the GO term "immune response," but only 200 out of 20,000 total genes carry that annotation, the enrichment is highly significant -- the immune response pathway is specifically affected in the experimental condition.
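Running this example's numbers through SciPy's hypergeometric survival function confirms the intuition -- the expected overlap by chance is only \( 500 \times 200 / 20{,}000 = 5 \) genes:

```python
from scipy.stats import hypergeom

M, n, N, k = 20_000, 200, 500, 50   # universe, annotated, DE list, overlap
p = hypergeom.sf(k - 1, M, n, N)    # P(overlap >= k) by chance
print(f"enrichment p-value = {p:.1e}")
```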
Data Interpretation: From Numbers to Biology
Data interpretation is the final and most critical step in any bioinformatics analysis. Raw numbers -- read counts, p-values, variant calls -- must be translated into biological understanding.
Key principles for interpreting genomic data include:
- Statistical significance is not biological significance -- a gene with a tiny p-value but a 1.1-fold change may be statistically significant but biologically irrelevant
- Context matters -- a variant in a gene with no known function requires different interpretation than a variant in a well-characterized disease gene
- Replication is essential -- findings from one dataset should be validated in independent samples
- Multiple testing correction is mandatory -- when testing thousands of genes or variants simultaneously, raw p-values are misleading without correction
- Beware batch effects -- technical differences between sample processing batches can create false biological signals
Chapter Complete!
Outstanding work, fellow investigators! You now have a solid foundation in the technologies and computational methods that drive modern genomics. From the sequencer to the database, you understand how raw DNA becomes actionable knowledge. In the next chapter, we will apply these tools to human genetics and clinical medicine -- where sequencing data directly impacts patient care.