Bioinformatics Data Formats
Summary
Covers the standard file formats used throughout bioinformatics pipelines including FASTA, FASTQ, GenBank, GFF3, VCF, SAM/BAM, PDB, SBML, and BioPAX, plus data quality control and format conversion practices.
Concepts Covered
This chapter covers the following 15 concepts from the learning graph:
- FASTA Format
- FASTQ Format
- GenBank Format
- GFF3 Format
- OWL Format
- PDB File Format
- VCF Format
- SAM and BAM Format
- BED Format
- SBML Format
- BioPAX Format
- CSV for Bioinformatics
- JSON for Bioinformatics
- Data Format Conversion
- Data Quality Control
Prerequisites
This chapter builds on concepts from:
Welcome, Explorers! Let's Connect the Dots!
In Chapter 2 we explored the databases that store biological data. Now we need to understand the languages those databases speak — the file formats that encode sequences, annotations, variants, structures, and pathways. Mastering these formats is essential because every bioinformatics pipeline begins by reading data in one format and often ends by writing results in another. Let's connect the dots between raw data and the structured files that carry it!
Why Data Formats Matter
Bioinformatics is fundamentally a data science, and data must be structured before it can be analyzed. A DNA sequence is just a string of characters until it is wrapped in a format that records its name, its source organism, its quality scores, and its position in a genome. The file formats covered in this chapter are the shared conventions that allow sequencing instruments, alignment tools, variant callers, structure viewers, and pathway analysis programs to exchange information reliably.
Choosing the wrong format, or misunderstanding the one you have, leads to silent errors that can invalidate an entire analysis. A quality string decoded with the wrong encoding offset will produce incorrect variant calls. A GFF3 file parsed as GFF2 will scramble gene models. A VCF file missing its header lines will be rejected by most downstream tools. Understanding data formats is not glamorous work, but it is the foundation on which all reliable bioinformatics rests.
The formats in this chapter fall into several broad categories:
- Sequence formats — store nucleotide or amino acid sequences (FASTA, FASTQ, GenBank)
- Annotation formats — describe features on a reference sequence (GFF3, BED)
- Alignment formats — record how reads map to a reference genome (SAM, BAM)
- Variant formats — catalog genetic differences relative to a reference (VCF)
- Structure formats — represent three-dimensional atomic coordinates (PDB)
- Pathway and ontology formats — encode biological networks and controlled vocabularies (SBML, BioPAX, OWL)
- General-purpose formats — tabular and hierarchical data exchange (CSV, JSON)
| Category | Formats | Typical File Size |
|---|---|---|
| Sequence | FASTA, FASTQ, GenBank | KB to hundreds of GB |
| Annotation | GFF3, BED | KB to hundreds of MB |
| Alignment | SAM, BAM | GB to TB |
| Variant | VCF | MB to GB |
| Structure | PDB/mmCIF | KB to tens of MB |
| Pathway/Ontology | SBML, BioPAX, OWL | KB to hundreds of MB |
| General-purpose | CSV, JSON | Variable |
FASTA Format
The FASTA Format is the simplest and most widely used sequence format in bioinformatics. It originated with the FASTP and FASTA sequence alignment programs of the mid-1980s and has since become the universal plain-text format for nucleotide and amino acid sequences. Nearly every bioinformatics tool accepts FASTA input, and nearly every sequence database can export FASTA output.
A FASTA file consists of one or more sequence records. Each record has two parts: a header line beginning with > followed by a description, and one or more lines of sequence characters. There is no fixed line length, but 60 or 80 characters per line is conventional.
Key characteristics of the FASTA Format include:
- The > character must be the first character on the header line
- Everything after > on the header line is the sequence description (no enforced structure, though database-specific conventions exist)
- Sequence lines contain only valid characters: A, C, G, T, N for DNA; the 20 standard amino acid one-letter codes for protein
- Blank lines between records are permitted but not required
- The format carries no quality information — it records the sequence and nothing more
FASTA is appropriate when you need a clean, portable representation of one or more sequences. It is the input format for BLAST searches, multiple sequence alignment programs like Clustal Omega and MAFFT, and genome assembly tools. However, because FASTA lacks quality scores, it is not suitable for representing raw sequencing reads — that role belongs to FASTQ.
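The record structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an established parser: the two records in EXAMPLE_FASTA are invented for the example, and parse_fasta is a hypothetical helper name.

```python
from typing import Iterator, Tuple

# Made-up records for illustration only; not real database entries.
EXAMPLE_FASTA = """\
>seq1 hypothetical DNA fragment
ACGTACGTAC
GTNNACGT

>seq2 hypothetical peptide
MKTAYIAKQR
"""

def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are permitted
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # everything after '>' is the description
            chunks = []
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

records = list(parse_fasta(EXAMPLE_FASTA))
for header, seq in records:
    print(header.split()[0], len(seq))
```

Joining the sequence lines before returning them reflects the fact that line wrapping carries no meaning in FASTA: a sequence wrapped at 60 characters and the same sequence on one line are the same record.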
FASTQ Format
The FASTQ Format extends FASTA by adding per-base quality scores, making it the standard format for raw sequencing data from Illumina, PacBio, Oxford Nanopore, and other platforms. Most sequencing workflows deliver their reads as FASTQ files (some platforms first write instrument-specific formats that are then converted), and they are typically the first files you encounter in any genomics pipeline.
A FASTQ record contains exactly four lines:
The four lines encode:
- Header line: begins with @, contains a sequence identifier and optional description
- Sequence line: the nucleotide sequence (same characters as FASTA)
- Separator line: a + character, optionally followed by a repeat of the identifier
- Quality line: ASCII-encoded quality scores, one character per base, same length as the sequence line
The quality scores use the Phred scale, where the quality value \(Q\) relates to the probability of an incorrect base call \(P\) by:
\[ Q = -10 \cdot \log_{10}(P) \]
A Phred score of 20 means a 1 in 100 chance of error (99% accuracy), while a score of 30 means a 1 in 1,000 chance (99.9% accuracy). Modern Illumina instruments routinely produce reads with average quality scores above 30.
| Phred Score | Error Probability | Accuracy | ASCII (Phred+33) |
|---|---|---|---|
| 10 | 1 in 10 | 90% | + |
| 20 | 1 in 100 | 99% | 5 |
| 30 | 1 in 1,000 | 99.9% | ? |
| 40 | 1 in 10,000 | 99.99% | I |
The ASCII encoding adds an offset to the Phred score to produce a printable character. The current standard (Sanger/Illumina 1.8+) uses an offset of 33, so a Phred score of 30 is encoded as ASCII character 63, which is ?. Older Illumina pipelines (versions 1.3 through 1.7) used an offset of 64 — a source of many historic bugs in bioinformatics pipelines.
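To make the offset arithmetic concrete, here is a small sketch that decodes a quality string and inverts the Phred formula. The four-line record is invented for illustration, and decode_phred and error_probability are hypothetical helper names rather than functions from any library.

```python
from typing import List

# Made-up four-line record for illustration; not output from a real instrument.
EXAMPLE_FASTQ = """\
@read1 hypothetical read
ACGTN
+
II?5+
"""

def decode_phred(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to recover P."""
    return 10 ** (-q / 10)

header, sequence, separator, quality = EXAMPLE_FASTQ.splitlines()
if not (header.startswith("@") and separator.startswith("+")):
    raise ValueError("not a FASTQ record")
if len(sequence) != len(quality):
    raise ValueError("sequence and quality lengths differ")

scores = decode_phred(quality)            # Phred+33, the current standard
wrong_scores = decode_phred(quality, 64)  # what a Phred+64 reader would see
for base, q in zip(sequence, scores):
    print(base, q, error_probability(q))
```

Decoding the same string with the wrong offset shifts every score by 31, and here it even produces negative values. Real data decoded with the wrong offset does not always fail so visibly, which is why the encoding should be checked rather than assumed.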
What's the Link?
Notice how FASTQ builds on FASTA by adding exactly one new dimension — quality. This layered design is a pattern you will see throughout bioinformatics formats. Simple formats capture the essentials, and more complex formats extend them with additional metadata. When you encounter a new format, ask yourself: what information does this format add beyond what simpler formats already capture?
GenBank Format
The GenBank Format is the richly annotated flat-file format used by NCBI's GenBank database. While FASTA stores only the sequence and FASTQ adds quality scores, GenBank records contain comprehensive metadata including the organism, references to published literature, feature annotations (coding sequences, regulatory regions, variation sites), and cross-references to other databases.
A simplified GenBank record looks like this:
The major sections of a GenBank record are:
- LOCUS — sequence name, length, molecule type, topology, division, and date
- DEFINITION — a brief description of the sequence
- ACCESSION / VERSION — the stable identifier and its version number
- FEATURES — the annotation table, where each feature has a type (gene, CDS, mRNA, etc.), a location on the sequence, and qualifier key-value pairs
- ORIGIN — the nucleotide sequence itself, numbered by position
- // — the record terminator
GenBank format is essential for understanding annotated reference sequences. When you download a gene from NCBI, the GenBank record tells you not just the sequence but where the coding sequence starts and ends, what protein it encodes, and which papers first described it.
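The section layout can be illustrated with a deliberately small sketch. The record below is entirely hypothetical (the name HYPO0001, the date, and the sequence are invented), and split_sections and origin_sequence are illustrative helpers that handle only this simplified layout. For real records, an established parser such as Biopython's Bio.SeqIO is the safer choice.

```python
from typing import Dict, List

# Entirely hypothetical record; identifiers and sequence are invented.
EXAMPLE_GENBANK = """\
LOCUS       HYPO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   HYPO0001
VERSION     HYPO0001.1
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
     CDS             1..24
                     /product="hypothetical protein"
ORIGIN
        1 atggcccatt gtaatgggcc gtaa
//
"""

def split_sections(record: str) -> Dict[str, List[str]]:
    """Group lines under the top-level keyword that precedes them."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in record.splitlines():
        if line.startswith("//"):
            break  # record terminator
        if line and not line[0].isspace():
            current = line.split()[0]  # keywords start in the first column
            sections[current] = [line[len(current):].strip()]
        elif current is not None:
            sections[current].append(line.strip())
    return sections

def origin_sequence(sections: Dict[str, List[str]]) -> str:
    """Strip the position numbers and spaces that decorate ORIGIN lines."""
    return "".join(ch for line in sections["ORIGIN"] for ch in line if ch.isalpha())

sections = split_sections(EXAMPLE_GENBANK)
sequence = origin_sequence(sections)
# Feature keys are the FEATURES lines that are not qualifiers (a simplification:
# multi-line qualifier values are not handled here).
feature_types = [l.split()[0] for l in sections["FEATURES"][1:] if not l.startswith("/")]
print(sections["ACCESSION"][0], len(sequence), feature_types)
```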
GFF3 Format
The GFF3 Format (Generic Feature Format version 3) is a tab-delimited text format for describing features on a reference sequence — genes, exons, regulatory elements, binding sites, and any other genomic annotation. GFF3 is the standard output of genome annotation pipelines and the input to genome browsers and downstream analysis tools.
Each line in a GFF3 file contains nine tab-separated columns:
| Column | Name | Description |
|---|---|---|
| 1 | seqid | Chromosome or scaffold name |
| 2 | source | Program or database that generated the feature |
| 3 | type | Feature type (gene, mRNA, exon, CDS, etc.) |
| 4 | start | Start position (1-based, inclusive) |
| 5 | end | End position (1-based, inclusive) |
| 6 | score | Numerical score (or . if not applicable) |
| 7 | strand | + (forward), - (reverse), or . (unstranded) |
| 8 | phase | For CDS features: 0, 1, or 2 (reading frame offset) |
| 9 | attributes | Semicolon-separated key=value pairs (ID, Name, Parent, etc.) |
The critical innovation of GFF3 over earlier formats is the Parent attribute, which creates hierarchical relationships: a gene contains mRNAs, each mRNA contains exons and CDS features. This parent-child structure forms a directed acyclic graph that mirrors the biological reality of gene structure.
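The nine columns and the Parent hierarchy can be sketched as follows. The feature lines are invented for the example (the source name "example" and the IDs are placeholders), and parse_gff3_line is a hypothetical helper that ignores details of the full specification such as URL-escaped characters.

```python
from collections import defaultdict
from typing import Dict, List

# Invented annotation lines for illustration; IDs and coordinates are placeholders.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1;Name=HYPO1",
    "chr1\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\texample\tCDS\t1050\t1200\t.\t+\t0\tID=cds1;Parent=mrna1",
])

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]

def parse_gff3_line(line: str) -> dict:
    """Split one feature line into its nine named columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    feature = dict(zip(COLUMNS, fields))
    feature["start"] = int(feature["start"])  # 1-based, inclusive
    feature["end"] = int(feature["end"])      # 1-based, inclusive
    feature["attributes"] = dict(
        pair.split("=", 1) for pair in feature["attributes"].split(";") if pair
    )
    return feature

features = [parse_gff3_line(l) for l in EXAMPLE_GFF3.splitlines()
            if l and not l.startswith("#")]

# Build the parent -> children map that encodes the gene model hierarchy.
children: Dict[str, List[str]] = defaultdict(list)
for f in features:
    parents = f["attributes"].get("Parent", "")
    for parent in filter(None, parents.split(",")):  # several parents are allowed
        children[parent].append(f["attributes"]["ID"])

print(dict(children))
```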
BED Format
The BED Format (Browser Extensible Data) is a simpler, more compact alternative to GFF3 for representing genomic intervals. Originally designed for the UCSC Genome Browser, BED has become a widely used format for any analysis that works with genomic regions — peak calling in ChIP-seq, defining target regions for exome capture, and specifying intervals for coverage analysis.
A BED file is tab-delimited with three required columns and up to nine optional columns, for a maximum of twelve.
The first three columns are required:
- chrom — chromosome name
- chromStart — start position (0-based)
- chromEnd — end position (exclusive)
A crucial difference between BED and GFF3 is coordinate convention. BED uses zero-based, half-open coordinates (the start is inclusive and the end is exclusive), while GFF3 uses one-based, fully closed coordinates (both start and end are inclusive). The same interval therefore gets a different start value in each format: the 100 bases written as 100 200 in BED are written as 101 200 in GFF3. This difference is one of the most common sources of off-by-one errors in bioinformatics, and tools like bedtools and UCSC liftOver are designed with BED conventions in mind.
Watch Out!
The difference between zero-based half-open (BED) and one-based fully closed (GFF3) coordinates causes more subtle bugs than almost any other issue in genomics. When converting between formats, always adjust coordinates explicitly. A feature at BED coordinates chr1 99 200 corresponds to GFF3 coordinates chr1 100 200. Off-by-one errors here can shift every variant call, every exon boundary, and every peak summit in your analysis.
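The adjustment is small enough to write down explicitly. The sketch below uses hypothetical function names; the only rule it encodes is the one stated above, that the start shifts by one and the end stays the same.

```python
from typing import Tuple

def bed_to_gff3(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """0-based half-open (BED) -> 1-based fully closed (GFF3)."""
    return chrom_start + 1, chrom_end

def gff3_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based fully closed (GFF3) -> 0-based half-open (BED)."""
    return start - 1, end

def bed_length(chrom_start: int, chrom_end: int) -> int:
    return chrom_end - chrom_start  # half-open: no +1

def gff3_length(start: int, end: int) -> int:
    return end - start + 1          # closed: both ends count

gff_start, gff_end = bed_to_gff3(99, 200)
print(gff_start, gff_end)
print(bed_length(99, 200), gff3_length(gff_start, gff_end))
```

Comparing feature lengths before and after a conversion is a cheap sanity check: if the length changes by one, the coordinates were shifted incorrectly.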
VCF Format
The VCF Format (Variant Call Format) is the standard for representing genetic variants — single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants — relative to a reference genome. VCF is the output of variant calling pipelines like GATK, DeepVariant, and bcftools, and the input to variant annotation, filtering, and association analysis tools.
A VCF file has three sections: meta-information lines (beginning with ##), a header line (beginning with #), and data lines:
The eight fixed columns are:
| Column | Description |
|---|---|
| CHROM | Chromosome |
| POS | 1-based position of the variant |
| ID | Variant identifier (e.g., rsID from dbSNP) |
| REF | Reference allele |
| ALT | Alternate allele(s), comma-separated if multiallelic |
| QUAL | Phred-scaled quality score for the variant call |
| FILTER | PASS or a list of filters that the variant failed |
| INFO | Semicolon-separated key=value annotations |
After the fixed columns, VCF files that carry genotype data include a FORMAT column defining per-sample fields, followed by one column per sample. The genotype field GT uses 0 for the reference allele and 1, 2, etc. for alternate alleles, separated by / (unphased) or | (phased). Thus 0/1 indicates a heterozygous call, 1/1 a homozygous alternate, and 0/0 a homozygous reference.
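As a rough illustration of how the FORMAT column and the per-sample columns fit together, the sketch below reads one invented variant line and classifies each genotype. The sample names, position, and values are made up, and classify_genotype is a hypothetical helper, not part of any VCF library.

```python
import re
from typing import Dict, List

# Invented single-variant file for illustration; values are placeholders.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t1|1:16",
])

def classify_genotype(gt: str) -> str:
    """Describe a GT value such as 0/1 (unphased) or 1|1 (phased)."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"

samples: List[str] = []
calls: Dict[str, str] = {}
for line in EXAMPLE_VCF.splitlines():
    if line.startswith("##"):
        continue                      # meta-information lines
    fields = line.split("\t")
    if line.startswith("#"):
        samples = fields[9:]          # sample names follow the FORMAT column
        continue
    format_keys = fields[8].split(":")
    for sample, value in zip(samples, fields[9:]):
        per_sample = dict(zip(format_keys, value.split(":")))
        calls[sample] = classify_genotype(per_sample["GT"])

print(calls)
```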
SAM and BAM Format
The SAM and BAM Format pair represents sequence alignments — the result of mapping sequencing reads to a reference genome. SAM (Sequence Alignment/Map) is the human-readable text format, while BAM is its binary, compressed equivalent. In practice, most pipelines produce BAM files because they are 3-5 times smaller than SAM and support random access through index files (BAI).
A SAM file begins with an optional header section (lines starting with @) followed by alignment records, one per line with eleven mandatory tab-separated fields:
Key SAM fields include:
| Field | Name | Description |
|---|---|---|
| 1 | QNAME | Read name |
| 2 | FLAG | Bitwise flag encoding read properties (paired, mapped, strand, etc.) |
| 3 | RNAME | Reference sequence name (chromosome) |
| 4 | POS | 1-based leftmost mapping position |
| 5 | MAPQ | Mapping quality (Phred-scaled) |
| 6 | CIGAR | Compact representation of the alignment (e.g., 150M = 150 aligned bases, matching or mismatching) |
| 10 | SEQ | Read sequence |
| 11 | QUAL | Base quality string (same encoding as FASTQ) |
The FLAG field is a bitwise integer where each bit indicates a property of the read. For example, flag 99 = 64 + 32 + 2 + 1, meaning the read is paired (1), properly paired (2), the mate is reverse strand (32), and this is the first read in the pair (64). Decoding flags is essential for filtering alignments — tools like samtools view -f and samtools view -F use flag values to include or exclude reads.
The CIGAR string encodes the alignment structure compactly. Common operations include M (match or mismatch), I (insertion to the reference), D (deletion from the reference), S (soft clipping), and N (skipped region, used for spliced alignments in RNA-seq). A CIGAR of 50M2I48M means 50 aligned bases, a 2-base insertion, then 48 more aligned bases.
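Both fields can be decoded with a few lines of code. In the sketch below the bit values and CIGAR operation letters follow the SAM specification, while the short labels in FLAG_BITS and the function names are this example's own wording.

```python
import re
from typing import List, Tuple

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in "MIS=X")
    reference = sum(n for n, op in ops if op in "MDN=X")
    return query, reference

print(decode_flag(99))
print(parse_cigar("50M2I48M"), cigar_lengths("50M2I48M"))
```

For the CIGAR 50M2I48M the read contributes 100 bases but spans only 98 reference positions, because the two inserted bases exist in the read and not in the reference.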
PDB File Format
The PDB File Format stores three-dimensional atomic coordinates of biological macromolecules — proteins, nucleic acids, and their complexes with small-molecule ligands. Originally developed for the Protein Data Bank in the 1970s, the fixed-column-width PDB format has been largely superseded by the more flexible mmCIF (macromolecular Crystallographic Information File) format for new depositions. However, PDB format files remain widely used and understood.
A PDB file contains several record types, but the most important for structural analysis are ATOM and HETATM records:
Each ATOM record specifies:
- Atom serial number — sequential counter
- Atom name — e.g., CA for alpha carbon, N for backbone nitrogen
- Residue name — three-letter amino acid code (MET, ALA, etc.) or nucleotide code
- Chain identifier — letter designating the polypeptide or nucleic acid chain
- Residue sequence number — position in the chain
- X, Y, Z coordinates — in Angstroms
- Occupancy and B-factor — measures of atomic position certainty
The PDB format uses fixed column positions — the X coordinate always occupies columns 31-38, for example. This rigid structure makes parsing straightforward but limits extensibility, which is why the community has moved toward mmCIF for new depositions. Structural analysis tools like PyMOL, ChimeraX, and Biopython's PDB parser can read both formats.
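Fixed-column parsing amounts to slicing each line at known positions. The sketch below builds one ATOM line from invented values and then reads it back; the coordinates are hypothetical, format_atom and parse_atom are illustrative helpers, and the placement of the atom name within its four columns is simplified compared with the full PDB convention.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z,
                occupancy=1.0, b_factor=0.0) -> str:
    """Write an ATOM record using the standard column positions.

    Columns (1-based): 1-6 record, 7-11 serial, 13-16 atom name,
    18-20 residue, 22 chain, 23-26 residue number, 31-38 x, 39-46 y,
    47-54 z, 55-60 occupancy, 61-66 B-factor.
    """
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")

def parse_atom(line: str) -> dict:
    """Read an ATOM record by slicing fixed columns (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),   # columns 31-38
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

# Hypothetical alpha-carbon coordinates, for illustration only.
line = format_atom(1, "CA", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 9.67)
atom = parse_atom(line)
print(line)
print(atom["chain"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace instead of slicing looks tempting but breaks on real files, where adjacent fields can run together without a separating space (for example, large negative coordinates).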
Let's Connect the Dots!
Notice the chain identifier in PDB files — a single letter that tells you which polypeptide chain each atom belongs to. When a PDB structure contains multiple protein chains, these chain identifiers define the subunits of a complex. In protein interaction network analysis, a co-crystal structure with chains A and B provides direct structural evidence for a physical interaction between those two proteins. The PDB format thus encodes graph edges (interactions) implicitly through chain co-occurrence in a single structure file.
Pathway and Ontology Formats
Biological knowledge extends beyond sequences and structures to encompass metabolic pathways, signaling networks, and controlled vocabularies. Three formats dominate this space, each commonly serialized as XML: SBML, BioPAX, and OWL.
SBML Format
The SBML Format (Systems Biology Markup Language) is an XML-based format for representing computational models of biological processes, particularly metabolic and signaling networks. SBML encodes species (molecules), reactions (transformations), compartments (cellular locations), and mathematical rate laws that govern reaction kinetics.
SBML is the standard exchange format for systems biology models. Repositories like BioModels and the BioCyc database distribute models in SBML format, and simulation tools like COPASI and CellDesigner read and write SBML natively. From a graph perspective, an SBML model defines a bipartite graph: species nodes are connected to reaction nodes by reactant and product edges.
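The bipartite view can be sketched with Python's standard XML parser. The fragment below is a stripped-down, hypothetical model: it omits compartments, kinetic laws, and several attributes a valid SBML file would carry, and the identifiers are chosen only for illustration. Dedicated libraries such as libSBML are the usual choice for real models.

```python
import xml.etree.ElementTree as ET

# Stripped-down hypothetical fragment; a valid SBML model needs more than this.
EXAMPLE_SBML = """\
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="glucose" compartment="cytosol"/>
      <species id="g6p" compartment="cytosol"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase" reversible="false">
        <listOfReactants>
          <speciesReference species="glucose" stoichiometry="1"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p" stoichiometry="1"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}
root = ET.fromstring(EXAMPLE_SBML)

species = [s.get("id") for s in root.iterfind(".//sbml:species", NS)]

# Directed edges of the bipartite graph: species -> reaction -> species.
edges = []
for reaction in root.iterfind(".//sbml:reaction", NS):
    rid = reaction.get("id")
    for ref in reaction.iterfind("sbml:listOfReactants/sbml:speciesReference", NS):
        edges.append((ref.get("species"), rid))
    for ref in reaction.iterfind("sbml:listOfProducts/sbml:speciesReference", NS):
        edges.append((rid, ref.get("species")))

print(species)
print(edges)
```

No edge ever connects two species or two reactions directly, which is what makes the graph bipartite.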
BioPAX Format
The BioPAX Format (Biological Pathway Exchange) is an OWL-based ontology and data format designed specifically for exchanging biological pathway data between databases. While SBML focuses on quantitative simulation, BioPAX focuses on qualitative representation of pathway knowledge — which molecules participate in which reactions, what controls those reactions, and how pathways connect.
BioPAX defines a class hierarchy of biological entities (Protein, SmallMolecule, Complex, Gene) and interactions (BiochemicalReaction, Catalysis, Control, Transport). Pathway databases like Reactome, KEGG, and BioCyc export their data in BioPAX format, making it possible to integrate pathway knowledge from multiple sources into a single network.
BioPAX uses three levels of increasing complexity:
- Level 1: metabolic pathways only
- Level 2: adds molecular interactions and protein post-translational modifications
- Level 3: adds signaling pathways, gene regulatory networks, and genetic interactions
OWL Format
The OWL Format (Web Ontology Language) is not specific to bioinformatics but is the foundation for many biological ontologies. OWL is a W3C standard for defining ontologies — formal descriptions of concepts, their properties, and the relationships between them. The Gene Ontology, Disease Ontology, Human Phenotype Ontology, and BioPAX itself are all expressed in OWL.
An OWL ontology defines classes (concepts), properties (relationships between concepts), and individuals (instances of classes). OWL supports logical axioms such as class subsumption (is-a relationships), property restrictions (all values must come from a given class), and equivalence (two classes have exactly the same members). These axioms enable automated reasoning — an OWL reasoner can infer new relationships that are logically implied but not explicitly stated.
In bioinformatics, OWL is important because it provides the formal underpinning for knowledge graphs. When you represent gene functions using GO terms, diseases using DO terms, and phenotypes using HPO terms, you are working with OWL ontologies that define how those terms relate to each other in a directed acyclic graph.
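The simplest inference a reasoner performs, following is-a links transitively, can be imitated with a plain dictionary. This is a toy sketch and not an OWL reasoner: the hierarchy below is a simplified set of labels loosely modelled on molecular function terms, not a copy of any real ontology, and ancestors is a hypothetical helper.

```python
from typing import Dict, List, Set

# Simplified toy hierarchy; not the structure of any real ontology.
IS_A: Dict[str, List[str]] = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}

def ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect every class reachable by following is-a links upward."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

inferred = ancestors("protein kinase activity", IS_A)
print(sorted(inferred))
```

Only the link from "protein kinase activity" to "kinase activity" is stated directly; the other two memberships are implied by the chain of is-a relationships.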
General-Purpose Formats: CSV and JSON
Not all bioinformatics data fits neatly into domain-specific formats. Many analysis results, metadata tables, and intermediate files use general-purpose formats.
CSV for Bioinformatics
CSV for Bioinformatics (Comma-Separated Values, or its tab-separated variant TSV) is the workhorse format for tabular data. Gene expression matrices, differential expression results, annotation tables, sample metadata, and many database downloads arrive as CSV or TSV files.
Best practices for CSV in bioinformatics include:
- Use TSV (tab-delimited) rather than CSV when field values may contain commas (e.g., gene descriptions)
- Always include a header row with column names
- Use consistent missing value codes (NA, not empty cells or mixed representations)
- Quote fields that contain delimiter characters
- Store genomic coordinates with explicit column names indicating the coordinate system (0-based or 1-based)
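Several of these practices show up in even a small reader. The sketch below uses Python's csv module on an invented three-column table; the gene identifiers, descriptions, and values are placeholders.

```python
import csv
import io

# Invented table for illustration; note the comma inside a description field.
EXAMPLE_TSV = (
    "gene_id\tdescription\tlog2_fold_change\n"
    "GENE_A\tkinase, putative\t1.8\n"
    "GENE_B\thypothetical protein\tNA\n"
)

rows = []
for row in csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"):
    value = row["log2_fold_change"]
    # Map the single agreed missing-value code to None; parse everything else.
    row["log2_fold_change"] = None if value == "NA" else float(value)
    rows.append(row)

for row in rows:
    print(row["gene_id"], repr(row["description"]), row["log2_fold_change"])
```

Because the file is tab-delimited, the comma in the first description needs no quoting, and because missing values are always written as NA, one comparison is enough to detect them.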
JSON for Bioinformatics
JSON for Bioinformatics (JavaScript Object Notation) has become increasingly important as biological databases shift from flat files to web APIs. REST API responses from NCBI, UniProt, Ensembl, and other databases are typically returned in JSON format. JSON's hierarchical structure makes it well-suited for representing nested biological data — a gene with multiple transcripts, each containing multiple exons, maps naturally to nested JSON objects.
JSON is human-readable, well-supported by every programming language, and handles heterogeneous data structures that would be awkward in tabular formats. However, JSON files are larger than equivalent binary formats, and parsing large JSON files (tens of gigabytes) can be memory-intensive. For very large datasets, line-delimited JSON (JSONL), where each line is an independent JSON object, offers a good compromise between structure and scalability.
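The nesting and the line-delimited variant can both be shown with the standard json module. The gene record here is invented (identifiers and coordinates are placeholders) and does not mirror the response schema of any particular database API.

```python
import json

# Invented record; the field names do not follow any specific API schema.
EXAMPLE_GENE = {
    "gene_id": "GENE_A",
    "transcripts": [
        {"transcript_id": "GENE_A.1",
         "exons": [{"start": 100, "end": 200}, {"start": 300, "end": 450}]},
        {"transcript_id": "GENE_A.2",
         "exons": [{"start": 100, "end": 200}]},
    ],
}

# Round-trip through text, as happens when a record travels over a web API.
gene = json.loads(json.dumps(EXAMPLE_GENE))
exon_counts = {t["transcript_id"]: len(t["exons"]) for t in gene["transcripts"]}

# JSONL: one independent JSON object per line, so a reader can stream
# records one at a time instead of loading the whole file.
jsonl = "\n".join(json.dumps(t) for t in gene["transcripts"])
streamed_ids = [json.loads(line)["transcript_id"] for line in jsonl.splitlines()]

print(exon_counts)
print(streamed_ids)
```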
Data Format Conversion
Real bioinformatics pipelines rarely use a single format from start to finish. Data Format Conversion — translating data from one format to another — is one of the most common tasks in day-to-day bioinformatics work. Every conversion requires understanding the source format, the target format, and what information may be lost or gained in translation.
Common conversion scenarios include:
| From | To | Tool(s) | Notes |
|---|---|---|---|
| FASTQ | FASTA | seqtk, awk | Discards quality scores |
| SAM | BAM | samtools view | Lossless binary compression |
| BAM | FASTQ | samtools fastq, bedtools | Extracts reads from alignments |
| GFF3 | BED | gff2bed (BEDOPS) | Adjusts coordinate system (1-based to 0-based) |
| BED | GFF3 | bed2gff | Adjusts coordinate system (0-based to 1-based) |
| GenBank | FASTA | Biopython, seqret (EMBOSS) | Extracts sequence only |
| GenBank | GFF3 | bp_genbank2gff3 (BioPerl) | Extracts feature annotations |
| VCF | TSV | bcftools query | Custom column extraction |
| PDB | mmCIF | gemmi, Biopython | Modern structure format |
| SBML | BioPAX | Specialized converters | May lose quantitative details |
Pro Tip from Olli
Before writing your own format conversion script, check whether a well-tested tool already exists. samtools, bedtools, bcftools, seqtk, and Biopython handle the vast majority of common conversions. These tools have been used by thousands of researchers and handle edge cases (malformed records, missing fields, encoding differences) that a quick custom script will miss. Only write custom parsers when you have a genuinely unusual format.
Key principles for reliable format conversion:
- Validate before converting — check that the input file conforms to its format specification before feeding it to a converter
- Understand coordinate systems — always verify whether positions are 0-based or 1-based, and whether intervals are half-open or closed
- Check for information loss — converting from a richer format to a simpler one (e.g., GenBank to FASTA) necessarily discards information; make sure you do not need what is lost
- Preserve metadata — when possible, carry forward sample names, quality metrics, and provenance information through each conversion step
- Verify round-trip fidelity — if you convert from A to B and back to A, the result should match the original (when the formats have comparable expressiveness)
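To illustrate the first and third principles, here is a minimal FASTQ-to-FASTA sketch that validates each record before converting it and makes the information loss visible. It assumes unwrapped four-line records, the reads are invented, and fastq_to_fasta is a hypothetical helper; in practice a tested tool such as seqtk is preferable.

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert four-line FASTQ records to FASTA, discarding quality scores."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        # Validate before converting.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record starting at line {i + 1}")
        out.append(">" + header[1:])
        out.append(sequence)  # the quality line is dropped here
    return "\n".join(out) + "\n"

# Invented reads for illustration.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n????\n"
fasta = fastq_to_fasta(example)
print(fasta, end="")
```

The conversion cannot be reversed: nothing in the FASTA output distinguishes the read that had quality I at every base from the one that had quality ?, so a round trip back to FASTQ would require inventing scores.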
Data Quality Control
Data Quality Control encompasses the practices and tools used to assess, filter, and improve the quality of bioinformatics data before analysis. Quality control is not a single step but a continuous process applied at every stage of a pipeline — from raw sequencing reads through final variant calls or expression estimates.
Quality Control for Sequencing Data
The most common quality control step in bioinformatics is assessing raw FASTQ files with tools like FastQC and MultiQC. FastQC produces a series of diagnostic plots for each FASTQ file.
Key quality metrics for sequencing data include:
- Per-base quality scores — quality should remain high (above Phred 20-30) across the read length; a drop at the 3' end is common and may require trimming
- Per-sequence quality distribution — the majority of reads should have high average quality
- Sequence content — base composition should be roughly uniform (for whole-genome data) or match expected patterns (for targeted sequencing)
- GC content — the distribution should match the expected GC content of the organism; bimodal distributions suggest contamination
- Adapter content — sequencing adapters should be removed by trimming tools (Trimmomatic, fastp, cutadapt) before alignment
- Duplication levels — high duplication rates may indicate PCR amplification bias or low library complexity
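A few of these metrics are simple enough to compute by hand, which helps in understanding what the diagnostic tools report. The sketch below is a simplified illustration with hypothetical function names: it assumes Phred+33 encoding, and its 3' trimming rule (drop trailing bases below a threshold) is cruder than the sliding-window approaches used by dedicated trimmers.

```python
from typing import Tuple

def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in one read."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)

def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def trim_3prime(sequence: str, quality: str,
                threshold: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below the threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]

# Invented read whose last two bases have low quality ('+' = 10, '#' = 2).
seq, qual = "ACGTAC", "IIII+#"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(mean_quality(qual), mean_quality(trimmed_qual))
print(gc_fraction(seq), trimmed_seq)
```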
Diagram: FASTQ Quality Control Pipeline
Diagram Specification
- Sim ID: fastq-qc-pipeline
- Library: vis-network
- Status: Specified
This interactive flowchart diagram shows the steps in a FASTQ quality control pipeline as a directed graph. Nodes represent processing steps (Raw FASTQ, FastQC Assessment, Adapter Trimming, Quality Trimming, Length Filtering, Clean FASTQ, MultiQC Report). Edges are directed arrows showing the data flow. Decision nodes indicate pass/fail quality checks. Color coding: input/output nodes in blue, QC assessment nodes in green, trimming/filtering nodes in orange, decision nodes in yellow. Layout is top-to-bottom to show pipeline flow.
Note: Apply slight y-offset (from 480 to 490) for any horizontal edges to avoid the vis-network label rendering bug.
Quality Control for Alignments and Variants
Quality control extends beyond raw reads to every downstream data product:
- Alignment QC — samtools flagstat and Picard CollectAlignmentSummaryMetrics report mapping rates, duplicate rates, and insert size distributions for BAM files
- Variant QC — bcftools stats reports variant counts by type (SNP, indel), transition/transversion ratios (\(Ti/Tv\) ratio should be approximately 2.0-2.1 for whole-genome SNPs in humans), and heterozygosity rates
- Annotation QC — checking that GFF3 files have valid parent-child relationships, that all feature types are from controlled vocabularies, and that coordinates fall within chromosome boundaries
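The transition/transversion ratio mentioned above is straightforward to compute from a list of SNPs. In this sketch the variant list is invented, the function names are hypothetical, and every entry is assumed to be a single-base substitution with different reference and alternate alleles.

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines (A<->G) or pyrimidines (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps: List[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (ref, alt) SNP pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")

# Invented SNPs: four transitions and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
                ("A", "C"), ("G", "T")]
print(titv_ratio(example_snps))
```

A ratio far from the expected range does not identify which calls are wrong, but it is a quick signal that a call set contains many false positives, since random errors drift toward a lower ratio.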
A robust quality control strategy follows a consistent pattern:
- Assess — run diagnostic tools and generate quality reports
- Evaluate — compare metrics against known thresholds and expectations
- Filter or correct — trim low-quality bases, remove adapter sequences, mark duplicates, or exclude failed samples
- Re-assess — verify that filtering improved data quality without excessive data loss
- Document — record all QC parameters, thresholds, and tool versions for reproducibility
| Pipeline Stage | QC Tools | Key Metrics |
|---|---|---|
| Raw reads (FASTQ) | FastQC, MultiQC | Per-base quality, adapter content, GC distribution |
| Alignment (BAM) | samtools flagstat, Picard | Mapping rate, duplicate rate, insert size |
| Variants (VCF) | bcftools stats, VariantQC | Ti/Tv ratio, het/hom ratio, call rate |
| Annotations (GFF3) | GenomeTools gff3validator | Parent-child integrity, valid feature types |
Great Work, Investigators!
You now speak the languages of bioinformatics data! From the simplicity of FASTA to the complexity of SBML and BioPAX, you understand the formats that carry sequences, qualities, annotations, variants, structures, and pathways through analysis pipelines. More importantly, you know that converting between these formats, validating their contents, and controlling data quality are skills just as critical as running the analysis itself. In the chapters ahead, we will put these formats to work — reading them, writing them, and building graphs from the relationships they encode. Onward!
Key Takeaways
- FASTA is the simplest sequence format, storing only a header and a sequence. It is universally accepted but carries no quality information.
- FASTQ extends FASTA with per-base Phred quality scores, making it the standard format for raw sequencing reads. The quality encoding (\(Q = -10 \cdot \log_{10}(P)\)) quantifies base-call confidence.
- GenBank format provides richly annotated sequence records including organism metadata, literature references, feature annotations, and cross-references to other databases.
- GFF3 is the standard tab-delimited format for genomic annotations, using 1-based fully closed coordinates and Parent attributes to define hierarchical gene models.
- BED format uses 0-based half-open coordinates for genomic intervals. The coordinate difference from GFF3 is a persistent source of off-by-one errors.
- VCF is the standard format for genetic variant representation, encoding SNPs, insertions, deletions, and structural variants relative to a reference genome with per-sample genotype information.
- SAM/BAM files store sequence alignments, with BAM providing compressed binary storage and random access. The FLAG field and CIGAR string are essential for interpreting alignment records.
- PDB format stores three-dimensional atomic coordinates of biological macromolecules using fixed-column-width records, with chain identifiers that implicitly encode protein complex composition.
- SBML encodes quantitative systems biology models (species, reactions, kinetics), while BioPAX encodes qualitative pathway knowledge (interactions, regulations, pathway membership).
- OWL provides the formal ontology framework underlying Gene Ontology, Disease Ontology, and other controlled vocabularies that structure biological knowledge as directed acyclic graphs.
- CSV/TSV and JSON serve as general-purpose formats for tabular data and hierarchical API responses, respectively, complementing the domain-specific formats.
- Data format conversion requires understanding coordinate systems, information content differences, and the availability of validated tools like samtools, bedtools, bcftools, and Biopython.
- Data quality control is a continuous process applied at every pipeline stage, from raw read assessment (FastQC) through alignment QC (samtools flagstat) to variant QC (bcftools stats), and must be documented for reproducibility.