Quiz: Biological Databases

Test your understanding of major biological databases, their organization, programmatic access methods, and data provenance practices with these review questions.

1. What distinguishes Swiss-Prot from TrEMBL in the UniProt database?

A. Swiss-Prot contains only plant proteins while TrEMBL contains animal proteins
B. Swiss-Prot entries are manually curated by expert reviewers while TrEMBL entries are automatically annotated
C. Swiss-Prot is larger than TrEMBL because it includes predicted structures
D. Swiss-Prot stores nucleotide sequences while TrEMBL stores protein sequences

Show Answer

The correct answer is B. Swiss-Prot is the manually curated section of UniProt where each entry has been reviewed by a human expert who verifies the protein's existence and assigns functions based on published experimental evidence. TrEMBL entries are generated by computational translation of nucleotide sequences and annotated by automated pipelines. Swiss-Prot contains approximately 570,000 entries while TrEMBL has over 250 million, making option C incorrect.

Concept Tested: Swiss-Prot and TrEMBL

2. Which organization serves as the primary gateway to biological data in the United States, hosting GenBank, PubMed, and the Entrez search system?

A. EMBL-EBI (European Bioinformatics Institute)
B. Worldwide Protein Data Bank (wwPDB)
C. NCBI (National Center for Biotechnology Information)
D. Swiss Institute of Bioinformatics (SIB)

Show Answer

The correct answer is C. The National Center for Biotechnology Information (NCBI), established in 1988 as a division of the United States National Library of Medicine, hosts dozens of interconnected databases including GenBank, RefSeq, PubMed, and dbSNP. The Entrez search system links them together. EMBL-EBI (A) is the European counterpart. wwPDB (B) manages the Protein Data Bank. SIB (D) co-maintains UniProt.

Concept Tested: NCBI

3. How is the Gene Ontology (GO) organized structurally?

A. As a flat list of terms sorted alphabetically
B. As a simple tree where each term has exactly one parent
C. As a directed acyclic graph (DAG) where terms can have multiple parents
D. As an undirected network where terms are connected by similarity scores

Show Answer

The correct answer is C. The Gene Ontology is organized as a directed acyclic graph (DAG), not a simple tree. A single GO term can have multiple parent terms, reflecting the fact that biological concepts often belong to more than one category. For example, "DNA repair" is a child of both "cellular response to DNA damage stimulus" and "DNA metabolic process." This DAG structure is a natural fit for graph-based analysis.
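To make the multiple-parent idea concrete, here is a minimal sketch in Python. The `PARENTS` mapping is a hand-written toy fragment, not data downloaded from GO: only the two parents of "DNA repair" come from the answer above, and the intermediate levels up to the root are collapsed for brevity. The function name `ancestors` is an illustrative choice.

```python
# Toy fragment of GO-style "child -> parents" links (hand-written, not real GO data).
# Intermediate levels between these terms and the root are deliberately omitted.
PARENTS = {
    "DNA repair": [
        "cellular response to DNA damage stimulus",
        "DNA metabolic process",
    ],
    "cellular response to DNA damage stimulus": ["biological_process"],
    "DNA metabolic process": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# A tree would allow only one parent per term; a DAG allows several.
dna_repair_parents = PARENTS["DNA repair"]
dna_repair_ancestors = ancestors("DNA repair")
```

Because both parents lead to the same root, the traversal tracks visited terms so that shared ancestors are counted once.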

Concept Tested: Gene Ontology Database

4. What type of data does the Protein Data Bank (PDB) primarily store?

A. Protein amino acid sequences in FASTA format
B. Gene expression levels from RNA-seq experiments
C. Experimentally determined three-dimensional structures of biological macromolecules
D. Protein-protein interaction networks with confidence scores

Show Answer

The correct answer is C. The Protein Data Bank (PDB) is the single global archive for experimentally determined three-dimensional structures of biological macromolecules, containing over 220,000 structures determined by X-ray crystallography, cryo-EM, and NMR spectroscopy. Protein sequences (A) are stored in UniProt. Expression data (B) is found in repositories like GEO. Interaction networks (D) are stored in databases like STRING and BioGRID.

Concept Tested: Protein Data Bank

5. Which database integrates data from 29 public sources into a heterogeneous network specifically designed for computational drug repurposing?

A. Hetionet
B. KEGG
C. Reactome
D. BioGRID

Show Answer

The correct answer is A. Hetionet integrates data from 29 public resources into a single heterogeneous network containing over 47,000 nodes of 11 types and over 2.2 million edges of 24 types. It was constructed specifically to enable computational drug repurposing through graph algorithms. KEGG (B) focuses on pathway maps. Reactome (C) is a curated pathway knowledgebase. BioGRID (D) curates genetic and protein interactions from literature.
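As a rough illustration of what "heterogeneous network" means, the sketch below builds a toy graph with typed nodes and typed edges and runs a simple path query. All node names, the relation labels (`binds`, `associates`, `treats`), and the function `repurposing_candidates` are assumptions made for this example. They are not taken from the Hetionet data files, and real repurposing methods are considerably more involved.

```python
from collections import Counter

# Toy typed nodes: every node carries a type label (hypothetical data).
NODE_TYPES = {
    "compound_1": "Compound",
    "compound_2": "Compound",
    "gene_a": "Gene",
    "gene_b": "Gene",
    "disease_x": "Disease",
}

# Toy typed edges as (source, relation, target) triples (hypothetical data).
EDGES = [
    ("compound_1", "binds", "gene_a"),
    ("compound_2", "binds", "gene_b"),
    ("disease_x", "associates", "gene_a"),
    ("compound_2", "treats", "disease_x"),
]


def repurposing_candidates(disease, edges=EDGES):
    """Compounds that bind a gene associated with the disease
    but have no existing 'treats' edge to it."""
    genes = {t for s, rel, t in edges if rel == "associates" and s == disease}
    known = {s for s, rel, t in edges if rel == "treats" and t == disease}
    binders = {s for s, rel, t in edges if rel == "binds" and t in genes}
    return sorted(binders - known)


node_type_counts = Counter(NODE_TYPES.values())
edge_type_counts = Counter(rel for _, rel, _ in EDGES)
candidates = repurposing_candidates("disease_x")
```

Here `compound_1` is the only candidate: it reaches `disease_x` through `gene_a` and is not already recorded as a treatment.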

Concept Tested: Hetionet Database

6. What is the primary difference between how STRING and BioGRID capture protein-protein interactions?

A. STRING only contains human interactions while BioGRID covers all organisms
B. BioGRID relies on literature curation of experimental data while STRING integrates experimental data with computational predictions and text mining
C. STRING stores three-dimensional structures while BioGRID stores sequences
D. BioGRID uses confidence scores while STRING does not

Show Answer

The correct answer is B. BioGRID focuses on literature curation: trained curators read published papers and extract interaction data using controlled vocabularies. STRING takes a broader approach by integrating experimental interaction data with computational predictions, text mining, and genomic context information, and it assigns a confidence score to each interaction. Option D reverses the truth: STRING, not BioGRID, assigns confidence scores.

Concept Tested: STRING Database and BioGRID Database

7. Which of the following is a key element of data provenance in bioinformatics?

A. The physical location of the server hosting the database
B. The programming language used to query the database
C. Recording the database version, evidence codes, and download date used in an analysis
D. The number of citations a database has received in the literature

Show Answer

The correct answer is C. Data provenance refers to the documented history of data, including where it came from and what transformations it has undergone. Key elements include database version or release date, evidence codes indicating annotation reliability, curation status, download dates, and query parameters. These records ensure reproducibility and allow discrepancies between analyses to be traced to differences in underlying data rather than unexplained variation.
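One way to capture these elements is to store a small record next to every downloaded dataset. The sketch below is one possible layout, not a community standard: the class name `ProvenanceRecord`, its field names, and all example values (release label, date, query string) are placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Minimal provenance metadata for one downloaded dataset."""
    database: str
    release: str            # database version or release label
    download_date: str      # ISO 8601 date, e.g. "2024-01-15"
    query: str              # query parameters used to obtain the data
    curation_status: str = ""
    evidence_codes: list = field(default_factory=list)


def to_json(record):
    """Serialize the record so it can be saved alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values for illustration only.
record = ProvenanceRecord(
    database="UniProtKB/Swiss-Prot",
    release="EXAMPLE_RELEASE",
    download_date="2024-01-15",
    query="organism=human; reviewed=true",
    curation_status="manually reviewed",
    evidence_codes=["EXP", "IDA"],
)
record_json = to_json(record)
```

Saving `record_json` in the same directory as the data makes it possible to trace later discrepancies back to a specific release and query.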

Concept Tested: Data Provenance

8. What are the three independent ontologies within the Gene Ontology (GO)?

A. Gene Structure, Gene Expression, Gene Regulation
B. Molecular Function, Biological Process, Cellular Component
C. Sequence Similarity, Structural Homology, Functional Analogy
D. DNA Ontology, RNA Ontology, Protein Ontology

Show Answer

The correct answer is B. The Gene Ontology provides a structured vocabulary organized into three independent ontologies: Molecular Function (activities at the molecular level), Biological Process (larger biological programs), and Cellular Component (locations within the cell). These three aspects describe what a gene product does, what biological goal it contributes to, and where in the cell it acts.

Concept Tested: Gene Ontology Database

9. In a REST API workflow for biological databases, what is the typical format of data returned by the server?

A. Microsoft Excel spreadsheets
B. Raw binary image files
C. Structured data in JSON or XML format
D. Unformatted plain text with no structure

Show Answer

The correct answer is C. REST APIs allow you to construct a URL that specifies a query and receive structured data, typically in JSON or XML format, in response. Most biological database APIs follow standard REST conventions. While some endpoints may return specialized formats like FASTA, the primary programmatic interchange formats are JSON and XML, which provide structured, machine-readable data suitable for automated parsing.
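The request-and-parse pattern can be sketched without touching the network. In the example below, the endpoint `https://api.example.org/proteins/search`, its parameters, and the JSON payload in `SAMPLE_RESPONSE` are all hypothetical; a real database documents its own URL scheme and response fields, and a real workflow would fetch the body over HTTP instead of using a hard-coded string.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the URL documented by the database you use.
BASE_URL = "https://api.example.org/proteins/search"


def build_query_url(base_url, **params):
    """Encode query parameters into a URL (sorted for reproducibility)."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


# Made-up JSON body standing in for what a server might return.
SAMPLE_RESPONSE = (
    '{"results": [{"accession": "X00001", '
    '"name": "example protein", "reviewed": true}]}'
)


def parse_accessions(body):
    """Extract accession identifiers from a JSON response body."""
    payload = json.loads(body)
    return [hit["accession"] for hit in payload["results"]]


url = build_query_url(BASE_URL, query="kinase", format="json")
accessions = parse_accessions(SAMPLE_RESPONSE)
```

Because JSON maps directly onto dictionaries and lists, extracting a field takes a single expression, which is the practical advantage of structured formats over unstructured text.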

Concept Tested: REST APIs for Biology

10. What does the COSMIC database catalog?

A. Comparative genomic data across all vertebrate species
B. Metabolic pathway maps and enzymatic reactions
C. Somatic mutations found in human cancers
D. Three-dimensional structures of viral proteins

Show Answer

The correct answer is C. COSMIC (Catalogue Of Somatic Mutations In Cancer) is the world's largest resource for exploring the impact of somatic mutations in human cancer. It catalogs point mutations, gene fusions, genomic rearrangements, and copy number variations across all human cancer types. Comparative genomics (A) is provided by Ensembl. Metabolic pathways (B) are found in KEGG and BioCyc. Viral structures (D) would be in PDB.

Concept Tested: COSMIC Database