Quiz: Biological Databases

Test your understanding of major biological databases, their organization, programmatic access methods, and data provenance practices with these review questions.


1. What distinguishes Swiss-Prot from TrEMBL in the UniProt database?

  A. Swiss-Prot contains only plant proteins while TrEMBL contains animal proteins
  B. Swiss-Prot entries are manually curated by expert reviewers while TrEMBL entries are automatically annotated
  C. Swiss-Prot is larger than TrEMBL because it includes predicted structures
  D. Swiss-Prot stores nucleotide sequences while TrEMBL stores protein sequences
Show Answer

The correct answer is B. Swiss-Prot is the manually curated section of UniProt where each entry has been reviewed by a human expert who verifies the protein's existence and assigns functions based on published experimental evidence. TrEMBL entries are generated by computational translation of nucleotide sequences and annotated by automated pipelines. Swiss-Prot contains approximately 570,000 entries while TrEMBL has over 250 million, making option C incorrect.
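In practice, the reviewed/unreviewed split is visible in UniProtKB FASTA headers, which begin with `sp|` for Swiss-Prot and `tr|` for TrEMBL. The following is a minimal sketch of how one might separate the two; the function name `review_status` is hypothetical and the second header is a made-up placeholder, not a real entry.

```python
def review_status(fasta_header: str) -> str:
    """Classify a UniProtKB FASTA header by its database prefix."""
    # Headers look like ">db|ACCESSION|ENTRY_NAME description",
    # where db is "sp" (Swiss-Prot) or "tr" (TrEMBL).
    db = fasta_header.lstrip(">").split("|", 1)[0]
    if db == "sp":
        return "Swiss-Prot (reviewed)"
    if db == "tr":
        return "TrEMBL (unreviewed)"
    return "unknown"


headers = [
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53",
    ">tr|X0X0X0|X0X0X0_EXAMPLE Placeholder entry",  # made-up accession for illustration
]

for header in headers:
    print(review_status(header))
```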

Concept Tested: Swiss-Prot and TrEMBL


2. Which organization serves as the primary gateway to biological data in the United States, hosting GenBank, PubMed, and the Entrez search system?

  A. EMBL-EBI (European Bioinformatics Institute)
  B. Worldwide Protein Data Bank (wwPDB)
  C. NCBI (National Center for Biotechnology Information)
  D. Swiss Institute of Bioinformatics (SIB)
Show Answer

The correct answer is C. The National Center for Biotechnology Information (NCBI), established in 1988 as a division of the United States National Library of Medicine, hosts dozens of interconnected databases including GenBank, RefSeq, PubMed, and dbSNP. The Entrez search system links them together. EMBL-EBI (A) is the European counterpart. wwPDB (B) manages the Protein Data Bank. SIB (D) co-maintains UniProt.

Concept Tested: NCBI


3. How is the Gene Ontology (GO) organized structurally?

  A. As a flat list of terms sorted alphabetically
  B. As a simple tree where each term has exactly one parent
  C. As a directed acyclic graph (DAG) where terms can have multiple parents
  D. As an undirected network where terms are connected by similarity scores
Show Answer

The correct answer is C. The Gene Ontology is organized as a directed acyclic graph (DAG), not a simple tree. A single GO term can have multiple parent terms, reflecting the fact that biological concepts often belong to more than one category. For example, "DNA repair" is a child of both "cellular response to DNA damage stimulus" and "DNA metabolic process." This DAG structure is a natural fit for graph-based analysis.
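The multiple-parent property can be illustrated with a small sketch. The "DNA repair" edges below come from the explanation above; the remaining edges are a simplified, illustrative fragment rather than the full ontology, and the names `PARENTS` and `ancestors` are hypothetical.

```python
# Toy fragment of a GO-style DAG: each term maps to a list of its parent terms.
# A tree would allow at most one parent per term; a DAG allows several.
PARENTS = {
    "DNA repair": [
        "cellular response to DNA damage stimulus",
        "DNA metabolic process",
    ],
    # The edges below are simplified for illustration.
    "cellular response to DNA damage stimulus": ["cellular response to stress"],
    "DNA metabolic process": ["nucleic acid metabolic process"],
    "cellular response to stress": ["biological_process"],
    "nucleic acid metabolic process": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(len(PARENTS["DNA repair"]))     # number of direct parents
print(sorted(ancestors("DNA repair")))
```

Because both branches converge on the same root, the traversal tracks visited terms so that shared ancestors are counted once.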

Concept Tested: Gene Ontology Database


4. What type of data does the Protein Data Bank (PDB) primarily store?

  A. Protein amino acid sequences in FASTA format
  B. Gene expression levels from RNA-seq experiments
  C. Experimentally determined three-dimensional structures of biological macromolecules
  D. Protein-protein interaction networks with confidence scores
Show Answer

The correct answer is C. The Protein Data Bank (PDB) is the single global archive for experimentally determined three-dimensional structures of biological macromolecules, containing over 220,000 structures determined by X-ray crystallography, cryo-EM, and NMR spectroscopy. Protein sequences (A) are stored in UniProt. Expression data (B) is found in repositories like GEO. Interaction networks (D) are stored in databases like STRING and BioGRID.

Concept Tested: Protein Data Bank


5. Which database integrates data from 29 public sources into a heterogeneous network specifically designed for computational drug repurposing?

  A. Hetionet
  B. KEGG
  C. Reactome
  D. BioGRID
Show Answer

The correct answer is A. Hetionet integrates data from 29 public resources into a single heterogeneous network containing over 47,000 nodes of 11 types and over 2.2 million edges of 24 types. It was constructed specifically to enable computational drug repurposing through graph algorithms. KEGG (B) focuses on pathway maps. Reactome (C) is a curated pathway knowledgebase. BioGRID (D) curates genetic and protein interactions from literature.
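To make "heterogeneous network" concrete, here is a minimal sketch of a graph whose nodes and edges both carry types, with a two-step typed path query of the kind a repurposing analysis might use. All node names are made-up placeholders rather than real Hetionet content, and the helper names are hypothetical; this is not Hetionet's own tooling.

```python
from collections import defaultdict

# Every node has a type and every edge has a type (toy data only).
NODE_TYPE = {
    "compound_1": "Compound",
    "compound_2": "Compound",
    "gene_A": "Gene",
    "gene_B": "Gene",
    "disease_X": "Disease",
}
EDGES = [
    ("compound_1", "binds", "gene_A"),
    ("compound_2", "binds", "gene_B"),
    ("disease_X", "associates", "gene_A"),
]


def build_index(edges):
    """Index neighbours by (node, edge type), treating edges as undirected."""
    index = defaultdict(set)
    for source, kind, target in edges:
        index[(source, kind)].add(target)
        index[(target, kind)].add(source)
    return index


def candidates_for(disease, edges=EDGES):
    """Follow Disease -associates- Gene -binds- Compound paths."""
    index = build_index(edges)
    hits = set()
    for gene in index[(disease, "associates")]:
        for node in index[(gene, "binds")]:
            if NODE_TYPE[node] == "Compound":
                hits.add(node)
    return sorted(hits)


print(candidates_for("disease_X"))
```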

Concept Tested: Hetionet Database


6. What is the primary difference between how STRING and BioGRID capture protein-protein interactions?

  A. STRING only contains human interactions while BioGRID covers all organisms
  B. BioGRID relies on literature curation of experimental data while STRING integrates experimental data with computational predictions and text mining
  C. STRING stores three-dimensional structures while BioGRID stores sequences
  D. BioGRID uses confidence scores while STRING does not
Show Answer

The correct answer is B. BioGRID focuses on literature curation, where trained curators read published papers and extract interaction data using controlled vocabularies. STRING takes a broader approach by integrating experimental interaction data with computational predictions, text mining, and genomic context information, and it assigns a confidence score to each interaction. Option D reverses the truth: STRING, not BioGRID, assigns confidence scores.
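A common use of STRING confidence scores is to keep only interactions above a cutoff. The sketch below assumes records shaped like rows of a STRING links download, where `combined_score` is on a 0 to 1000 scale; the protein identifiers and scores are made up, and the function name and the default cutoff of 700 are illustrative choices.

```python
# Made-up interaction records; combined_score mimics a 0-1000 confidence scale.
INTERACTIONS = [
    {"protein1": "prot_A", "protein2": "prot_B", "combined_score": 950},
    {"protein1": "prot_A", "protein2": "prot_C", "combined_score": 420},
    {"protein1": "prot_B", "protein2": "prot_D", "combined_score": 700},
]


def filter_by_confidence(rows, min_score=700):
    """Keep interactions whose combined score meets the cutoff."""
    return [row for row in rows if row["combined_score"] >= min_score]


for row in filter_by_confidence(INTERACTIONS):
    print(row["protein1"], row["protein2"], row["combined_score"])
```

Raising the cutoff trades coverage for reliability, so the chosen value is worth recording alongside the results.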

Concept Tested: STRING Database and BioGRID Database


7. Which of the following is a key element of data provenance in bioinformatics?

  A. The physical location of the server hosting the database
  B. The programming language used to query the database
  C. Recording the database version, evidence codes, and download date used in an analysis
  D. The number of citations a database has received in the literature
Show Answer

The correct answer is C. Data provenance refers to the documented history of data, including where it came from and what transformations it has undergone. Key elements include database version or release date, evidence codes indicating annotation reliability, curation status, download dates, and query parameters. These records ensure reproducibility and allow discrepancies between analyses to be traced to differences in underlying data rather than unexplained variation.
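One way to capture these elements is to write a small provenance record next to every downloaded dataset. The sketch below is one possible layout, not a standard schema; the function name, field names, release label, and date are all hypothetical placeholders.

```python
import json
from datetime import date


def provenance_record(database, release, query, evidence_codes, downloaded=None):
    """Bundle the provenance elements for one data download into a dict."""
    return {
        "database": database,
        "release": release,
        "query": query,
        "evidence_codes": sorted(evidence_codes),
        # Default to today's date when no explicit download date is given.
        "download_date": (downloaded or date.today()).isoformat(),
    }


record = provenance_record(
    database="UniProtKB/Swiss-Prot",
    release="EXAMPLE_RELEASE",              # placeholder, not a real release label
    query="reviewed:true AND organism_id:9606",
    evidence_codes={"IMP", "IDA"},          # GO experimental evidence codes
    downloaded=date(2024, 1, 15),           # illustrative date
)

print(json.dumps(record, indent=2))
```

Storing the record as JSON beside the data file lets a later analysis compare releases and query parameters before comparing results.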

Concept Tested: Data Provenance


8. What are the three independent ontologies within the Gene Ontology (GO)?

  A. Gene Structure, Gene Expression, Gene Regulation
  B. Molecular Function, Biological Process, Cellular Component
  C. Sequence Similarity, Structural Homology, Functional Analogy
  D. DNA Ontology, RNA Ontology, Protein Ontology
Show Answer

The correct answer is B. The Gene Ontology provides a structured vocabulary organized into three independent ontologies: Molecular Function (activities at the molecular level), Biological Process (larger biological programs), and Cellular Component (locations within the cell). These three aspects describe what a gene product does, what biological goal it contributes to, and where in the cell it acts.

Concept Tested: Gene Ontology Database


9. In a REST API workflow for biological databases, what is the typical format of data returned by the server?

  A. Microsoft Excel spreadsheets
  B. Raw binary image files
  C. Structured data in JSON or XML format
  D. Unformatted plain text with no structure
Show Answer

The correct answer is C. REST APIs allow you to construct a URL that specifies a query and receive structured data, typically in JSON or XML format, in response. Most biological database APIs follow standard REST conventions. While some endpoints may return specialized formats like FASTA, the primary programmatic interchange formats are JSON and XML, which provide structured, machine-readable data suitable for automated parsing.
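The request-then-parse pattern can be sketched without touching the network. The example below builds a query URL for the UniProt REST search endpoint and parses a canned JSON string; the sample response is made up and heavily simplified, and the helper names are hypothetical. Check the current API documentation before relying on specific parameters or field names.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(query, fmt="json", size=5):
    """Encode the query parameters into a REST request URL."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


def accessions(response_text):
    """Extract accession identifiers from a JSON response body."""
    payload = json.loads(response_text)
    return [entry["primaryAccession"] for entry in payload["results"]]


# Canned, simplified response used in place of a live HTTP call.
SAMPLE_RESPONSE = '{"results": [{"primaryAccession": "P04637"}]}'

url = build_query_url("gene:TP53 AND organism_id:9606")
print(url)
print(accessions(SAMPLE_RESPONSE))
```

In a real workflow the URL would be fetched with an HTTP client and the response body passed to `accessions`.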

Concept Tested: REST APIs for Biology


10. What does the COSMIC database catalog?

  A. Comparative genomic data across all vertebrate species
  B. Metabolic pathway maps and enzymatic reactions
  C. Somatic mutations found in human cancers
  D. Three-dimensional structures of viral proteins
Show Answer

The correct answer is C. COSMIC (Catalogue Of Somatic Mutations In Cancer) is the world's largest resource for exploring the impact of somatic mutations in human cancer. It catalogs point mutations, gene fusions, genomic rearrangements, and copy number variations across all human cancer types. Comparative genomics (A) is provided by Ensembl. Metabolic pathways (B) are found in KEGG and BioCyc. Viral structures (D) would be in PDB.

Concept Tested: COSMIC Database