Quiz: Generative AI and Large Language Models

Test your understanding of embeddings, attention mechanisms, and transformer architecture.

1. An embedding maps:

A. Continuous values to discrete tokens
B. Discrete tokens to continuous vectors
C. Images to text
D. Gradients to weights

Show Answer

The correct answer is B. An embedding maps discrete items (like words or tokens) to continuous vector representations. This enables neural networks to process symbolic data like text.

Concept Tested: Embedding
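
As an illustration of the mapping described in the answer, here is a minimal sketch of an embedding as a lookup table. The vocabulary, the dimension of 4, and the random values are all hypothetical placeholders; in a trained model the rows would be learned.

```python
import random

# Hypothetical toy vocabulary: discrete tokens mapped to integer ids.
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_dim = 4

random.seed(0)
# Embedding table: one row of continuous values per token id.
# Random placeholders here; a real model learns these rows during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map a list of token strings to their continuous vectors by table lookup."""
    return [embedding_table[vocab[token]] for token in tokens]

vectors = embed(["the", "cat", "sat"])
for token, vector in zip(["the", "cat", "sat"], vectors):
    print(token, [round(x, 3) for x in vector])
```

Each token always retrieves the same row, so the table is the only place where the discrete-to-continuous conversion happens.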

2. Cosine similarity between two vectors measures:

A. Their Euclidean distance
B. The angle between them (ignoring magnitude)
C. The sum of their element-wise products
D. The difference in their norms

Show Answer

The correct answer is B. Cosine similarity measures the cosine of the angle between vectors: \(\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}\). It ranges from -1 to 1 and ignores vector magnitudes.

Concept Tested: Cosine Similarity
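
The formula in the answer can be sketched directly in plain Python. The function name and the example vectors are illustrative choices; the sketch assumes neither vector is all zeros, since the formula is undefined in that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(u, [2.0, 4.0, 6.0])    # u scaled by 2
opposite = cosine_similarity(u, [-1.0, -2.0, -3.0])       # u negated
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # right angle

print(same_direction, opposite, orthogonal)
```

Scaling `u` by 2 leaves the similarity at 1 (up to floating-point rounding), which is the "ignores magnitude" property the question tests.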

3. In attention mechanisms, the query-key dot product determines:

A. The size of the output
B. How much each position attends to other positions
C. The number of attention heads
D. The embedding dimension

Show Answer

The correct answer is B. The query-key dot product computes attention scores that determine how much each position should focus on other positions. After softmax normalization, a higher score becomes a larger attention weight.

Concept Tested: Attention Score
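
A small sketch of how query-key dot products become attention weights. The toy queries and keys are hand-picked assumptions, and the \(\sqrt{d_k}\) scaling from question 4 is left out here to keep the focus on the dot product itself.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(queries, keys):
    """Raw score for every (query, key) pair: one row of scores per query."""
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
            for q in queries]

# Hypothetical toy data: 2 query positions, 3 key positions, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = attention_scores(queries, keys)
weights = [softmax(row) for row in scores]

for row in weights:
    print([round(w, 3) for w in row])
```

The first query scores 1, 0, and 1 against the three keys, so it puts equal weight on keys 0 and 2 and less on key 1.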

4. The scaling factor \(\sqrt{d_k}\) in attention is used to:

A. Increase the attention weights
B. Prevent softmax saturation from large dot products
C. Reduce the number of parameters
D. Speed up training

Show Answer

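The correct answer is B. If query and key components are independent with zero mean and unit variance, their dot product has variance \(d_k\), so scores grow in magnitude as the dimension increases. Dividing by \(\sqrt{d_k}\) brings the variance back to about 1 and keeps softmax out of saturated regions where gradients are very small.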

Concept Tested: Scaled Dot-Product Attention
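
The saturation effect can be observed with a quick experiment. This is a sketch under stated assumptions: random Gaussian vectors stand in for real queries and keys, and `d_k = 256` and the 8 keys are arbitrary choices.

```python
import math
import random

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d_k = 256  # arbitrary key dimension for the demonstration
query = [random.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(8)]

raw_scores = [dot(query, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("largest weight without scaling:", round(max(unscaled_weights), 4))
print("largest weight with scaling:   ", round(max(scaled_weights), 4))
```

Dividing all scores by a constant greater than 1 can only flatten the softmax output, so the largest weight with scaling is never above the largest weight without it.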

5. Multi-head attention:

A. Uses only one attention computation
B. Runs multiple attention operations in parallel to capture different patterns
C. Eliminates the need for queries and keys
D. Replaces the transformer architecture

Show Answer

The correct answer is B. Multi-head attention runs several attention computations in parallel, each potentially learning different relationship patterns (syntactic, semantic, positional, etc.), then combines their outputs.

Concept Tested: Multi-Head Attention
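
The split, attend, and combine steps can be sketched as follows. This is a deliberately simplified version: it omits the learned projection matrices that real multi-head attention applies to queries, keys, values, and the concatenated output, and it slices the input vectors into per-head chunks instead. All function names are illustrative.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs

def multi_head_self_attention(X, num_heads):
    """Run attention independently on each slice of the feature dimension."""
    d_model = len(X[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: slice instead of applying learned per-head projections.
        head = [row[h * head_dim:(h + 1) * head_dim] for row in X]
        head_outputs.append(attention(head, head, head))
    # Concatenate the heads' outputs position by position.
    merged = []
    for i in range(len(X)):
        row = []
        for head_out in head_outputs:
            row.extend(head_out[i])
        merged.append(row)
    return merged

# Hypothetical input: 3 positions, model dimension 4, split into 2 heads.
X = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 1.0, 0.3, 0.3]]
output = multi_head_self_attention(X, num_heads=2)
```

Because each head sees a different slice, each computes its own attention weights, which is what lets different heads pick up different patterns.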

6. Self-attention allows each position to:

A. Only attend to itself
B. Attend to all positions in the same sequence
C. Only attend to previous positions
D. Ignore all other positions

Show Answer

The correct answer is B. Self-attention allows each position in a sequence to attend to every position in that sequence, including itself, enabling direct modeling of long-range dependencies.

Concept Tested: Self-Attention

7. Position encoding in transformers provides:

A. Word meaning information
B. Sequence order information since attention is permutation-invariant
C. Grammar rules
D. Vocabulary size

Show Answer

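The correct answer is B. Self-attention on its own treats its input as a set: reordering the tokens simply reorders the outputs (strictly speaking this is permutation equivariance, often loosely called permutation invariance). Position encodings inject information about the order of tokens in the sequence.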

Concept Tested: Position Encoding
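
One common way to supply order information is the fixed sinusoidal encoding, sketched below. The quiz does not say which scheme is meant, so treating it as sinusoidal is an assumption; learned position embeddings are another widely used option. The sizes and the constant token embedding are placeholders.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """Sine on even indices, cosine on odd indices, one row per position."""
    encoding = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2j, 2j+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

pe = sinusoidal_position_encoding(seq_len=4, d_model=6)

# The same (hypothetical) token embedding placed at every position.
token_embedding = [0.5] * 6
inputs = [[e + p for e, p in zip(token_embedding, row)] for row in pe]

for pos, row in enumerate(inputs):
    print(pos, [round(x, 3) for x in row])
```

The identical token ends up with a different input vector at each position, which is how the model can tell positions apart.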

8. LoRA (Low-Rank Adaptation) reduces fine-tuning cost by:

A. Removing all attention layers
B. Adding trainable low-rank matrices instead of updating full weights
C. Using smaller vocabulary
D. Eliminating position encodings

Show Answer

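The correct answer is B. LoRA keeps the original weights frozen and adds small, trainable low-rank matrices (\(A\) and \(B\)) such that \(W' = W + BA\). For a \(d \times k\) weight matrix, \(B\) is \(d \times r\) and \(A\) is \(r \times k\) with \(r \ll \min(d, k)\), so the number of trainable parameters drops from \(dk\) to \(r(d + k)\).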

Concept Tested: LoRA
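
The update \(W' = W + BA\) and the parameter savings can be sketched with tiny matrices. The sizes, helper names, and random values are illustrative assumptions. Starting \(B\) at zero and \(A\) at small random values follows the commonly described LoRA initialization, so the adapted weights equal the original ones before any training.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of X (n x m) and Y (m x p)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def trainable_params(d, k, r):
    """Return (full fine-tuning count, LoRA count) for one d x k matrix."""
    return d * k, r * (d + k)

random.seed(0)
d, k, r = 8, 8, 2  # toy sizes; real layers are far larger

W = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]   # frozen
A = [[random.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                    # trainable, d x r

W_adapted = add(W, matmul(B, A))  # W' = W + BA

print(trainable_params(d, k, r))       # toy layer
print(trainable_params(4096, 4096, 8)) # a larger, hypothetical layer
```

For the hypothetical 4096 x 4096 layer with rank 8, the count falls from 16,777,216 to 65,536 trainable parameters, a 256-fold reduction.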

9. In a latent space, similar items typically:

A. Have very different vector representations
B. Are mapped to nearby points
C. Have zero cosine similarity
D. Require different embedding dimensions

Show Answer

The correct answer is B. A well-trained latent space maps semantically similar items to nearby points. This structure enables meaningful interpolation between points and vector arithmetic on them.

Concept Tested: Latent Space
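
A toy sketch of "nearby points" and interpolation. The 2-D coordinates are hand-picked for illustration and do not come from any trained model; real latent spaces have many more dimensions.

```python
import math

# Hypothetical, hand-picked latent vectors (not from a trained model).
latent = {
    "cat": [0.9, 0.8],
    "dog": [1.0, 0.7],
    "car": [-0.8, 0.1],
    "truck": [-0.9, 0.2],
}

def distance(a, b):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name):
    """Closest other item to `name` in the latent space."""
    others = (other for other in latent if other != name)
    return min(others, key=lambda other: distance(latent[name], latent[other]))

def interpolate(a, b, t):
    """Point a fraction t of the way from a to b (t=0 gives a, t=1 gives b)."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

print(nearest("cat"), nearest("car"))
midpoint = interpolate(latent["cat"], latent["dog"], 0.5)
print([round(x, 3) for x in midpoint])
```

In this toy layout the animals sit close together and the vehicles sit close together, so nearest-neighbor lookup recovers the grouping.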

10. The Value matrix in attention provides:

A. The content to be aggregated based on attention weights
B. The query for matching
C. The key for compatibility
D. The position information

Show Answer

The correct answer is A. The Value matrix contains the actual information to be retrieved. Attention weights (from Query-Key matching) determine how to combine Values to produce the output.

Concept Tested: Value Matrix
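
To illustrate how attention weights combine Values, here is a minimal sketch of scaled dot-product attention for a single query. The keys, values, and query are hand-picked assumptions, chosen so that the query matches the first key much more strongly than the second.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (attention weights, weighted sum of the value vectors)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[c] for w, value in zip(weights, values))
              for c in range(len(values[0]))]
    return weights, output

# Hypothetical toy data: 2 key/value pairs, keys of dimension 2, values of dimension 3.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

weights, output = attend([10.0, 0.0], keys, values)
print([round(w, 4) for w in weights])
print([round(x, 4) for x in output])
```

Almost all of the weight lands on the first key, so the output is very close to the first value vector: the keys decide where to look, and the values supply what is returned.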