Image Processing and Computer Vision
Summary
Images are matrices of pixel values, making linear algebra the natural language for image processing and computer vision. This chapter covers image representation as matrices and tensors, convolution operations, image filtering (blur, sharpen, edge detection), Fourier transforms, and feature detection. You will also learn about homography for perspective transformations.
Concepts Covered
This chapter covers the following 16 concepts from the learning graph:
- Image Matrix
- Grayscale Image
- RGB Image
- Image Tensor
- Image Convolution
- Image Filter
- Blur Filter
- Sharpen Filter
- Edge Detection
- Sobel Operator
- Fourier Transform
- Frequency Domain
- Image Compression
- Color Space Transform
- Feature Detection
- Homography
Prerequisites
This chapter builds on concepts from:
- Chapter 2: Matrices and Matrix Operations
- Chapter 4: Linear Transformations
- Chapter 7: Matrix Decompositions (for SVD-based compression)
- Chapter 10: Neural Networks and Deep Learning (for tensors)
Introduction
Every digital image you see—from smartphone photos to medical scans—is fundamentally a matrix of numbers. This mathematical representation enables powerful image processing operations using linear algebra:
- Matrix operations enable pixel-wise transformations
- Convolution applies local filters for blurring, sharpening, and edge detection
- Fourier transforms reveal frequency content for compression and filtering
- Linear transformations correct perspective and align images
Understanding these foundations illuminates how image editing software, computer vision systems, and neural networks process visual information.
Image Representation
The Image Matrix
A digital image is stored as a matrix where each entry represents a pixel's intensity or color value. For an image with height \(H\) and width \(W\):
\(\mathbf{I} \in \mathbb{R}^{H \times W}\)
The matrix entry \(I_{i,j}\) corresponds to the pixel at row \(i\) (from top) and column \(j\) (from left). Pixel values are typically integers in \([0, 255]\) or floats in \([0, 1]\).
Grayscale Images
A grayscale image uses a single value per pixel representing light intensity:
- 0 = black (no light)
- 255 = white (maximum light)
- Intermediate values = shades of gray
| Value Range | Interpretation |
|---|---|
| 0-50 | Very dark |
| 50-100 | Dark gray |
| 100-150 | Medium gray |
| 150-200 | Light gray |
| 200-255 | Very bright |
A 1920×1080 HD grayscale image is a matrix with over 2 million entries.
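As a concrete illustration, here is a minimal NumPy sketch (NumPy assumed available; the pixel values are made up) that builds a small grayscale image matrix and reads off individual pixels:

```python
import numpy as np

# A tiny 4x4 grayscale "image": 0 = black, 255 = white
I = np.array([
    [  0,  64, 128, 255],
    [ 64, 128, 255, 128],
    [128, 255, 128,  64],
    [255, 128,  64,   0],
], dtype=np.uint8)

print(I.shape)   # (4, 4) -> height H = 4, width W = 4
print(I[0, 3])   # pixel at row 0, column 3 -> 255 (white)

# Convert to floats in [0, 1], a common preprocessing step
I_float = I.astype(np.float64) / 255.0
```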
Diagram: Image Matrix Visualizer
Run the Image Matrix Visualizer Fullscreen
Image Matrix Visualizer
Type: microsim
Learning objective: Understand how pixel values in a matrix correspond to image appearance (Bloom: Understand)
Visual elements: - Left panel: Small grayscale image (e.g., 8×8 or 16×16 pixels) - Right panel: Matrix showing numeric values - Hover highlight connecting matrix cell to corresponding pixel - Color gradient legend (0=black to 255=white)
Interactive controls: - Dropdown: Select sample image (checkerboard, gradient, simple shape, random) - Slider: Zoom level for image display - Click on matrix cell to highlight corresponding pixel - Click on pixel to highlight corresponding matrix cell - Button: "Edit mode" to modify individual pixel values
Canvas layout: 800x500px with side-by-side image and matrix
Behavior: - Hovering over matrix cell highlights pixel with colored border - Editing a value immediately updates the image - Show real-time pixel coordinates and value on hover - Display image dimensions and total pixel count
Implementation: p5.js with matrix rendering and image display
RGB Images
Color images use three channels—Red, Green, Blue—each stored as a separate matrix:
\(\mathbf{I}_R, \mathbf{I}_G, \mathbf{I}_B \in \mathbb{R}^{H \times W}\)
Each pixel is a triplet \((r, g, b)\) where each component ranges from 0 to 255:
| Color | RGB Value |
|---|---|
| Red | (255, 0, 0) |
| Green | (0, 255, 0) |
| Blue | (0, 0, 255) |
| Yellow | (255, 255, 0) |
| Cyan | (0, 255, 255) |
| Magenta | (255, 0, 255) |
| White | (255, 255, 255) |
| Black | (0, 0, 0) |
The RGB color model is additive—colors are created by adding light. Mixing all three at full intensity produces white.
Image Tensors
For computational purposes, color images are often represented as 3D tensors:
\(\mathbf{I} \in \mathbb{R}^{H \times W \times C}\)
where:
- \(H\) is the height (rows)
- \(W\) is the width (columns)
- \(C\) is the number of channels (3 for RGB, 4 for RGBA with transparency)
In deep learning, batches of images form 4D tensors:
\(\mathbf{X} \in \mathbb{R}^{N \times C \times H \times W}\)
where \(N\) is the batch size. The ordering of dimensions varies by framework (PyTorch uses \(N \times C \times H \times W\); TensorFlow often uses \(N \times H \times W \times C\)).
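The short sketch below (NumPy assumed; the sizes are illustrative) shows a single image as a 3D array and a batch being rearranged between the two channel orderings:

```python
import numpy as np

H, W, C, N = 480, 640, 3, 8          # illustrative dimensions

img_hwc = np.zeros((H, W, C))        # single RGB image, H x W x C
batch_nhwc = np.zeros((N, H, W, C))  # TensorFlow-style batch: N x H x W x C

# Reorder axes to the PyTorch-style layout N x C x H x W
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))
print(batch_nchw.shape)              # (8, 3, 480, 640)
```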
Diagram: RGB Channel Decomposition
Run the RGB Channel Decomposition Fullscreen
RGB Channel Decomposition
Type: microsim
Learning objective: Visualize how RGB channels combine to form color images (Bloom: Analyze)
Visual elements: - Original color image (center or top) - Three grayscale images showing R, G, B channels separately - Color-tinted versions of each channel (red, green, blue tints) - Reconstructed image from selected channels
Interactive controls: - Image selector: Choose from sample images - Channel toggles: Enable/disable R, G, B for reconstruction - Slider: Adjust individual channel intensity (0-100%) - Toggle: Show grayscale vs color-tinted channel views
Canvas layout: 800x600px with original and channel displays
Behavior: - Disabling a channel shows image without that color component - Adjusting channel intensity shows over/under-saturation effects - Hover over pixel to show (r, g, b) values - Real-time reconstruction as channels are modified
Implementation: p5.js with pixel array manipulation
Color Space Transforms
While RGB is standard for display, other color spaces are useful for specific tasks:
| Color Space | Components | Use Cases |
|---|---|---|
| RGB | Red, Green, Blue | Display, storage |
| HSV/HSB | Hue, Saturation, Value | Color selection, segmentation |
| HSL | Hue, Saturation, Lightness | Web design, color adjustment |
| YCbCr | Luminance, Chrominance | JPEG compression |
| LAB | Lightness, a, b | Perceptual uniformity |
Converting between color spaces involves matrix transformations. For example, RGB to YCbCr:
RGB to YCbCr Transform
\(\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix}\)
where:
- \(Y\) is luminance (brightness information)
- \(C_b\) and \(C_r\) are chrominance (color difference signals)
- The matrix weights reflect human perception (green contributes most to perceived brightness)
This separation enables chroma subsampling—reducing color resolution while preserving brightness detail—which is fundamental to JPEG compression.
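Because the conversion is an affine map, it can be applied to every pixel at once with one matrix product. A minimal sketch using the coefficients above (NumPy assumed; the helper name rgb_to_ycbcr is just for illustration):

```python
import numpy as np

# JPEG-style RGB -> YCbCr matrix and offset, matching the transform above
M = np.array([[ 0.299,  0.587,  0.114],
              [-0.169, -0.331,  0.500],
              [ 0.500, -0.419, -0.081]])
offset = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(img):
    """img: H x W x 3 array of RGB values in [0, 255]."""
    flat = img.reshape(-1, 3).astype(np.float64)   # one row per pixel
    ycbcr = flat @ M.T + offset                    # affine map per pixel
    return ycbcr.reshape(img.shape)

red_patch = np.full((2, 2, 3), [255.0, 0.0, 0.0])  # pure red image
print(rgb_to_ycbcr(red_patch)[0, 0])               # approx [76.2, 84.9, 255.5]
```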
Image Convolution
Convolution is the foundational operation for image filtering. It applies a small matrix called a kernel or filter to every location in the image.
The Convolution Operation
For a 2D image \(\mathbf{I}\) and kernel \(\mathbf{K}\) of size \((2k+1) \times (2k+1)\):
2D Convolution Formula
\((\mathbf{I} * \mathbf{K})[i, j] = \sum_{m=-k}^{k} \sum_{n=-k}^{k} I[i+m, j+n] \cdot K[m, n]\)
where:
- \(*\) denotes convolution (not multiplication)
- \((i, j)\) is the output pixel location
- The kernel is centered at each pixel
- Values outside the image boundary require padding (zero, replicate, or reflect)
Convolution is a linear operation: convolving an image with a sum of kernels equals the sum of the individual convolutions, and the same holds for sums of images. Strictly speaking, the formula above computes cross-correlation; true convolution flips the kernel, but the two coincide for symmetric kernels and the terms are used interchangeably in image processing.
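A direct, unoptimized implementation of this sliding-window formula with zero padding might look like the sketch below; production code would use a library routine such as scipy.ndimage.correlate, but the explicit loops make the index arithmetic visible:

```python
import numpy as np

def convolve2d(image, kernel):
    """Apply a (2k+1) x (2k+1) kernel to a 2D image with zero padding.

    Implements the sliding-window (correlation-style) formula above.
    """
    kh, kw = kernel.shape
    k = kh // 2
    padded = np.pad(image.astype(np.float64), k, mode="constant")
    out = np.zeros(image.shape, dtype=np.float64)
    H, W = image.shape
    for i in range(H):
        for j in range(W):
            window = padded[i:i + kh, j:j + kw]   # window centered at (i, j)
            out[i, j] = np.sum(window * kernel)
    return out

identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
img = np.random.rand(5, 5)
assert np.allclose(convolve2d(img, identity), img)  # identity kernel: no change
```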
Image Filters
An image filter is a kernel designed to achieve a specific effect. The kernel values determine what features are enhanced or suppressed.
| Filter Type | Effect | Kernel Property |
|---|---|---|
| Low-pass (blur) | Smooths, reduces noise | Positive values, sums to 1 |
| High-pass (sharpen) | Enhances edges | Positive center, negative surround |
| Derivative (edge) | Detects changes | Sums to 0, antisymmetric |
| Identity | No change | 1 at center, 0 elsewhere |
Diagram: Convolution Visualizer
Run the Convolution Visualizer Fullscreen
Convolution Visualizer
Type: microsim
Learning objective: Understand how convolution applies a kernel to an image (Bloom: Apply)
Visual elements: - Input image (grayscale, small for clarity) - Sliding kernel window highlighted on input - Kernel matrix with current values displayed - Output image being constructed - Calculation panel showing element-wise products and sum
Interactive controls: - Kernel selector: Identity, Blur 3×3, Sharpen, Edge detect - Custom kernel editor: 3×3 grid of editable values - Animation speed slider - Button: "Step" (advance one pixel) - Button: "Run" (animate full convolution) - Toggle: Show/hide calculation details
Canvas layout: 900x600px with input, kernel, and output panels
Behavior: - Animate kernel sliding across input image - Show element-wise multiplication at current position - Display running sum being computed - Fill in output pixel when calculation completes - Handle boundary conditions visually
Implementation: p5.js with step-by-step animation
Blur Filters
Blur filters smooth images by averaging neighboring pixels, reducing noise and fine detail.
Box Blur (Mean Filter):
\(\mathbf{K}_{box} = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}\)
Every pixel contributes equally. Simple but creates blocky artifacts.
Gaussian Blur:
\(\mathbf{K}_{gauss} = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}\)
Weights follow a Gaussian distribution—center pixels contribute more. Produces smoother, more natural blurring.
The general Gaussian kernel is:
\(G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\)
where \(\sigma\) controls the blur radius. Larger \(\sigma\) means more blurring.
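A short sketch (NumPy assumed; the truncation radius of 3σ is a common convention, not a requirement) that builds a Gaussian kernel for a given σ and normalizes it to sum to 1 so overall brightness is preserved:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Return a (2r+1) x (2r+1) Gaussian kernel normalized to sum to 1."""
    r = radius if radius is not None else int(np.ceil(3 * sigma))
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    K = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return K / K.sum()          # normalization keeps overall brightness

K = gaussian_kernel(sigma=1.0)
print(K.shape)                  # (7, 7) for sigma = 1.0
print(K.sum())                  # 1.0
```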
Sharpen Filters
Sharpen filters enhance edges and fine detail by emphasizing differences between neighboring pixels:
\(\mathbf{K}_{sharpen} = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}\)
This kernel can be understood as the identity kernel (a 1 at the center) plus the negative of the Laplacian kernel, which adds edge enhancement on top of the original image.
The result amplifies differences while preserving overall brightness.
Diagram: Filter Effects Gallery
Run the Filter Effects Gallery Fullscreen
Filter Effects Gallery
Type: microsim
Learning objective: Compare effects of different filters on the same image (Bloom: Evaluate)
Visual elements: - Original image (top center) - Grid of filtered results (4-6 variants) - Kernel display below each filtered image - Difference image option (filtered - original)
Interactive controls: - Image selector: Choose sample image - Filter checkboxes: Select which filters to display - Slider: Filter strength/radius where applicable - Toggle: Show difference images - Toggle: Show kernel matrices
Canvas layout: 900x700px with grid layout
Behavior: - All filters applied to same source image - Hover over filtered image to see kernel - Click to show full-size comparison - Side-by-side slider for before/after on selected filter
Implementation: p5.js with multiple filter implementations
Edge Detection
Edge detection identifies boundaries in images where intensity changes rapidly. Edges correspond to object boundaries, texture changes, and depth discontinuities.
Derivative-Based Edge Detection
Edges are locations of high gradient magnitude. The image gradient at pixel \((i, j)\):
\(\nabla I = \begin{bmatrix} \frac{\partial I}{\partial x} \\ \frac{\partial I}{\partial y} \end{bmatrix}\)
The gradient magnitude and direction:
\(|\nabla I| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}\)
\(\theta = \arctan\left(\frac{\partial I / \partial y}{\partial I / \partial x}\right)\)
The Sobel Operator
The Sobel operator approximates image derivatives using convolution:
Sobel Kernels
\(\mathbf{G}_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \quad \mathbf{G}_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}\)
where:
- \(\mathbf{G}_x\) detects vertical edges (horizontal gradient)
- \(\mathbf{G}_y\) detects horizontal edges (vertical gradient)
- The weights provide smoothing (via the 1-2-1 pattern) combined with differentiation
The Sobel operator combines Gaussian smoothing with differentiation, making it more robust to noise than simple difference operators.
| Edge Detector | Kernel Size | Noise Sensitivity | Edge Localization |
|---|---|---|---|
| Simple difference | 1×2 | High | Good |
| Prewitt | 3×3 | Medium | Good |
| Sobel | 3×3 | Low | Good |
| Scharr | 3×3 | Low | Better rotational symmetry |
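Using any 2D correlation routine (here SciPy's, which matches the sliding-window formula above), the Sobel gradients, magnitude, and direction can be computed as in this sketch:

```python
import numpy as np
from scipy.ndimage import correlate   # matches the sliding-window formula above

Gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Gy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def sobel_edges(image):
    """Return gradient magnitude and direction for a grayscale image."""
    img = image.astype(np.float64)
    ix = correlate(img, Gx, mode="nearest")    # horizontal gradient
    iy = correlate(img, Gy, mode="nearest")    # vertical gradient
    return np.hypot(ix, iy), np.arctan2(iy, ix)

# Vertical step edge -> strong response from Gx along the boundary
img = np.zeros((6, 6))
img[:, 3:] = 255.0
magnitude, direction = sobel_edges(img)
print(magnitude.max())
```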
Diagram: Edge Detection Visualizer
Run the Edge Detection Visualizer Fullscreen
Edge Detection Visualizer
Type: microsim
Learning objective: Understand how Sobel operators detect edges in different orientations (Bloom: Analyze)
Visual elements: - Original grayscale image - Gx result (horizontal gradient) with positive/negative coloring - Gy result (vertical gradient) with positive/negative coloring - Gradient magnitude image - Gradient direction overlay (as arrows or color wheel)
Interactive controls: - Image selector: Simple shapes, photos, text - Toggle: Show Gx, Gy, magnitude, direction - Slider: Threshold for edge display - Toggle: Show gradient arrows at sample points - Dropdown: Edge detector (Sobel, Prewitt, Scharr)
Canvas layout: 850x650px with multi-panel display
Behavior: - Compute and display gradient components - Color-code positive (white) and negative (black) gradients - Threshold slider converts magnitude to binary edges - Arrow overlay shows gradient direction at grid points
Implementation: p5.js with Sobel convolution
Edge Detection Pipeline
Complete edge detection typically involves:
1. Smoothing: Gaussian blur to reduce noise
2. Gradient computation: Sobel or similar operator
3. Non-maximum suppression: Thin edges to single-pixel width
4. Thresholding: Keep only strong edges (hysteresis thresholding in Canny)
The Canny edge detector implements this full pipeline and remains widely used.
Fourier Transform for Images
The Fourier transform represents images in the frequency domain, revealing periodic patterns and enabling frequency-based filtering.
The 2D Discrete Fourier Transform
For an \(M \times N\) image, the 2D DFT is:
2D Discrete Fourier Transform
\(F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \cdot e^{-2\pi i \left(\frac{ux}{M} + \frac{vy}{N}\right)}\)
where:
- \(f(x, y)\) is the spatial domain image
- \(F(u, v)\) is the frequency domain representation
- \((u, v)\) are frequency coordinates
- Low frequencies (near center) represent gradual changes
- High frequencies (near edges) represent rapid changes
The inverse transform recovers the original image:
\(f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v) \cdot e^{2\pi i \left(\frac{ux}{M} + \frac{vy}{N}\right)}\)
Frequency Domain Interpretation
| Frequency Region | Spatial Meaning |
|---|---|
| DC component (center) | Average brightness |
| Low frequencies | Smooth regions, gradual changes |
| High frequencies | Edges, textures, fine detail |
| Specific frequency | Periodic patterns (e.g., stripes) |
The magnitude spectrum \(|F(u,v)|\) shows which frequencies are present.
The phase spectrum \(\angle F(u,v)\) encodes spatial structure—surprisingly, phase carries more perceptual information than magnitude.
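NumPy's FFT routines compute these sums directly. The sketch below (synthetic stripe image for illustration) produces a log-scaled magnitude spectrum with the DC component shifted to the center, along with the phase, and verifies the round trip through the inverse transform:

```python
import numpy as np

img = np.zeros((64, 64))
img[:, ::8] = 1.0                        # vertical stripes: a periodic pattern

F = np.fft.fft2(img)                     # 2D DFT
F_shifted = np.fft.fftshift(F)           # move DC component to the center
magnitude = np.log1p(np.abs(F_shifted))  # log scale for visibility
phase = np.angle(F_shifted)

print(magnitude.shape)                   # (64, 64)

# Round trip: the inverse transform recovers the original image
img_back = np.fft.ifft2(F).real
print(np.allclose(img_back, img))        # True
```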
Diagram: Fourier Transform Visualizer
Run the Fourier Transform Visualizer Fullscreen
Fourier Transform Visualizer
Type: microsim
Learning objective: Connect spatial image features to frequency domain representation (Bloom: Analyze)
Visual elements: - Original image (left) - Magnitude spectrum (center, log-scaled for visibility) - Phase spectrum (right, as color wheel or grayscale) - Reconstructed image after filtering
Interactive controls: - Image selector: Stripes, checkerboard, photo, text - Frequency filter: Low-pass, high-pass, band-pass (adjustable radius) - Toggle: Show magnitude/phase/both - Slider: Filter cutoff frequency - Button: Apply filter and show reconstruction
Canvas layout: 900x600px with three-panel display
Behavior: - Compute and display FFT magnitude (log scale) - Interactive frequency selection by clicking on spectrum - Show effect of zeroing selected frequencies - Real-time filter preview on reconstruction
Implementation: p5.js with FFT library (or pre-computed examples)
Frequency Domain Filtering
Filtering in the frequency domain involves:
1. Compute the FFT of the image: \(F = \mathcal{F}(f)\)
2. Multiply by the filter: \(G = F \cdot H\)
3. Compute the inverse FFT: \(g = \mathcal{F}^{-1}(G)\)
Common frequency domain filters:
| Filter | Frequency Mask | Effect |
|---|---|---|
| Ideal low-pass | 1 if \(\sqrt{u^2+v^2} < r\), else 0 | Blur (with ringing) |
| Gaussian low-pass | \(e^{-(u^2+v^2)/2\sigma^2}\) | Smooth blur |
| Ideal high-pass | 0 if \(\sqrt{u^2+v^2} < r\), else 1 | Edge enhancement |
| Notch filter | 0 at specific frequencies | Remove periodic noise |
The convolution theorem connects spatial and frequency domains:
\(f * g = \mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g))\)
Convolution in spatial domain equals multiplication in frequency domain.
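The three-step recipe above translates almost line for line into NumPy. The sketch below applies a Gaussian low-pass mask in the frequency domain (the σ value and image are illustrative):

```python
import numpy as np

def gaussian_lowpass(image, sigma=10.0):
    """Blur an image by attenuating high frequencies in the Fourier domain."""
    H, W = image.shape
    F = np.fft.fftshift(np.fft.fft2(image))           # 1. FFT (DC at center)

    u = np.arange(H) - H // 2
    v = np.arange(W) - W // 2
    vv, uu = np.meshgrid(v, u)
    mask = np.exp(-(uu**2 + vv**2) / (2 * sigma**2))  # Gaussian low-pass H(u,v)

    G = F * mask                                       # 2. multiply by filter
    return np.fft.ifft2(np.fft.ifftshift(G)).real      # 3. inverse FFT

img = np.random.rand(128, 128)
blurred = gaussian_lowpass(img, sigma=8.0)
print(blurred.shape)
```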
Image Compression
Image compression reduces storage size by exploiting redundancies in image data. Linear algebra provides key tools for lossy compression.
SVD-Based Compression
The Singular Value Decomposition represents an image as a sum of rank-1 matrices:
\(\mathbf{I} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\)
where:
- \(\sigma_i\) are singular values (in decreasing order)
- \(\mathbf{u}_i, \mathbf{v}_i\) are left and right singular vectors
- \(r\) is the rank of the image matrix
Keeping only the \(k\) largest singular values gives the best rank-\(k\) approximation:
\(\mathbf{I}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\)
| Original Storage | Compressed Storage | Compression Ratio |
|---|---|---|
| \(H \times W\) values | \(k(H + W + 1)\) values | \(\frac{HW}{k(H+W+1)}\) |
For a 1000×1000 image with \(k=50\): compression ratio ≈ 10:1.
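A minimal sketch of rank-k truncation with NumPy (the test image is random noise only to keep the example self-contained; a real photograph compresses far better at the same k):

```python
import numpy as np

def svd_compress(image, k):
    """Best rank-k approximation of a grayscale image (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(image.astype(np.float64), full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

H, W, k = 256, 256, 20
img = np.random.rand(H, W)           # placeholder image
approx = svd_compress(img, k)

original_storage = H * W
compressed_storage = k * (H + W + 1)
print(f"compression ratio: {original_storage / compressed_storage:.1f}:1")
print(f"relative error: {np.linalg.norm(img - approx) / np.linalg.norm(img):.3f}")
```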
Diagram: SVD Compression Visualizer
Run the SVD Compression Visualizer Fullscreen
SVD Compression Visualizer
Type: microsim
Learning objective: Understand how SVD achieves image compression by truncating singular values (Bloom: Apply)
Visual elements: - Original image - Reconstructed image from k singular values - Singular value plot (bar chart or line, log scale) - Error image (original - reconstructed) - Compression statistics display
Interactive controls: - Slider: Number of singular values k (1 to full rank) - Image selector: Choose sample image - Toggle: Show error image - Toggle: Log scale for singular values - Display: Compression ratio, PSNR, storage size
Canvas layout: 850x650px with image comparison and statistics
Behavior: - Real-time reconstruction as k changes - Show how image quality improves with more singular values - Display storage calculation - Highlight "knee" in singular value curve
Implementation: p5.js with pre-computed SVD (for performance)
Transform Coding (JPEG)
JPEG compression uses the Discrete Cosine Transform (DCT):
1. Divide the image into 8×8 blocks
2. Apply the 2D DCT to each block
3. Quantize the coefficients (lossy step)
4. Entropy encode the result (lossless)
The DCT concentrates energy in low-frequency coefficients, which are preserved, while high-frequency coefficients (fine detail) are quantized more aggressively.
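A rough sketch of the per-block transform step (SciPy assumed for the DCT; zeroing all but the 4×4 low-frequency corner is a crude stand-in for JPEG's actual quantization tables):

```python
import numpy as np
from scipy.fft import dctn, idctn            # 2D DCT, as used per 8x8 JPEG block

block = np.random.rand(8, 8) * 255           # one 8x8 image block (synthetic)
coeffs = dctn(block, norm="ortho")           # energy concentrates at top-left

# Crude "quantization": keep only the 4x4 low-frequency corner
kept = np.zeros_like(coeffs)
kept[:4, :4] = coeffs[:4, :4]
reconstructed = idctn(kept, norm="ortho")

print(np.abs(coeffs)[:2, :2])                # large low-frequency magnitudes
print(np.abs(block - reconstructed).mean())  # modest reconstruction error
```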
Comparison of Compression Approaches
| Method | Type | Compression | Quality Control |
|---|---|---|---|
| SVD truncation | Lossy | Variable | Number of singular values |
| JPEG (DCT) | Lossy | High | Quality factor (1-100) |
| PNG (lossless) | Lossless | Moderate | N/A |
| WebP | Both | High | Quality factor |
Feature Detection
Feature detection identifies distinctive points in images that can be matched across views, enabling applications like image stitching, object recognition, and 3D reconstruction.
What Makes a Good Feature?
A good feature should be:
- Distinctive: Different from its neighbors
- Repeatable: Detectable despite changes in viewpoint, lighting, scale
- Localizable: Precisely positioned
| Feature Type | Detection Method | Invariance |
|---|---|---|
| Corners | Harris, Shi-Tomasi | Rotation |
| Blobs | DoG (SIFT), Hessian (SURF) | Rotation, Scale |
| Edges | Canny, Sobel | Limited |
Harris Corner Detection
The Harris corner detector uses the structure tensor (second moment matrix):
Structure Tensor
\(\mathbf{M} = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}\)
where:
- \(I_x, I_y\) are image derivatives
- \(W\) is a local window (often Gaussian-weighted)
- The eigenvalues of \(\mathbf{M}\) characterize the local structure
| Eigenvalue Pattern | Structure | Feature Type |
|---|---|---|
| Both small | Flat region | Not a feature |
| One large, one small | Edge | Edge point |
| Both large | Corner | Good feature |
The Harris corner response:
\(R = \det(\mathbf{M}) - k \cdot \text{trace}(\mathbf{M})^2 = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2\)
where \(k \approx 0.04\). Corners have high \(R\) values.
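The structure tensor and Harris response can be assembled from image gradients in a few lines. A sketch assuming SciPy for the derivatives and the Gaussian window (the parameter values are common defaults, not prescriptions):

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_response(image, sigma=1.0, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 at every pixel."""
    img = image.astype(np.float64)
    Ix = sobel(img, axis=1)                # derivative along x (columns)
    Iy = sobel(img, axis=0)                # derivative along y (rows)

    # Gaussian-weighted window sums -> entries of the structure tensor M
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)

    det_M = Ixx * Iyy - Ixy**2
    trace_M = Ixx + Iyy
    return det_M - k * trace_M**2

# A bright square on a dark background has four strong corners
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
print(np.unravel_index(np.argmax(R), R.shape))   # location of strongest corner
```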
Diagram: Corner Detection Visualizer
Run the Corner Detection Visualizer Fullscreen
Corner Detection Visualizer
Type: microsim
Learning objective: Understand how eigenvalues of the structure tensor identify corners (Bloom: Analyze)
Visual elements: - Input image with detected corners marked - Structure tensor ellipse visualization at selected points - Eigenvalue scatter plot (λ1 vs λ2) - Corner response heat map
Interactive controls: - Image selector: Checkerboard, building, shapes - Slider: Harris k parameter (0.01 to 0.1) - Slider: Response threshold - Click on image to show local structure tensor - Toggle: Show response heat map vs corner points
Canvas layout: 850x650px with image and analysis panels
Behavior: - Compute Harris response for entire image - Mark corners above threshold - Click on pixel to show structure tensor as ellipse - Ellipse axes aligned with eigenvectors, lengths proportional to eigenvalues
Implementation: p5.js with gradient and matrix computation
Scale-Invariant Features
SIFT (Scale-Invariant Feature Transform) detects features at multiple scales:
1. Build a scale space using Gaussian blur at multiple \(\sigma\) values
2. Detect blob-like structures as extrema of the Difference-of-Gaussians (DoG)
3. Compute an orientation from local gradients
4. Build a descriptor from gradient histograms
The 128-dimensional SIFT descriptor is robust to scale, rotation, and moderate viewpoint changes.
Homography
A homography is a projective transformation between two planes, essential for perspective correction, panorama stitching, and augmented reality.
The Homography Matrix
A homography maps points from one image to another via a 3×3 matrix:
Homography Transformation
\(\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \sim \mathbf{H} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
where:
- \((x, y)\) and \((x', y')\) are corresponding points in homogeneous coordinates
- \(\sim\) denotes equality up to scale
- \(\mathbf{H}\) has 8 degrees of freedom (9 elements minus 1 for scale)
To recover Cartesian coordinates:
\(x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \quad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}\)
Estimating Homographies
Given \(n \geq 4\) point correspondences, we can estimate \(\mathbf{H}\) by solving a linear system.
Each correspondence \((x_i, y_i) \leftrightarrow (x_i', y_i')\) provides two equations:
\(\begin{bmatrix} x_i & y_i & 1 & 0 & 0 & 0 & -x_i'x_i & -x_i'y_i & -x_i' \\ 0 & 0 & 0 & x_i & y_i & 1 & -y_i'x_i & -y_i'y_i & -y_i' \end{bmatrix} \mathbf{h} = \mathbf{0}\)
With four or more correspondences, stack these equations into a coefficient matrix \(\mathbf{A}\) and solve \(\mathbf{A}\mathbf{h} = \mathbf{0}\) using the SVD: \(\mathbf{h}\) is the right singular vector associated with the smallest singular value (the approximate null space of \(\mathbf{A}\)).
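A sketch of this direct linear transform (DLT) estimation: stack two rows per correspondence, take the SVD, and read off the singular vector for the smallest singular value (the point coordinates are illustrative):

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H from n >= 4 point correspondences via the DLT."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    A = np.array(rows, dtype=np.float64)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                         # right singular vector, smallest sigma
    return (h / h[-1]).reshape(3, 3)   # fix the scale so h33 = 1

# Unit square mapped to a trapezoid (simulated perspective view)
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (2, 0.2), (1.8, 1.1), (0.1, 1.0)]
H = estimate_homography(src, dst)

p = np.array([1.0, 1.0, 1.0])          # homogeneous source point
q = H @ p
print(q[:2] / q[2])                    # approx (1.8, 1.1)
```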
Transformation Hierarchy
Homographies are part of a hierarchy of 2D transformations:
| Transformation | DOF | Preserves | Matrix Form |
|---|---|---|---|
| Translation | 2 | Everything | \([\mathbf{I} \mid \mathbf{t}]\) |
| Euclidean (rigid) | 3 | Distances, angles | \([\mathbf{R} \mid \mathbf{t}]\) |
| Similarity | 4 | Angles, ratios | \([s\mathbf{R} \mid \mathbf{t}]\) |
| Affine | 6 | Parallelism | \([\mathbf{A} \mid \mathbf{t}]\) |
| Projective (homography) | 8 | Straight lines | General 3×3 |
Diagram: Homography Transformation Demo
Run the Homography Transformation Demo Fullscreen
Homography Transformation Demo
Type: microsim
Learning objective: Understand how homographies transform images for perspective correction (Bloom: Apply)
Visual elements: - Source image with draggable corner points - Transformed image showing perspective effect - Homography matrix display - Grid overlay showing transformation
Interactive controls: - Drag corners of source quadrilateral - Preset buttons: Perspective, rotation, shear, stretch - Button: "Reset to identity" - Toggle: Show transformation grid - Display: Homography matrix values
Canvas layout: 800x600px with side-by-side source and result
Behavior: - Compute homography from corner positions in real-time - Apply inverse warp to display transformed image - Show how parallel lines may not remain parallel - Demonstrate perspective correction (make rectangle from trapezoid)
Implementation: p5.js with bilinear interpolation for warping
Applications of Homography
| Application | How Homography Is Used |
|---|---|
| Panorama stitching | Align overlapping images |
| Perspective correction | Straighten tilted documents/signs |
| Augmented reality | Overlay virtual objects on planar surfaces |
| Sports graphics | Insert ads on playing field |
| Image rectification | Correct lens distortion |
Python Implementation
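A compact end-to-end sketch tying several of the chapter's operations together, using OpenCV and NumPy (both assumed installed via pip install opencv-python numpy; the image is synthetic so the script runs without any files):

```python
import cv2
import numpy as np

# --- Synthetic grayscale test image: a bright square on a dark background ---
img = np.zeros((256, 256), dtype=np.uint8)
img[64:192, 64:192] = 200

# --- Filtering by convolution ---
blurred = cv2.GaussianBlur(img, (5, 5), 1.5)                 # low-pass
sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
sharpened = cv2.filter2D(img, -1, sharpen_kernel)            # edge boost

# --- Sobel gradients and Canny edges ---
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.hypot(gx, gy)
edges = cv2.Canny(img, 50, 150)                              # full edge pipeline

# --- SVD-based compression of the blurred image ---
U, s, Vt = np.linalg.svd(blurred.astype(np.float64), full_matrices=False)
k = 20
compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# --- Homography: warp the square into a trapezoid (perspective effect) ---
src = np.float32([[64, 64], [192, 64], [192, 192], [64, 192]])
dst = np.float32([[80, 70], [180, 60], [200, 200], [60, 190]])
H = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(img, H, (256, 256))

print("edge pixels:", int(np.count_nonzero(edges)))
print("rank-20 reconstruction error:",
      round(float(np.linalg.norm(blurred - compressed)), 2))
```

With a real photograph, replace the synthetic array with cv2.imread("photo.png", cv2.IMREAD_GRAYSCALE) and the same pipeline applies unchanged.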
Summary
Image processing and computer vision are built on linear algebra foundations:
Image Representation:
- Images are matrices (grayscale) or tensors (color)
- RGB images have three channels; other color spaces serve specific purposes
- Color space transforms are matrix operations
Filtering and Convolution:
- Convolution applies kernels to extract features
- Blur filters average neighbors; sharpen filters enhance differences
- Edge detection uses derivative-approximating kernels like Sobel
Frequency Domain:
- Fourier transform reveals periodic structure
- Low frequencies encode smooth regions; high frequencies encode edges
- Frequency filtering multiplies spectra
Compression:
- SVD truncation provides rank-based compression
- Transform coding (DCT/JPEG) exploits frequency concentration
Features and Geometry:
- Corner detection uses structure tensor eigenvalues
- Homographies are projective transformations between planes
- 4 point correspondences determine a homography
Self-Check Questions
Why does Gaussian blur produce smoother results than box blur?
Box blur weights all pixels in the kernel equally, creating abrupt transitions at the kernel boundary. This causes visible artifacts, especially in smooth gradients. Gaussian blur weights central pixels more heavily, with weights falling off smoothly according to the Gaussian function. This smooth weighting produces gradual blending without sharp discontinuities. Additionally, the Gaussian kernel is separable, meaning a 2D Gaussian blur can be computed as two 1D blurs, making it computationally efficient.
How does the Sobel operator combine smoothing with differentiation?
The Sobel kernels have a specific structure: they can be decomposed as the outer product of a smoothing filter [1, 2, 1] and a difference filter [-1, 0, 1]. For example, Gx = [1, 2, 1]^T × [-1, 0, 1]. The [1, 2, 1] component provides Gaussian-like smoothing perpendicular to the derivative direction, reducing noise sensitivity. The [-1, 0, 1] component computes the derivative. This combination makes Sobel more robust than simple difference operators while maintaining good edge localization.
Why does keeping the largest singular values give the best low-rank approximation?
The Eckart-Young-Mirsky theorem proves that the truncated SVD gives the optimal low-rank approximation in both Frobenius and spectral norms. Intuitively, each singular value represents the "energy" captured by that component—larger singular values correspond to more significant patterns in the image. By keeping the largest singular values, we retain the most important structure while discarding fine details associated with smaller singular values. The squared approximation error (in the Frobenius norm) equals the sum of the squared discarded singular values.
What is the geometric meaning of the Harris corner response being large?
The Harris response R = λ₁λ₂ - k(λ₁ + λ₂)² is large when both eigenvalues of the structure tensor are large. The eigenvalues measure how quickly image intensity changes in the principal directions. At a corner, intensity changes significantly in both directions (both eigenvalues large). At an edge, intensity changes in only one direction (one eigenvalue large, one small). In a flat region, intensity is constant (both eigenvalues small). The product λ₁λ₂ in the response ensures both must be large.