Finding Files, Archives and Compression
Summary
This chapter teaches you to search for files efficiently and then compress files and create archives. You'll master powerful file search techniques with find, locate, which, and whereis, then learn gzip, bzip2, tar, and zip commands along with different compression algorithms and their trade-offs. These skills are essential for finding files quickly and managing disk space.
Concepts Covered
This chapter covers the following 30 concepts from the learning graph:
- Find Command
- Find by Name
- Find by Type
- Find by Size
- Find by Time
- Find with Exec
- Locate Command
- Updatedb Command
- Which Command
- Whereis Command
- Type Command
- File Search Patterns
- Recursive Search
- Search Optimization
- Index Databases
- File Compression
- Gzip Command
- Gunzip Command
- Bzip2 Command
- Xz Command
- Tar Command
- Tar Create
- Tar Extract
- Tar Options
- Zip Command
- Unzip Command
- Archive Formats
- Compression Ratios
- 7zip Command
- File Archiving
Prerequisites
This chapter builds on concepts from:
- Chapter 4: File System Fundamentals
- Chapter 5: File Operations and Manipulation
- Chapter 6: Advanced File Operations
The "BOF" Story
Early in my career I managed a small group of about 14 engineers. Our team shared a file system that was always running out of space, and we spent a large part of our team meetings trying to manage the disk space. One day I got tired of all the drama and sat down and wrote a program that created a Big-old-file scorecard. It didn't just find the largest files or the oldest files; it created a weighted score that combined the two. Our meetings became much more productive, and we could then automate the deletion of big old files on a regular basis.
What I learned is that teams that manage resources need great tools to find specific files and then do tasks like compress, archive or delete old files.
Find It and Squeeze It!
Ever wondered how to find that one config file you edited three weeks ago somewhere in your home directory? Or how you can send a 500MB folder as a 50MB attachment? This chapter has you covered!
You'll learn to search through thousands of files in seconds, then compress files like a boss and create tidy archives. These are the skills that separate casual users from command-line ninjas. Let's get seeking and squeezing!
Part 1: Finding Files
The first half of this chapter covers the essential tools for locating files on your Linux system.
The Find Command: The Search Swiss Army Knife
The find command is the most powerful file search tool on Linux. It searches directories recursively, matching files by name, type, size, time, permissions, and more.
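As a minimal sketch of the general shape `find [where] [tests] [action]`, the example below builds a scratch directory (all names are made up for illustration) and searches it, so it is safe to run anywhere.

```shell
# Scratch tree so the example does not touch your own files (names are made up).
demo=$(mktemp -d)
mkdir -p "$demo/docs/old" "$demo/src"
touch "$demo/docs/notes.txt" "$demo/docs/old/draft.txt" "$demo/src/main.c"

# With no tests, find lists the starting directory and everything beneath it.
find "$demo"

# Add a test to narrow the results; printing the path is the default action.
txt_count=$(find "$demo" -name "*.txt" | wc -l)
echo "text files found: $txt_count"

rm -rf "$demo"
```

Swap `"$demo"` for `.`, `~`, or `/var/log` to search real locations.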
Find by Name: Matching Filenames
Find by name searches using patterns with wildcards.
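Here is a small sketch of name matching on a scratch directory; the file names are invented for the demo. Note the quotes around each pattern: they stop the shell from expanding the wildcard before find sees it.

```shell
demo=$(mktemp -d)
touch "$demo/report.txt" "$demo/REPORT-2.TXT" "$demo/report.md" \
      "$demo/data1.csv" "$demo/data22.csv"

exact=$(find "$demo" -name "report.txt" | wc -l)          # exact name
any_case=$(find "$demo" -iname "report*.txt" | wc -l)     # -iname ignores case
one_char=$(find "$demo" -name "data?.csv" | wc -l)        # ? is exactly one character
not_csv=$(find "$demo" -type f ! -name "*.csv" | wc -l)   # ! negates a test

echo "exact=$exact any_case=$any_case one_char=$one_char not_csv=$not_csv"
rm -rf "$demo"
```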
Find by Type: Files, Directories, and More
Find by type filters by what kind of filesystem object you want.
Type Options
| Type | Meaning |
|---|---|
| `f` | Regular file |
| `d` | Directory |
| `l` | Symbolic link |
| `b` | Block device |
| `c` | Character device |
| `p` | Named pipe (FIFO) |
| `s` | Socket |
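The three types you will use most often can be tried out on a scratch directory. This is a sketch with made-up names; `-mindepth 1` is only there so the scratch directory itself is not counted.

```shell
demo=$(mktemp -d)
mkdir "$demo/logs"
touch "$demo/app.conf"
ln -s "$demo/app.conf" "$demo/current.conf"        # symbolic link to the file

files=$(find "$demo" -type f | wc -l)              # regular files only
dirs=$(find "$demo" -mindepth 1 -type d | wc -l)   # directories below the start point
links=$(find "$demo" -type l | wc -l)              # symbolic links (not followed by default)

echo "files=$files dirs=$dirs links=$links"
rm -rf "$demo"
```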
Find by Size: Big Files, Small Files
Find by size locates files based on their size.
Size Units
| Unit | Meaning |
|---|---|
| `c` | Bytes |
| `k` | Kilobytes |
| `M` | Megabytes |
| `G` | Gigabytes |
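A quick sketch of size tests, assuming GNU find and using scratch files of known size (the sizes are arbitrary). A `+` prefix means "more than" and a `-` prefix means "less than".

```shell
demo=$(mktemp -d)
head -c 2048    /dev/zero > "$demo/small.bin"     # 2 KiB
head -c 3145728 /dev/zero > "$demo/big.bin"       # 3 MiB
: > "$demo/empty.log"                             # 0 bytes

larger=$(find "$demo" -type f -size +1M)               # more than 1 MiB
smaller=$(find "$demo" -type f -size -100k | wc -l)    # less than 100 KiB
empty=$(find "$demo" -type f -empty)                   # zero-length files

# Pitfall: GNU find rounds sizes UP to whole units before comparing,
# so "-size -1M" matches only empty files. Use -size -1024k instead.

echo "larger: $larger"
echo "smaller count: $smaller"
echo "empty: $empty"
larger_name=$(basename "$larger")
empty_name=$(basename "$empty")
rm -rf "$demo"
```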
Finding Disk Space Hogs
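One way to rank the biggest files is to pair find with `du`, `sort`, and `head`. The sketch below runs that pipeline on a scratch directory with made-up files; point it at `/` or your home directory for real use.

```shell
demo=$(mktemp -d)
head -c 1048576 /dev/urandom > "$demo/one.bin"    # about 1 MiB
head -c 5242880 /dev/urandom > "$demo/five.bin"   # about 5 MiB
head -c 1024    /dev/urandom > "$demo/tiny.bin"   # about 1 KiB

# du -h prints human-readable sizes, sort -rh orders them largest first,
# and head keeps only the top entries. 2>/dev/null hides permission errors.
top=$(find "$demo" -type f -exec du -h {} + 2>/dev/null | sort -rh | head -n 2)
echo "$top"

biggest=$(basename "$(echo "$top" | head -n 1 | awk '{print $2}')")
rm -rf "$demo"
```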
Find by Time: When Was It Modified?
Find by time searches based on when files were last accessed, modified, or had their metadata changed. (Note that `-ctime` is the metadata change time, not the creation time.)
Time Options
| Option | Meaning | Unit |
|---|---|---|
| `-mtime` | Modification time | Days |
| `-atime` | Access time | Days |
| `-ctime` | Change time (metadata) | Days |
| `-mmin` | Modification time | Minutes |
| `-amin` | Access time | Minutes |
| `-cmin` | Change time | Minutes |
| `-newer` | Newer than file | - |
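The time tests can be tried out by backdating scratch files. This sketch assumes GNU `touch -d`; the file names are made up. A `-` prefix means "less than N ago" and `+` means "more than N ago".

```shell
demo=$(mktemp -d)
touch "$demo/fresh.log"                       # modified just now
touch -d "10 days ago" "$demo/stale.log"      # GNU touch: backdate the mtime
touch -d "2 hours ago" "$demo/marker"         # reference file for -newer

recent=$(basename "$(find "$demo" -name "*.log" -mtime -7)")      # last 7 days
old=$(basename "$(find "$demo" -name "*.log" -mtime +7)")         # older than 7 days
last_hour=$(basename "$(find "$demo" -name "*.log" -mmin -60)")   # last 60 minutes
newer=$(basename "$(find "$demo" -name "*.log" -newer "$demo/marker")")

echo "recent=$recent old=$old last_hour=$last_hour newer=$newer"
rm -rf "$demo"
```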
Find with Exec: Take Action
Find with exec runs commands on each file found. This is incredibly powerful!
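A hedged sketch of `-exec` on scratch files (names invented for the demo). A good habit is to print the matches first and only then swap in a destructive command such as `rm`.

```shell
demo=$(mktemp -d)
printf 'one\ntwo\n' > "$demo/a.txt"
printf 'three\n'    > "$demo/b.txt"
touch "$demo/scratch.tmp"

# Run a command once per match: {} is the matched file, \; ends the command.
find "$demo" -name "*.txt" -exec wc -l {} \;

# Preview first, then act once the list looks right.
find "$demo" -name "*.tmp" -print
find "$demo" -name "*.tmp" -exec rm {} \;

left=$(find "$demo" -name "*.tmp" | wc -l)
echo "tmp files left: $left"
rm -rf "$demo"
```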
Exec Syntax
- `{}` is replaced with the filename
- `\;` ends the command (must be escaped)
- Use `+` instead of `\;` to batch files (faster)
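The difference between the two terminators is easy to see with `echo`, which prints one line each time it runs. A small sketch on scratch files:

```shell
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.txt"

# \; starts one process per file: echo runs three times and prints three lines.
per_file=$(find "$demo" -name "*.txt" -exec echo {} \; | wc -l)

# + packs as many files as fit onto one command line: echo runs once, one line.
batched=$(find "$demo" -name "*.txt" -exec echo {} + | wc -l)

echo "per_file=$per_file batched=$batched"
rm -rf "$demo"
```

Fewer process launches is where the speed-up comes from, which matters once you match thousands of files.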
Exec vs Xargs
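Both approaches run a command on find's results. The sketch below shows each on scratch files, one of which has a space in its name to illustrate why `-print0` and `xargs -0` belong together.

```shell
demo=$(mktemp -d)
printf 'x\n' > "$demo/plain.txt"
printf 'y\n' > "$demo/with space.txt"

# -exec ... + hands filenames to the command directly, so odd names are safe.
via_exec=$(find "$demo" -name "*.txt" -exec cat {} + | wc -l)

# xargs reads names from a pipe. -print0 and -0 separate names with a NUL byte,
# so spaces and newlines inside filenames are not mistaken for separators.
via_xargs=$(find "$demo" -name "*.txt" -print0 | xargs -0 cat | wc -l)

echo "via_exec=$via_exec via_xargs=$via_xargs"
rm -rf "$demo"
```

As a rule of thumb, `-exec` is simpler, while xargs adds extras such as running jobs in parallel with `-P`.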
The Locate Command: Instant Search
The locate command finds files instantly by searching a pre-built database. It's MUCH faster than find for simple name searches.
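A few typical invocations, as a sketch. The search terms are arbitrary, and the block is guarded because locate is not installed everywhere and its database may not exist yet.

```shell
if command -v locate >/dev/null 2>&1; then
    locate_state="installed"
    locate -l 5 "bashrc"  2>/dev/null || true   # first 5 paths containing "bashrc"
    locate -i -l 5 "readme" 2>/dev/null || true # -i ignores case
    locate -c "*.conf"    2>/dev/null || true   # -c prints only the number of matches
else
    locate_state="missing"
    echo "locate is not installed (the package is often named plocate or mlocate)"
fi
```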
Locate Limitations
- Only searches by filename (not content, size, time)
- Database might be outdated (files added/deleted since last update)
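- Doesn't show everything: paths excluded in the updatedb configuration are never indexed, and common implementations hide files your user isn't permitted to see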
The Updatedb Command: Refresh the Database
The updatedb command rebuilds the database that locate uses.
The database is typically updated daily by a scheduled job (cron or a systemd timer). If you just created files and locate doesn't find them, run updatedb manually; this usually requires root, as in `sudo updatedb`.
Index Databases: How Locate Works
Index databases store file paths for quick searching. Here's how it works:
- `updatedb` scans the filesystem
- It builds a database of all file paths
- `locate` searches this database (not the filesystem)
- The database is typically at `/var/lib/mlocate/mlocate.db`
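The idea behind those steps can be imitated with a plain text file. This is a toy illustration only, not how the real database is stored: find plays the part of updatedb, and grep plays the part of locate.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/tree/etc" "$demo/tree/home/user"
touch "$demo/tree/etc/app.conf" "$demo/tree/home/user/notes.txt"

# "updatedb" in spirit: walk the tree once and save every path.
find "$demo/tree" > "$demo/index.txt"

# "locate" in spirit: search the saved list instead of the disk.
hit=$(grep -c 'notes\.txt$' "$demo/index.txt" || true)

# The index goes stale: a file created after the scan is invisible until a rescan.
touch "$demo/tree/home/user/new.txt"
stale=$(grep -c 'new\.txt$' "$demo/index.txt" || true)

echo "hit=$hit stale=$stale"
rm -rf "$demo"
```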
Configure What Gets Indexed
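Indexing is controlled by `/etc/updatedb.conf`, which is a list of shell-style variable assignments. The values below are illustrative assumptions, not the defaults of any particular distribution, so compare them with the file on your own system before editing it.

```shell
# /etc/updatedb.conf (illustrative values only)
PRUNE_BIND_MOUNTS="yes"                 # skip bind mounts
PRUNEFS="nfs nfs4 proc sysfs tmpfs"     # filesystem types to skip
PRUNENAMES=".git .svn .hg"              # directory names to skip wherever they appear
PRUNEPATHS="/tmp /var/spool /media"     # absolute paths to skip
```

After changing the file, rebuild the database with updatedb for the new rules to take effect.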
The Which Command: Find Commands
The which command shows the full path of executable commands.
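A short sketch of everyday use. The block is guarded because a few minimal systems ship without `which`; `command -v` is the shell's own equivalent and is always available.

```shell
if command -v which >/dev/null 2>&1; then
    which ls                   # path of the ls that would run
    which -a sh                # -a shows every match on PATH, in order
    which no-such-command-xyz || echo "no match: which exits non-zero"
fi

ls_path=$(command -v ls)       # built into the shell, no extra program needed
echo "ls lives at: $ls_path"
```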
The Whereis Command: Find Binaries, Source, and Docs
The whereis command finds the binary, source, and man page for a command.
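A minimal sketch, guarded in case whereis is not installed. The exact paths printed depend on your system.

```shell
whereis_out="ls: (whereis not installed)"
if command -v whereis >/dev/null 2>&1; then
    whereis ls                       # binary, source, and man page locations
    whereis -m ls                    # -m: man pages only
    whereis_out=$(whereis -b ls)     # -b: binaries only
fi
echo "$whereis_out"
```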
Which vs Whereis
| Command | Finds | Searches |
|---|---|---|
| `which` | Executable only | PATH |
| `whereis` | Binary, source, man | Standard locations |
The Type Command: What Kind of Command?
The type command tells you what kind of command something is.
Command Types
| Type | Meaning |
|---|---|
| alias | Shell alias |
| builtin | Shell built-in command |
| file | External executable |
| function | Shell function |
| keyword | Shell keyword |
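To see these categories for yourself, try a few names. This is a sketch; what `ls` reports varies, since interactive shells often alias it.

```shell
type cd       # a shell builtin
type if       # a shell keyword
type ls       # an external file, or an alias in many interactive shells
cd_kind=$(type cd)

# bash only: -t prints just the one-word category, -a lists every definition.
if command -v bash >/dev/null 2>&1; then
    bash -c 'type -t cd; type -t if; type -a echo'
fi
```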
File Search Patterns: Wildcards and Globs
File search patterns use wildcards (globs) to match multiple files.
Pattern Characters
| Pattern | Matches |
|---|---|
| `*` | Zero or more characters |
| `?` | Exactly one character |
| `[abc]` | One of: a, b, or c |
| `[a-z]` | One character in range |
| `[!abc]` | NOT one of: a, b, or c |
Examples
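Here is a sketch that exercises each pattern character with find on scratch files (names invented for the demo):

```shell
demo=$(mktemp -d)
touch "$demo/file1.txt" "$demo/file2.txt" "$demo/file10.txt" \
      "$demo/fileA.txt" "$demo/notes.md"

star=$(find "$demo" -name "*.txt" | wc -l)               # anything ending in .txt
question=$(find "$demo" -name "file?.txt" | wc -l)       # one character after "file"
range=$(find "$demo" -name "file[0-9].txt" | wc -l)      # one digit
negated=$(find "$demo" -name "file[!0-9].txt" | wc -l)   # one character that is not a digit

echo "star=$star question=$question range=$range negated=$negated"
rm -rf "$demo"
```

Notice that `file10.txt` matches `*` but not `?` or `[0-9]`, because those match exactly one character.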
Recursive Search: Going Deep
Recursive search descends into subdirectories. find covers the whole directory tree by default; other tools need a flag, such as `grep -r` or `ls -R`.
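A sketch of recursion and how to limit it, using a small scratch tree. The starting directory counts as depth 0, so `-maxdepth 1` means "this directory only".

```shell
demo=$(mktemp -d)
mkdir -p "$demo/a/b/c"
touch "$demo/top.txt" "$demo/a/one.txt" "$demo/a/b/two.txt"
echo "needle" > "$demo/a/b/c/three.txt"

all=$(find "$demo" -name "*.txt" | wc -l)                     # every level
shallow=$(find "$demo" -maxdepth 1 -name "*.txt" | wc -l)     # starting directory only
two_levels=$(find "$demo" -maxdepth 2 -name "*.txt" | wc -l)  # one level of subdirectories
grep_hits=$(grep -rl "needle" "$demo" | wc -l)                # grep needs -r to recurse

echo "all=$all shallow=$shallow two_levels=$two_levels grep_hits=$grep_hits"
rm -rf "$demo"
```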
Search Optimization: Faster Searches
Search optimization makes your searches run faster.
Tips for Faster Searching
- Use locate for name searches (it's indexed!): `locate filename` instead of `find / -name "filename"`
- Limit search depth: `find . -maxdepth 3 -name "*.txt"`
- Search specific directories: `find /var/log -name "*.log"`, not `find / -name "*.log"`
- Use -prune to skip directories: `find . -path "./node_modules" -prune -o -name "*.js" -print`
- Batch with exec +: `find . -name "*.txt" -exec cat {} +` (faster than `\;`)
- Use xargs with parallel: `find . -name "*.gz" -print0 | xargs -0 -n 1 -P 4 gunzip` runs 4 processes at once; `-n 1` gives each process one file so the work is actually spread out
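The `-prune` tip is the least obvious of these, so here is a sketch of it on a scratch project (directory names are made up). It reads as: "if the path is node_modules, prune it; otherwise, if the name matches, print it".

```shell
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/node_modules/lib"
touch "$demo/src/app.js" "$demo/node_modules/lib/dep.js"

without_prune=$(find "$demo" -name "*.js" | wc -l)
with_prune=$(find "$demo" -path "$demo/node_modules" -prune -o -name "*.js" -print | wc -l)

echo "without_prune=$without_prune with_prune=$with_prune"
rm -rf "$demo"
```

The explicit `-print` matters: without it, find would also print the pruned directory itself.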
Part 2: Compression and Archives
Now that you can find files, let's learn to compress them and bundle them into archives.
File Compression: Making Things Smaller
File compression reduces the size of files by finding patterns and encoding them more efficiently. It's like finding a shorter way to say the same thing—instead of "the the the the the", you say "5x the".
Why Compress?
- Save disk space: Store more in less
- Faster transfers: Smaller files upload/download quicker
- Reduce bandwidth: Less data to send
- Organize backups: Combine many files into one
Types of Compression
| Type | Description | Use Case |
|---|---|---|
| Lossless | No data lost, perfect reconstruction | Text, code, documents |
| Lossy | Some data lost, smaller files | Images, audio, video |
Linux compression tools are typically lossless—you get back exactly what you put in!
The Gzip Command: Fast and Popular
The gzip command (GNU zip) is the most common compression tool on Linux. It's fast, effective, and works with a single file at a time.
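A minimal sketch of gzip in action on a scratch file filled with repetitive filler text. It assumes a gzip recent enough to support `-k`.

```shell
demo=$(mktemp -d)
yes "the quick brown fox" | head -n 5000 > "$demo/notes.txt"

gzip -k "$demo/notes.txt"            # writes notes.txt.gz; -k keeps the original
gzip -l "$demo/notes.txt.gz"         # compressed size, original size, ratio
gz_ok=no
gzip -t "$demo/notes.txt.gz" && gz_ok=yes   # -t verifies integrity

orig_size=$(wc -c < "$demo/notes.txt")
gz_size=$(wc -c < "$demo/notes.txt.gz")
echo "original=$orig_size bytes, compressed=$gz_size bytes"
rm -rf "$demo"
```

Without `-k`, gzip replaces the original file with the `.gz` version.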
Gzip Options
| Option | Purpose |
|---|---|
| `-k` | Keep original file |
| `-v` | Verbose (show compression ratio) |
| `-1` to `-9` | Compression level (1=fast, 9=best) |
| `-r` | Recursive (compress directory contents) |
| `-f` | Force (overwrite existing) |
| `-t` | Test integrity |
| `-l` | List compressed file info |
The Gunzip Command: Uncompress
The gunzip command decompresses .gz files back to their original form.
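A short round trip as a sketch, using a scratch file:

```shell
demo=$(mktemp -d)
printf 'hello archive\n' > "$demo/a.txt"
gzip "$demo/a.txt"                              # a.txt is replaced by a.txt.gz

gunzip -k "$demo/a.txt.gz"                      # restore a.txt and keep the .gz too
restored=$(cat "$demo/a.txt")

gunzip -c "$demo/a.txt.gz" > "$demo/copy.txt"   # -c writes to stdout instead
copy=$(cat "$demo/copy.txt")

echo "restored: $restored"
rm -rf "$demo"
```

`gzip -d` does the same job as `gunzip`, so use whichever you remember.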
Shortcut: zcat
Use zcat to view compressed files without writing a decompressed copy to disk.
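For instance, here is a sketch that peeks into a compressed log (the log lines are made up) and counts errors without ever creating an uncompressed file:

```shell
demo=$(mktemp -d)
printf 'INFO start\nERROR disk full\nINFO stop\n' | gzip > "$demo/app.log.gz"

zcat "$demo/app.log.gz" | head -n 1                    # peek at the first line
errors=$(zcat "$demo/app.log.gz" | grep -c "ERROR")    # search the contents

echo "errors=$errors"
rm -rf "$demo"
```

The gzip package also ships `zgrep` and `zless`, which wrap the same idea.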
The Bzip2 Command: Better Compression
The bzip2 command provides better compression than gzip, but it's slower. Use it when file size matters more than speed.
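The flags mirror gzip closely. A sketch on a scratch file, guarded because bzip2 is not installed on every system:

```shell
demo=$(mktemp -d)
yes "repeated log line" | head -n 5000 > "$demo/data.txt"
bz_state="skipped"

if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -k "$demo/data.txt"                  # writes data.txt.bz2, keeps the original
    bzip2 -t "$demo/data.txt.bz2"              # test integrity
    bzip2 -dc "$demo/data.txt.bz2" | head -n 1 # view without writing a file
    # Decompress to stdout and compare with the original.
    bzip2 -dc "$demo/data.txt.bz2" | cmp -s - "$demo/data.txt" && bz_state="roundtrip-ok"
fi

echo "bzip2: $bz_state"
rm -rf "$demo"
```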
Compression Comparison
| Algorithm | Extension | Speed | Compression | Best For |
|---|---|---|---|---|
| gzip | `.gz` | Fast | Good | Daily use, logs |
| bzip2 | `.bz2` | Medium | Better | Archives, distribution |
| xz | `.xz` | Slow | Best | Long-term storage |
The Xz Command: Maximum Compression
The xz command provides the best compression ratios but is the slowest. It's great for archiving files you rarely access.
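xz follows the same pattern as gzip and bzip2. A sketch on a scratch file, guarded in case xz is not installed:

```shell
demo=$(mktemp -d)
yes "repeated log line" | head -n 5000 > "$demo/data.txt"
xz_state="skipped"

if command -v xz >/dev/null 2>&1; then
    xz -k "$demo/data.txt"               # writes data.txt.xz, keeps the original
    xz -l "$demo/data.txt.xz"            # list sizes and ratio
    xz -t "$demo/data.txt.xz"            # test integrity
    # Decompress to stdout and compare with the original.
    xz -dc "$demo/data.txt.xz" | cmp -s - "$demo/data.txt" && xz_state="roundtrip-ok"
    # For the smallest output use: xz -9e file
    # To use all CPU cores use:    xz -T0 file
fi

echo "xz: $xz_state"
rm -rf "$demo"
```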
Patience Required
xz -9e on a large file can take a LONG time. The compression is amazing, but grab a coffee while you wait!
Compression Ratios: Understanding the Numbers
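Compression ratios tell you how much smaller your file got. This chapter quotes them as the percentage of space saved, which is also what `gzip -v` reports: 50% means the compressed file is half the original size, and 90% means it is one tenth.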
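You can work out the percentage saved yourself from the two file sizes. A sketch using a deliberately repetitive scratch log, which is why the saving comes out so high:

```shell
demo=$(mktemp -d)
yes "GET /index.html 200" | head -n 10000 > "$demo/access.log"
gzip -k "$demo/access.log"

orig=$(wc -c < "$demo/access.log")
comp=$(wc -c < "$demo/access.log.gz")

# saved % = (original - compressed) * 100 / original  (integer arithmetic)
saved=$(( (orig - comp) * 100 / orig ))
echo "original=$orig bytes, compressed=$comp bytes, saved=${saved}%"
rm -rf "$demo"
```

Real log files are less repetitive than this one, so expect lower numbers in practice.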
What Compresses Well?
| Content Type | Compresses Well? | Typical Ratio |
|---|---|---|
| Text files | Excellent | 70-90% |
| Log files | Excellent | 80-95% |
| Source code | Very good | 60-80% |
| HTML/CSS | Very good | 70-85% |
| PDF files | Poor | 5-15% |
| Images (PNG, JPG) | Very poor | 1-5% |
| Compressed files | None | 0% (might grow!) |
Don't Compress Compressed Files
Compressing a .jpg or .mp3 won't make it smaller—it might even get bigger! These formats are already compressed.
File Archiving: Bundling Files Together
File archiving combines multiple files into a single file. This is different from compression—archiving just bundles, compression makes it smaller.
The distinction matters:
- tar: Creates archives (bundles files)
- gzip/bzip2/xz: Compresses files
- tar + gzip: Archives AND compresses (`.tar.gz`)
The Tar Command: The Tape Archiver
The tar command (tape archiver) is the standard tool for creating archives on Unix systems. Despite the name referencing ancient tape drives, it's still the go-to tool today!
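tar has three everyday modes: create, list, and extract. A minimal sketch of all three on a scratch directory (names are made up):

```shell
demo=$(mktemp -d)
mkdir "$demo/project" "$demo/out"
echo "readme" > "$demo/project/README"

tar -cf "$demo/project.tar" -C "$demo" project    # c = create, f = archive file name
listing=$(tar -tf "$demo/project.tar")            # t = list the contents
tar -xf "$demo/project.tar" -C "$demo/out"        # x = extract (here into out/)

echo "$listing"
extracted=$(cat "$demo/out/project/README")
rm -rf "$demo"
```

`-C "$demo"` tells tar to change into that directory first, so the archive stores the short path `project/README` rather than a long absolute one.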
Tar Create: Making Archives
Tar create packages files and directories into a single .tar file.
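A sketch of creating compressed archives, with and without an excluded directory. It assumes GNU tar, and the directory names are made up.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/site/cache"
echo "<h1>hi</h1>" > "$demo/site/index.html"
echo "junk" > "$demo/site/cache/tmp.dat"

# z = compress with gzip while archiving
tar -czf "$demo/site.tar.gz" -C "$demo" site

# --exclude skips matching paths; put it before the names being archived
tar -czf "$demo/slim.tar.gz" -C "$demo" --exclude="site/cache" site

full_count=$(tar -tzf "$demo/site.tar.gz" | grep -c "cache" || true)
slim_count=$(tar -tzf "$demo/slim.tar.gz" | grep -c "cache" || true)
echo "cache entries: full=$full_count slim=$slim_count"
rm -rf "$demo"
```

Swap `z` for `j` (bzip2) or `J` (xz) to change the compressor, provided that tool is installed.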
Tar Extract: Unpacking Archives
Tar extract unpacks archive contents.
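A sketch of the common extraction variants: look first, extract everything into a chosen directory, or pull out a single file. All names are made up.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/src/docs" "$demo/all" "$demo/one"
echo "guide" > "$demo/src/docs/guide.txt"
echo "code"  > "$demo/src/main.c"
tar -czf "$demo/src.tar.gz" -C "$demo" src

tar -tzf "$demo/src.tar.gz"                              # look before you unpack
tar -xzf "$demo/src.tar.gz" -C "$demo/all"               # everything, into all/
tar -xzf "$demo/src.tar.gz" -C "$demo/one" src/main.c    # just one member

all_count=$(find "$demo/all" -type f | wc -l)
one_count=$(find "$demo/one" -type f | wc -l)
echo "all=$all_count one=$one_count"
rm -rf "$demo"
```

When extracting a single member, the name must match what the listing shows, including its directory prefix.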
Tar Options: The Important Flags
Tar options control how tar behaves. Here are the essential ones:
| Option | Meaning |
|---|---|
| `-c` | Create archive |
| `-x` | Extract archive |
| `-t` | List (table of contents) |
| `-v` | Verbose (show files being processed) |
| `-f` | File (followed by filename) |
| `-z` | Use gzip compression |
| `-j` | Use bzip2 compression |
| `-J` | Use xz compression |
| `-p` | Preserve permissions |
| `-C` | Change to directory before operation |
| `--exclude` | Exclude patterns |
The tar.gz Dance (Memory Trick)
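One common mnemonic, which may not be the exact wording this chapter originally used, is to read the flag letters as words. The comments in this sketch spell it out, and the commands below them run the full round trip on a scratch directory.

```shell
# tar -czvf  ->  Create  Zipped  Verbose  File   (pack)
# tar -xzvf  ->  eXtract Zipped  Verbose  File   (unpack)
# tar -tzvf  ->  lisT    Zipped  Verbose  File   (peek inside)
# The archive name must come right after f.

demo=$(mktemp -d)
mkdir "$demo/folder" "$demo/restore"
echo "payload" > "$demo/folder/file.txt"

tar -czvf "$demo/archive.tar.gz" -C "$demo" folder
tar -xzvf "$demo/archive.tar.gz" -C "$demo/restore"

roundtrip=$(cat "$demo/restore/folder/file.txt")
rm -rf "$demo"
```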
Archive Formats: Know Your Extensions
Archive formats differ in their capabilities and compatibility:
| Extension | Format | Notes |
|---|---|---|
| `.tar` | Tar only | No compression |
| `.tar.gz`, `.tgz` | Tar + gzip | Most common on Linux |
| `.tar.bz2`, `.tbz2` | Tar + bzip2 | Better compression |
| `.tar.xz`, `.txz` | Tar + xz | Best compression |
| `.zip` | Zip | Cross-platform standard |
| `.7z` | 7-Zip | Excellent compression |
| `.rar` | RAR | Windows common, proprietary |
The Zip Command: Cross-Platform Archives
The zip command creates archives compatible with Windows, macOS, and Linux. It compresses and archives in one step!
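A sketch of zipping a folder, with and without an excluded subdirectory. It is guarded because zip and unzip are separate packages that may not be installed, and the names are made up.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/photos/raw"
echo "a" > "$demo/photos/a.txt"
echo "b" > "$demo/photos/raw/b.txt"
zip_state="skipped"

if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    # -r recurses into subdirectories, -q keeps the output quiet
    (cd "$demo" && zip -r -q full.zip photos)
    # -x excludes anything matching the pattern (quote it!)
    (cd "$demo" && zip -r -q slim.zip photos -x "photos/raw/*")

    in_full=$(unzip -l "$demo/full.zip" | grep -c "b.txt" || true)
    in_slim=$(unzip -l "$demo/slim.zip" | grep -c "b.txt" || true)
    [ "$in_full" -eq 1 ] && [ "$in_slim" -eq 0 ] && zip_state="ok"
fi

echo "zip: $zip_state"
rm -rf "$demo"
```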
Zip Options
| Option | Purpose |
|---|---|
| `-r` | Recursive (include subdirectories) |
| `-u` | Update (add new/modified files) |
| `-m` | Move (delete originals after zipping) |
| `-e` | Encrypt with password |
| `-0` to `-9` | Compression level |
| `-x` | Exclude pattern |
| `-s size` | Split archive |
The Unzip Command: Extract Zip Archives
The unzip command extracts .zip archives.
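A sketch of the most common unzip operations: list, test, and extract into a chosen directory. It is guarded because it needs both zip (to build the sample archive) and unzip.

```shell
demo=$(mktemp -d)
mkdir "$demo/docs"
echo "chapter one" > "$demo/docs/ch1.txt"
unzip_state="skipped"

if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    (cd "$demo" && zip -r -q docs.zip docs)      # sample archive for the demo

    unzip -l "$demo/docs.zip"                    # list contents without extracting
    unzip -t -q "$demo/docs.zip"                 # test archive integrity
    unzip -q "$demo/docs.zip" -d "$demo/out"     # -d chooses the destination directory
    unzip -o -q "$demo/docs.zip" -d "$demo/out"  # -o overwrites without prompting

    [ "$(cat "$demo/out/docs/ch1.txt")" = "chapter one" ] && unzip_state="ok"
fi

echo "unzip: $unzip_state"
rm -rf "$demo"
```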
The 7zip Command: Maximum Power
The 7zip command (7z) offers excellent compression and supports many formats.
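7z uses single-letter subcommands rather than dash flags. A sketch of the basic round trip, guarded because the `7z` binary is often not installed by default (and some systems name it differently):

```shell
demo=$(mktemp -d)
mkdir "$demo/docs"
echo "text" > "$demo/docs/a.txt"
sevenz_state="skipped"

if command -v 7z >/dev/null 2>&1; then
    (cd "$demo" && 7z a backup.7z docs >/dev/null)        # a = add files to an archive
    7z l "$demo/backup.7z" >/dev/null                     # l = list contents
    7z t "$demo/backup.7z" >/dev/null                     # t = test integrity
    7z x "$demo/backup.7z" -o"$demo/out" >/dev/null       # x = extract with full paths
    [ -f "$demo/out/docs/a.txt" ] && sevenz_state="roundtrip-ok"
fi

echo "7z: $sevenz_state"
rm -rf "$demo"
```

Note that `-o` takes its directory with no space in between.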
7zip Supports Many Formats
Real-World Case Study: Image Optimization for Web Books
Here's a practical example that combines everything you've learned. This workflow finds large images in a book project and compresses them to web-appropriate sizes.
The Problem
When creating online books (like MkDocs sites), images from cameras or design tools are often too large. A 4MB PNG file makes your website slow and wastes bandwidth. The goal: shrink images larger than 300KB down to approximately 300KB while maintaining quality.
The Shell Wrapper Script
This bash script validates the environment and calls a Python compression tool. It demonstrates:
- Environment variable validation (`BK_HOME`)
- Dependency checking (Python, Pillow library)
- Color-coded output for user feedback
- Argument forwarding to the Python script
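The four points above can be sketched as follows. This is a reconstruction under stated assumptions, not the original script: the function names, the messages, and the Python script location (`$BK_HOME/src/compress_images.py`) are all hypothetical.

```shell
# Color codes for user feedback
RED='\033[0;31m'; GREEN='\033[0;32m'; NC='\033[0m'

fail() { printf "${RED}Error: %s${NC}\n" "$1" >&2; return 1; }

check_env() {
    # Environment variable validation
    [ -n "${BK_HOME:-}" ] || { fail "BK_HOME is not set"; return 1; }
    [ -d "$BK_HOME" ]     || { fail "BK_HOME is not a directory: $BK_HOME"; return 1; }

    # Dependency checking: Python itself, then the Pillow library (imported as PIL)
    command -v python3 >/dev/null 2>&1 || { fail "python3 not found"; return 1; }
    python3 -c "import PIL" 2>/dev/null || { fail "Pillow missing (pip install Pillow)"; return 1; }

    printf "${GREEN}Environment OK${NC}\n"
}

run_tool() {
    check_env || return 1
    # Argument forwarding: "$@" passes every argument through unchanged.
    python3 "$BK_HOME/src/compress_images.py" "$@"   # hypothetical script path
}
```

Keeping the checks in a function means the wrapper fails early with a clear message instead of a Python traceback.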
The Python Compression Tool
The Python script does the heavy lifting. Key features:
- Find large images: Scans for files over 500KB
- Iterative compression: Resizes progressively until target size is reached
- Format conversion: Converts JPG to PNG, preserves transparency
- Backup creation: Saves originals with `.backup` extension
- EXIF handling: Auto-rotates based on camera orientation
Using the Tool
How It Works
- Find: Uses `os.walk()` to recursively find images over 500KB
- Sort: Processes largest files first
- Iterate: Tries progressively smaller scale factors (0.9, 0.8, 0.7...)
- Check: After each resize, checks if file is under 300KB
- Stop: When target is achieved, moves to next image
- Report: Provides summary of savings
This is the power of combining shell scripts with Python—the shell handles environment validation and user feedback, while Python handles the complex image processing!
Quick Reference: Search and Compression
Search Cheat Sheet
| Task | Command |
|---|---|
| Find by name | find . -name "*.txt" |
| Find large files | find . -size +100M |
| Find recent files | find . -mtime -7 |
| Find and delete | find . -name "*.tmp" -delete |
| Fast name search | locate filename |
| Find command | which python |
Compression Cheat Sheet
| Task | Command |
|---|---|
| Compress with gzip | gzip file.txt |
| Decompress gzip | gunzip file.txt.gz |
| Create tar.gz | tar -czvf archive.tar.gz folder/ |
| Extract tar.gz | tar -xzvf archive.tar.gz |
| Create zip | zip -r archive.zip folder/ |
| Extract zip | unzip archive.zip |
Key Takeaways
You're now a search and compression master!
- find: The Swiss Army knife of file searching
- locate: Lightning-fast name searches using a database
- which/whereis/type: Find commands and their locations
- gzip/bzip2/xz: Compress single files (speed vs. size trade-off)
- tar: Creates archives; combine with compression for `.tar.gz`
- zip: Cross-platform archives with built-in compression
Find and Compress Like a Pro!
You can now find any file on your system in seconds, save disk space, and create organized archives. These skills will save you hours of hunting and gigabytes of storage!
What's Next?
Congratulations on completing this chapter! You now have the tools to find anything on your system and manage files efficiently. Keep practicing these commands—they become second nature with use!
Quick Quiz: Search and Compression
- What command finds all .log files modified in the last week?
- Why is locate faster than find?
- How do you run a command on every file that find discovers?
- What's the difference between gzip and tar?
- How would you create a compressed archive of a folder?
- What does the `-9` option do for compression tools?
- How would you find the largest files on your system?
Quiz Answers
- `find . -name "*.log" -mtime -7`
- locate searches a pre-built database instead of scanning the filesystem
- Use `-exec`: `find . -name "*.txt" -exec command {} \;`
- gzip compresses a single file; tar bundles multiple files into one archive (no compression by itself)
- `tar -czvf archive.tar.gz folder/`
- Maximum compression (slowest but smallest files)
- `find / -type f -exec du -h {} + 2>/dev/null | sort -rh | head -20`