Advanced Text Processing: Sed, Awk, and Pipes
Summary
This chapter covers advanced text processing with sed for stream editing and awk for field-based processing. You'll also master the UNIX philosophy in action: connecting commands with pipes, understanding text streams (stdin, stdout, stderr), and using redirection operators. These skills enable you to build powerful data processing pipelines.
Concepts Covered
This chapter covers the following 21 concepts from the learning graph:
- Sed Command
- Sed Substitution
- Awk Command
- Awk Fields
- Awk Patterns
- Text Streams
- Standard Input
- Standard Output
- Standard Error
- Redirection
- Output Redirection
- Input Redirection
- Append Redirection
- Error Redirection
- Pipe Operator
- Pipeline Commands
- Xargs Command
- Tee Command
- Tr Command
- Rev Command
- Fold Command
Prerequisites
This chapter builds on concepts from:
- Chapter 5: File Operations and Manipulation
- Chapter 8: Text Processing with Grep and Regular Expressions
The UNIX Philosophy: Small Tools, Big Power
Here's a secret that made UNIX (and Linux) legendary: instead of building giant programs that try to do everything, UNIX developers created small, focused tools that each do ONE thing really well. Then they invented a way to connect these tools together—like LEGO bricks—to build whatever you need!
This chapter teaches you:
- Streams: How data flows between programs
- Redirection: Sending output to files instead of the screen
- Pipes: Connecting commands together
- Sed: A stream editor for transforming text
- Awk: A powerful language for processing structured data
By the end, you'll be building pipeline commands that process text like a pro. These skills are what separate Linux beginners from power users!
Text Streams: The Rivers of Data
In Linux, data flows through programs like water through pipes. These flows are called text streams, and every program has three of them by default.
Standard Input (stdin)
Standard input (stdin) is where a program reads its input. By default, this is your keyboard—when you type, you're sending data to stdin.
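For instance, many commands fall back to reading stdin when you don't give them a filename:

```sh
# With no filename, cat reads from stdin: type some lines,
# then press Ctrl-D to signal end of input
cat

# wc -l also reads stdin when given no file: it counts the lines you type
wc -l
```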
Standard Output (stdout)
Standard output (stdout) is where a program writes its normal output. By default, this goes to your terminal screen.
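For example:

```sh
# echo writes its argument to stdout, which by default is your screen
echo "Hello, world"

# ls writes the directory listing to stdout
ls
```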
Standard Error (stderr)
Standard error (stderr) is where programs write error messages. It's separate from stdout so you can handle errors differently from regular output.
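For example, asking ls for a file that doesn't exist produces a message on stderr, not stdout:

```sh
ls /no/such/file
# ls: cannot access '/no/such/file': No such file or directory
# (exact wording varies by system; the point is it arrives via stderr)
```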
Why Separate Streams?
Having separate streams is brilliant:
- You can save normal output to a file while still seeing errors on screen
- You can discard errors while keeping the good output
- You can log errors to a separate file
- Pipes only connect stdout, so errors still show up for debugging
| Stream | Number | Default | Purpose |
|---|---|---|---|
| stdin | 0 | Keyboard | Input to program |
| stdout | 1 | Screen | Normal output |
| stderr | 2 | Screen | Error messages |
Diagram: Text Streams Visualization
Understanding stdin, stdout, and stderr
Type: diagram
Bloom Taxonomy: Understand
Learning Objective: Visualize how the three standard streams connect programs to input/output devices and how they can be redirected.
Layout: Central program box with three streams flowing in/out
Visual elements:
- Central box: "Program" (e.g., grep)
- Left side: stdin (stream 0) flowing IN
  - Default source: keyboard icon
  - Can be redirected from: file icon
- Right top: stdout (stream 1) flowing OUT
  - Default destination: terminal/screen icon
  - Can be redirected to: file icon
- Right bottom: stderr (stream 2) flowing OUT
  - Default destination: terminal/screen icon
  - Can be redirected to: different file icon
Stream labels:
- stdin (0): Blue arrow, "Input"
- stdout (1): Green arrow, "Output"
- stderr (2): Red arrow, "Errors"
Animation:
- Show data packets flowing through streams
- Demonstrate redirection by arrows changing targets
- Show pipe connecting stdout to another program's stdin
Interactive features:
- Click to toggle between default and redirected states
- Hover over streams for explanation
Color scheme:
- stdin: Blue
- stdout: Green
- stderr: Red
- Program box: Gray
- Files: Yellow/orange
Implementation: p5.js
Redirection: Controlling Where Data Goes
Redirection lets you change where stdin comes from and where stdout/stderr go to. Instead of keyboard input and screen output, you can use files!
Output Redirection (>)
Output redirection sends stdout to a file instead of the screen. The > operator creates a new file (or overwrites an existing one):
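A quick sketch (file names like `listing.txt` and `notes.txt` are just examples):

```sh
# Save a directory listing to a file instead of printing it
ls -l > listing.txt

# Careful: > overwrites! Only the second line survives here
echo "first" > notes.txt
echo "second" > notes.txt
cat notes.txt    # prints: second
```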
Append Redirection (>>)
Append redirection adds to the end of a file instead of overwriting:
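For example:

```sh
echo "first" > log.txt      # create (or overwrite) the file
echo "second" >> log.txt    # append to the end
echo "third" >> log.txt     # append again
cat log.txt                 # all three lines are there
```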
Input Redirection (<)
Input redirection makes a program read from a file instead of the keyboard:
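For example (`data.txt` and `names.txt` are placeholder files):

```sh
# wc counts lines arriving on stdin; < feeds it the file
wc -l < data.txt

# sort reads the file through stdin instead of opening it by name
sort < names.txt
```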
Error Redirection (2>)
Error redirection sends stderr to a file. Remember, stderr is stream number 2:
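For example:

```sh
# Normal output still hits the screen; errors land in errors.txt
ls /etc /no/such/dir 2> errors.txt
cat errors.txt

# 2>> appends errors instead of overwriting
ls /another/bad/path 2>> errors.txt
```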
Combining Redirections
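A few common combinations (`output.txt`, `errors.txt`, and `all.txt` are example names):

```sh
# stdout to one file, stderr to another
ls /etc /no/such/dir > output.txt 2> errors.txt

# Both streams into the same file -- two equivalent spellings
ls /etc /no/such/dir > all.txt 2>&1
ls /etc /no/such/dir &> all.txt    # bash shorthand
```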
The Black Hole: /dev/null
/dev/null is a special file that discards everything written to it. It's like a black hole for data! Use it when you want to run a command but don't care about its output:
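For example:

```sh
# Search the whole filesystem, silently dropping "Permission denied" errors
find / -name "*.conf" 2> /dev/null

# Discard ALL output when you only care whether the command succeeded
ping -c 1 example.com > /dev/null 2>&1 && echo "network is up"
```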
Redirection Summary
| Operator | Name | Effect |
|---|---|---|
| `>` | Output redirect | Write stdout to file (overwrite) |
| `>>` | Append redirect | Write stdout to file (append) |
| `<` | Input redirect | Read stdin from file |
| `2>` | Error redirect | Write stderr to file (overwrite) |
| `2>>` | Error append | Write stderr to file (append) |
| `&>` | All redirect | Write stdout AND stderr to file |
| `2>&1` | Stderr to stdout | Send stderr where stdout goes |
The Pipe Operator: Connecting Commands
The pipe operator (|) is where the magic happens! It connects the stdout of one command to the stdin of another, creating a pipeline.
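For example:

```sh
# Count files in the current directory: ls's stdout becomes wc's stdin
ls | wc -l

# Show only processes whose listing mentions "bash"
ps aux | grep bash

# Page through long output
ls -l /etc | less
```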
Think of it like actual plumbing: the output "flows" from one command into the next!
Pipeline Commands: Building Data Processing Chains
Pipeline commands let you combine simple tools to do complex tasks:
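A sketch of a multi-stage pipeline (`access.log` is a hypothetical web server log):

```sh
grep "404" access.log | sort | uniq -c | sort -rn | head -5
# 1. grep keeps only lines containing 404
# 2. sort groups identical lines together
# 3. uniq -c collapses duplicates and counts them
# 4. sort -rn orders by count, highest first
# 5. head -5 keeps the top five
```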
Real Pipeline Examples
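A few pipelines you can try (`essay.txt` is a placeholder):

```sh
# Most common login shells on the system
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

# Count the unique words in a file
tr ' ' '\n' < essay.txt | sort -u | wc -l

# Five most recently modified files in the current directory
# (head -6 because ls -l's first line is the "total" header)
ls -lt | head -6
```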
Pipelines Don't Include stderr
By default, pipes only pass stdout. Error messages still go to your screen, which is usually what you want for debugging. To pipe stderr too, use `2>&1 |` or bash's `|&`.
Diagram: Pipeline Flow
Understanding Command Pipelines
Type: diagram
Bloom Taxonomy: Understand, Apply
Learning Objective: Visualize how data flows through a pipeline of commands, with each command's output becoming the next command's input.
Layout: Horizontal chain of connected command boxes
Example pipeline: cat file.txt | grep "error" | sort | uniq -c
Visual elements:
1. Source (file icon): file.txt
2. Command boxes connected by pipe symbols:
   - cat: Reads file, outputs all lines
   - grep "error": Filters to error lines only
   - sort: Alphabetizes lines
   - uniq -c: Counts unique lines
3. Final output (screen icon): Results
Between each stage, show:
- Pipe symbol (|)
- Data preview (sample lines at that stage)
- Line count (how many lines at each stage)
Example data flow:
- cat output: 1000 lines (all content)
- grep output: 23 lines (only errors)
- sort output: 23 lines (alphabetized)
- uniq -c output: 5 lines (unique errors with counts)
Animation:
- Data "packets" flowing left to right
- Each command box lights up as it processes
- Output appears on right side
Interactive features:
- Click each command to see its effect
- Show intermediate output at each stage
- Toggle different example pipelines
Color scheme:
- Commands: Blue boxes
- Data flow: Green arrows
- Pipe symbols: Gray
- Filtered data: Yellow highlight
Implementation: p5.js
The Tee Command: Output to File AND Screen
The tee command is like a plumbing T-junction: it sends output to BOTH a file AND stdout (so the pipeline can continue):
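For example (`app.log` and the other file names are placeholders):

```sh
# See the listing on screen AND save it to a file
ls -l | tee listing.txt

# Snapshot intermediate results in the middle of a pipeline
grep "error" app.log | tee errors.txt | wc -l

# -a appends instead of overwriting
echo "new entry" | tee -a log.txt
```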
Use tee when you want to:
- Debug a pipeline by saving intermediate results
- Watch progress while logging
- Send output to multiple destinations
The Tr Command: Translate Characters
The tr command (translate) transforms characters—replacing, deleting, or squeezing them:
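Some representative uses:

```sh
# Translate lowercase to uppercase
echo "hello world" | tr 'a-z' 'A-Z'       # HELLO WORLD

# Replace spaces with newlines: one word per line
echo "one two three" | tr ' ' '\n'

# -d deletes characters
echo "hello 123 world" | tr -d '0-9'      # hello  world

# -s squeezes repeated characters into one
echo "too    many    spaces" | tr -s ' '  # too many spaces
```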
Tr Character Classes
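Character classes slot in anywhere a range would:

```sh
echo "Hello World" | tr '[:lower:]' '[:upper:]'   # HELLO WORLD
echo "Call 555-1234 now" | tr -d '[:digit:]'      # Call - now
```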
| Class | Matches |
|---|---|
| `[:lower:]` | Lowercase letters |
| `[:upper:]` | Uppercase letters |
| `[:digit:]` | Digits 0-9 |
| `[:alpha:]` | All letters |
| `[:alnum:]` | Letters and digits |
| `[:space:]` | Whitespace |
| `[:punct:]` | Punctuation |
The Rev Command: Reverse Lines
The rev command reverses each line character by character:
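For example (`names.txt` is a placeholder file):

```sh
echo "hello" | rev    # olleh
rev names.txt         # reverses every line of the file
```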
While it seems simple, rev can be useful in pipelines:
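One classic trick: grab the LAST field of a line when the number of fields varies, by reversing, cutting the first field, and reversing back:

```sh
echo "/usr/local/bin/python" | rev | cut -d/ -f1 | rev    # python
```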
The Fold Command: Wrap Long Lines
The fold command wraps long lines at a specified width:
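For example (`essay.txt` is a placeholder):

```sh
# Hard-wrap lines at 40 characters
fold -w 40 essay.txt

# -s breaks at spaces so words aren't chopped in half
fold -s -w 40 essay.txt
```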
Useful for:
- Formatting text for display
- Preparing text for printing
- Making files easier to read in terminals
The Xargs Command: Build Commands from Input
The xargs command takes input and turns it into arguments for another command. It's incredibly powerful for batch operations!
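For example:

```sh
# find prints matching filenames; xargs hands them to rm as arguments
find . -name "*.tmp" | xargs rm

# -n limits how many arguments go to each invocation
echo "a b c d" | xargs -n 2 echo
# a b
# c d
```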
Safe Xargs with -0
When filenames contain spaces or special characters, use `-print0` with find and `-0` with xargs:
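For example:

```sh
# -print0 and -0 separate names with a NUL byte, which can never
# appear inside a filename -- so spaces and newlines are safe
find . -name "*.tmp" -print0 | xargs -0 rm
```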
Xargs with Custom Commands
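With `-I` you choose where each input item lands in the command. `{}` is the conventional placeholder; `backup/` and `hosts.txt` are hypothetical names:

```sh
# Copy every .txt file into backup/
ls *.txt | xargs -I {} cp {} backup/

# Run a command once per line of input
cat hosts.txt | xargs -I {} ping -c 1 {}
```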
The Sed Command: Stream Editor
The sed command (stream editor) transforms text as it flows through. It's like find-and-replace on steroids!
Sed Substitution: Find and Replace
Sed substitution is the most common sed operation. The basic syntax is s/pattern/replacement/:
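For example (`draft.txt` is a placeholder):

```sh
# Replace the FIRST occurrence on each line
echo "the cat sat on the cat mat" | sed 's/cat/dog/'
# the dog sat on the cat mat

# The g flag replaces ALL occurrences
echo "the cat sat on the cat mat" | sed 's/cat/dog/g'
# the dog sat on the dog mat

# Works on files too; output goes to stdout, the file is untouched
sed 's/colour/color/g' draft.txt
```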
Sed Substitution Patterns
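A few useful variations (note: the `I` flag is a GNU sed extension):

```sh
# Case-insensitive replacement (GNU sed)
echo "Error ERROR error" | sed 's/error/warning/gI'

# Regular expressions work in the pattern
echo "item 123" | sed 's/[0-9][0-9]*/NUM/'    # item NUM

# & stands for whatever the pattern matched
echo "42" | sed 's/[0-9]*/[&]/'               # [42]

# Any delimiter works -- handy when the pattern contains slashes
echo "/usr/local" | sed 's|/usr|/opt|'        # /opt/local
```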
More Sed Operations
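Beyond substitution, sed has commands to delete and print lines. A sampler (`config.txt` and `file.txt` are placeholders):

```sh
# d deletes matching lines
sed '/^#/d' config.txt     # strip comment lines
sed '/^$/d' file.txt       # strip blank lines

# p prints; combined with -n, ONLY printed lines appear
sed -n '1,5p' file.txt     # show lines 1-5 only

# -e chains multiple commands in one pass
sed -e 's/foo/bar/g' -e '/^$/d' file.txt
```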
Sed Address Ranges
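Addresses in front of a command restrict which lines it touches:

```sh
sed '3s/old/new/' file.txt       # substitute only on line 3
sed '1,10s/old/new/' file.txt    # lines 1 through 10
sed '5,$d' file.txt              # delete from line 5 to end of file
sed '/START/,/END/d' file.txt    # delete between two matching lines
```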
Practical Sed Examples
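Some real-world one-liners (file names are placeholders):

```sh
# Remove trailing whitespace from every line
sed 's/[[:space:]]*$//' file.txt

# Comment out every line that mentions "debug"
sed '/debug/s/^/# /' config.txt

# Double-space a file (G appends a blank line after each line)
sed G file.txt
```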
Test Before Editing In-Place
Always test your sed command WITHOUT -i first to see the output. Once you're sure it's correct, add -i to modify the file:
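For example:

```sh
sed 's/old/new/g' config.txt          # 1. preview the result on screen
sed -i 's/old/new/g' config.txt       # 2. apply it for real

# Or let sed keep a backup automatically:
sed -i.bak 's/old/new/g' config.txt   # original saved as config.txt.bak
```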
The Awk Command: Pattern Scanning and Processing
The awk command is actually a complete programming language designed for text processing! It's especially powerful for working with structured data (like CSV files or log files).
Awk Fields: Column-Based Processing
Awk fields are columns of text, separated by whitespace (by default). Awk automatically splits each line into fields:
- `$0` = the entire line
- `$1` = first field
- `$2` = second field
- And so on...
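For example:

```sh
# Given the line: alice engineering 75000
echo "alice engineering 75000" | awk '{print $1}'        # alice
echo "alice engineering 75000" | awk '{print $3}'        # 75000
echo "alice engineering 75000" | awk '{print $1, $3}'    # alice 75000

# A classic: just the usernames from who's output
who | awk '{print $1}'
```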
Custom Field Separator
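The `-F` option sets the field separator, which is essential for files like /etc/passwd (colon-separated) or CSVs (`data.csv` is a placeholder):

```sh
awk -F: '{print $1}' /etc/passwd        # just the usernames
awk -F: '{print $1, $7}' /etc/passwd    # username and login shell
awk -F, '{print $2}' data.csv           # second column of a CSV
```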
Awk Patterns: Conditional Processing
Awk patterns let you process only lines that match a condition:
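For example (`app.log` and `salaries.txt` are placeholders):

```sh
# Regex pattern: only lines containing "error"
awk '/error/ {print $0}' app.log

# Comparison pattern: users with UID 1000 or higher
awk -F: '$3 >= 1000 {print $1}' /etc/passwd

# Combined conditions
awk '$2 == "engineering" && $3 > 50000 {print $1}' salaries.txt
```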
Awk Built-in Variables
| Variable | Meaning |
|---|---|
| `NR` | Number of Records (current line number) |
| `NF` | Number of Fields in the current line |
| `$0` | Entire current line |
| `$1`, `$2`... | Individual fields |
| `FS` | Field Separator |
| `RS` | Record Separator |
| `OFS` | Output Field Separator |
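A few of these in action (file names are placeholders):

```sh
# Number every line, like cat -n
awk '{print NR, $0}' file.txt

# Print the LAST field of each line, however many there are
awk '{print $NF}' file.txt

# Flag lines that don't have exactly 5 fields
awk 'NF != 5 {print NR": "$0}' data.txt
```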
Awk BEGIN and END
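BEGIN runs before any input is read and END runs after the last line, which makes them perfect for headers and totals (`sales.txt` is a placeholder):

```sh
# Sum column 3 and report at the end
awk 'BEGIN {print "--- report ---"} {total += $3} END {print "total:", total}' sales.txt

# Count lines the awk way
awk 'END {print NR}' file.txt
```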
Practical Awk Examples
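A handful of everyday one-liners (file names are placeholders):

```sh
# Sum a column of numbers
awk '{sum += $1} END {print sum}' numbers.txt

# Average of column 3
awk '{sum += $3; n++} END {print sum/n}' data.txt

# Print lines longer than 80 characters
awk 'length($0) > 80' file.txt

# Total memory percentage per process owner
ps aux | awk 'NR > 1 {mem[$1] += $4} END {for (u in mem) print u, mem[u]"%"}'
```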
Diagram: Awk Field Processing
Understanding Awk Fields
Type: microsim
Bloom Taxonomy: Apply, Analyze
Learning Objective: Allow students to visualize how awk splits lines into fields and select specific columns.
Canvas layout (responsive, ~750px max width):
- Top section (80px): Input line with field separators marked
- Middle section (200px): Field boxes showing $1, $2, $3, etc.
- Bottom section (100px): Output based on selected fields
Visual elements:
- Input line: "john:x:1000:1000:John Doe:/home/john:/bin/bash"
- Field separator dropdown: (space, :, ,, tab)
- Field boxes highlighting each parsed field
- Output preview
Interactive controls:
- Dropdown: Select field separator
- Checkboxes: Select which fields to output ($1, $2, etc.)
- Text input: Custom awk command
- Sample input selector: (passwd file, CSV data, ps output)
Sample data options:
1. /etc/passwd format: john:x:1000:1000:John Doe:/home/john:/bin/bash
2. CSV format: Alice,Engineering,75000,Seattle
3. ps aux format: root 1 0.0 0.1 169836 13460 ? Ss Dec01 0:11 /sbin/init
Behavior:
- Changing separator re-parses the line
- Selected fields appear in output
- Show NF value
- Show generated awk command
Color scheme:
- $1: Red
- $2: Orange
- $3: Yellow
- $4: Green
- $5: Blue
- $6: Purple
- Field separator: Gray highlighted
Implementation: p5.js
Putting It All Together: Text Processing Workflows
Let's combine everything to solve real problems!
Workflow 1: Log Analysis
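One possible version, assuming an `access.log` whose first column is the client IP:

```sh
# Top 10 IP addresses hitting the server
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
```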
Workflow 2: CSV Processing
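One possible version, assuming an `employees.csv` with columns name, department, salary:

```sh
# Average salary per department
awk -F, '{sum[$2] += $3; n[$2]++} END {for (d in sum) print d, sum[d]/n[d]}' employees.csv
```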
Workflow 3: Text Cleanup
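One possible version (`messy.txt` is a placeholder):

```sh
# Drop blank lines, squeeze runs of spaces, trim trailing
# whitespace, and lowercase everything
sed '/^$/d' messy.txt | tr -s ' ' | sed 's/[[:space:]]*$//' | tr 'A-Z' 'a-z' > clean.txt
```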
Workflow 4: System Information
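One possible version:

```sh
# Five most memory-hungry processes (%MEM is column 4 in ps aux)
ps aux | sort -k4 -rn | head -5 | awk '{print $4"%", $11}'

# Disk usage per filesystem, fullest first
df -h | awk 'NR > 1 {print $5, $6}' | sort -rn
```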
Sed and Awk Cheat Sheet
Sed Quick Reference
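The forms you'll reach for most often:

```sh
sed 's/old/new/' file         # replace first match on each line
sed 's/old/new/g' file        # replace every match
sed '/pattern/d' file         # delete matching lines
sed -n '/pattern/p' file      # print ONLY matching lines
sed '5d' file                 # delete line 5
sed -n '1,10p' file           # print lines 1-10
sed -i 's/old/new/g' file     # edit the file in place
sed -e 'cmd1' -e 'cmd2' file  # run multiple commands
```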
Awk Quick Reference
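And the awk equivalents:

```sh
awk '{print $1}' file                    # first field
awk '{print $NF}' file                   # last field
awk -F: '{print $1}' file                # custom field separator
awk '/pattern/' file                     # lines matching a pattern
awk '$3 > 100' file                      # lines where field 3 > 100
awk '{print NR, $0}' file                # number the lines
awk 'NF' file                            # drop blank lines
awk '{sum += $1} END {print sum}' file   # sum a column
```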
Key Takeaways
You've learned the power of UNIX text processing!
- Text streams: stdin, stdout, stderr—the three channels of data flow
- Redirection: `>`, `>>`, `<`, and `2>` control where data goes
- Pipes: `|` connects commands into powerful pipelines
- Sed: Stream editor for find/replace and text transformation
- Awk: Field-based processing with pattern matching
- Tr: Character translation and deletion
- Xargs: Convert input to command arguments
- Tee: Split output to file and screen
- Rev/Fold: Reverse and wrap text
You're a Text Processing Master!
These tools embody the UNIX philosophy: small, focused programs combined to do big things. Practice building pipelines, and you'll find elegant solutions to problems that would take dozens of lines of code in other languages!
What's Next?
Now that you can process text like a pro, it's time to learn about text editors! Next chapter covers nano and vim—the editors you'll use to write scripts, edit configs, and work on code.
Quick Quiz: Sed, Awk, and Pipes
- What does `>` do differently than `>>`?
- What stream number is stderr?
- How do you print the third field in awk?
- What does `sed 's/old/new/g'` do?
- What does the `tee` command do?
- How do you make sed edit a file in place?
Quiz Answers
- `>` overwrites the file; `>>` appends to the file
- stderr is stream 2
- `awk '{print $3}'` prints the third field
- Replaces ALL occurrences of "old" with "new" on each line (g = global)
- tee sends output to BOTH a file AND stdout (continues the pipeline)
- `sed -i 's/old/new/g' file`: the `-i` flag edits in place
References
- GNU sed Manual - Official documentation for sed stream editor with comprehensive examples and advanced features.
- GNU awk Manual - Complete reference for awk programming language including patterns, actions, and built-in functions.
- The UNIX Philosophy - Wikipedia article explaining the design principles behind small, composable UNIX tools.
- Sed - An Introduction and Tutorial by Bruce Barnett - Comprehensive sed tutorial covering basic to advanced usage with practical examples.
- Awk - A Tutorial and Introduction by Bruce Barnett - Detailed awk tutorial explaining field processing, patterns, and programming constructs.
- Linux Pipes and Redirection Tutorial - Educational guide to understanding stdin, stdout, stderr, and redirection operators.
- Advanced Bash-Scripting Guide: I/O Redirection - In-depth coverage of redirection techniques and file descriptors.
- Text Processing Commands - Linux Journey - Beginner-friendly introduction to text processing tools and regular expressions.
- Sed and Awk 101 Hacks eBook - Practical examples and tips for sed and awk mastery.
- Understanding Linux File Descriptors - Explains stdin (0), stdout (1), and stderr (2) in detail.
- Xargs Tutorial with Examples - TecMint guide showing how to use xargs for batch operations.
- The Tee Command Explained - Tutorial on using tee for splitting output to files and pipelines.
- Tr Command Tutorial - GeeksforGeeks guide to character translation and deletion with tr.
- Sed by Example Part 1 - IBM Developer tutorial series on practical sed usage.
- Awk by Example - IBM Developer tutorial introducing awk fundamentals for text processing.
- Linux Pipeline Tutorial - Ryan's Tutorials guide to building command pipelines.
- Regular Expressions in Sed and Awk - DigitalOcean tutorial on regex patterns for text matching.
- Advanced Awk: Arrays and Functions - GNU documentation on awk's advanced programming features.
- Sed Advanced Topics - GNU guide to sed's hold space, branching, and multi-line operations.
- Unix Power Tools: Text Processing - O'Reilly book chapter covering sed, awk, and pipeline techniques for real-world problems.