Text processing and filtering

cut and awk

head and tail are useful for selecting particular sets of lines from your input, but sometimes you want to extract particular fields from each input line. The cut command is useful when your input has regular delimiters, such as the colons in /etc/passwd:

$ cut -d: -f1,6 /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin
...

The -d option specifies the delimiter used to separate the fields on each line, and -f allows you to specify which fields you want to extract. In this case, I'm pulling out the usernames and the home directory for each user. cut also lets you pull out specific sequences of characters by using -c instead of -f. Here's an example that filters the output of ls -l so that you see just the permissions flags and the file name:

$ ls -l | cut -c2-10,52-
otal 1540
rwxr-xr-x acpi
rw-r--r-- adduser.conf
rw-r--r-- adjtime
...

Darn! The output contains the header line from ls -l. Happily, tail will help with this:

$ ls -l | tail -n +2 | cut -c2-10,52-
rwxr-xr-x acpi
rw-r--r-- adduser.conf
rw-r--r-- adjtime
...

That looks better! Notice the syntax with tail here. The -n option is the alternative (POSIX-ly correct) way of specifying the number of lines tail should output. So, tail -10 and tail -n 10 are equivalent. If you prefix the number of lines with +, as in the example above, it means start with the specified line. So, here I'm telling tail to display all lines from the second line onward. The + syntax only works after -n.
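If you want to try the tail -n +2 trick without depending on what happens to be in your current directory, here is a minimal sketch that runs on made-up data. The listing variable is a hypothetical stand-in for ls -l output, not something from a real system. I only cut the permission flags here, because the character offset of the file name depends on how wide ls makes the other columns on your machine.

```shell
# Hypothetical sample data standing in for "ls -l" output.
listing='total 8
-rwxr-xr-x 1 root root 120 acpi
-rw-r--r-- 1 root root 300 adjtime'

# tail -n +2 starts output at line 2, which drops the "total" header.
# cut -c2-10 keeps characters 2 through 10: the nine permission flags.
perms=$(printf '%s\n' "$listing" | tail -n +2 | cut -c2-10)
printf '%s\n' "$perms"
```

With this sample, the two lines printed are rwxr-xr-x and rw-r--r--.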

cut is wonderful for lots of tasks, but the output of many commands is separated by white space and often irregular. The awk command is best for dealing with this kind of input:

$ ps -ef | awk '{print $1 "\t" $2 "\t" $8}'
UID     PID     CMD
root    1       /sbin/init
root    2       [kthreadd]
root    3       [migration/0]
...

awk automatically breaks up each input line on white space and assigns each field to variables named $1, $2, and so on. awk is a fully functional scripting language with many different capabilities, but at its simplest, you can just use the print command to output particular input fields as I'm doing here.
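To see why awk's white space handling matters, here is a small sketch that compares awk and cut on the same line. The line variable is an invented ps-style record with runs of spaces between the fields, so treat it as an illustration rather than real ps output.

```shell
# Hypothetical ps-style line with several spaces between fields.
line='root         1     0  /sbin/init'

# awk treats any run of blanks as a single separator, so $2 is the PID.
awk_pid=$(printf '%s\n' "$line" | awk '{print $2}')

# cut treats every single space as a separator, so field 2 is the
# empty string between the first and second space.
cut_pid=$(printf '%s\n' "$line" | cut -d' ' -f2)

printf 'awk: [%s]  cut: [%s]\n' "$awk_pid" "$cut_pid"
```

On this input, awk reports 1 while cut reports an empty field, which is exactly the irregular spacing problem described above.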

awk also allows you to select specific lines from your input with the use of pattern matching or other conditional operators, which saves you from first having to filter your input with grep or some other tool. For example, suppose I wanted the filtered ps output above, but only for my own processes:

$ ps -ef | awk '/^hal / {print $1 "\t" $2 "\t" $8}'
hal     7445    /usr/bin/gnome-keyring-daemon
hal     7460    x-session-manager
hal     7566    /usr/bin/dbus-launch
...

Here, I use the pattern match operator (/…/) to produce output only for lines that start with hal<space>. The command ps -ef | awk '($1 == "hal") …' would accomplish the same thing.
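The trailing space in /^hal / is easy to overlook, so here is a short sketch of what it buys you. The procs variable holds invented process records, and halley is a hypothetical second user whose name merely starts with hal.

```shell
# Hypothetical process list: user, PID, command.
procs='hal 7445 /usr/bin/gnome-keyring-daemon
halley 7500 /bin/bash
root 1 /sbin/init'

# Anchored pattern with a trailing space: matches only user "hal".
by_regex=$(printf '%s\n' "$procs" | awk '/^hal / {print $2}')

# Exact comparison on the first field: same result.
by_field=$(printf '%s\n' "$procs" | awk '$1 == "hal" {print $2}')

# Without the trailing space, "halley" sneaks in as well.
loose=$(printf '%s\n' "$procs" | awk '/^hal/ {print $2}')

printf 'regex: %s\nfield: %s\nloose: %s\n' "$by_regex" "$by_field" "$loose"
```

The field comparison is arguably the safer habit, because it does not depend on how the columns happen to be padded.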

You can use the -F option with awk to specify a delimiter other than white space. This lets you use awk in places where you might normally use cut, but where you want to use awk's conditional operators to match specific input lines. Suppose you want to output usernames and home directories as in the first cut example, but only for users with directories under /home:

$ awk -F: '($6 ~ /^\/home\//) { print $1 ":" $6 }' /etc/passwd
sabayon:/home/sabayon
hal:/home/hal
laura:/home/laura

Rather than matching against the entire line, the command here uses the ~ operator to pattern match against a specific field only.
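Here is a rough sketch of the difference between matching one field and matching the whole line. The passwd_sample variable is invented data in /etc/passwd format. The svc account is a hypothetical entry whose comment field happens to mention /home, although its home directory is elsewhere.

```shell
# Hypothetical entries in /etc/passwd format.
passwd_sample='root:x:0:0:root:/root:/bin/bash
hal:x:1000:1000:Hal:/home/hal:/bin/bash
svc:x:1001:1001:see /home/svc/README:/srv/svc:/bin/sh'

# Match only against field 6 (the home directory).
field_match=$(printf '%s\n' "$passwd_sample" |
    awk -F: '$6 ~ /^\/home\// {print $1 ":" $6}')

# Match against the whole line: the comment field of "svc" also hits.
line_match=$(printf '%s\n' "$passwd_sample" |
    awk -F: '/\/home\// {print $1 ":" $6}')

printf 'field match:\n%s\n\nline match:\n%s\n' "$field_match" "$line_match"
```

The field match returns only hal:/home/hal, while the whole-line match also returns svc:/srv/svc.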

sort

Sorting your output is often useful:

$ awk -F: '($6 ~ /^\/home\//) { print $1 ":" $6 }' /etc/passwd | sort
hal:/home/hal
laura:/home/laura
sabayon:/home/sabayon

By default, sort simply sorts alphabetically from the beginning of each line of input. Sometimes numeric sorting is what you want, and sometimes you want to sort on a specific field in each input line.

Here's a classic example that shows how to sort your password file by the user ID field (useful for spotting duplicate UIDs and when somebody has added illicit UID 0 accounts):

$ sort -n -t: -k3 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
...

The -n option indicates a numeric sort, -t specifies the field delimiter (like cut -d or awk -F), and -k specifies the field(s) to sort on (clearly they were running out of option letters).
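To make the effect of -n visible, here is a small sketch on invented /etc/passwd-style data. One detail that goes beyond the example above: a key written as -k3 runs from field 3 to the end of the line, so I write -k3,3 here to restrict the key to the UID field alone. The trailing cut just trims the output to username and UID.

```shell
# Hypothetical entries in /etc/passwd format, deliberately out of order.
passwd_sample='hal:x:1000:1000:Hal:/home/hal:/bin/bash
bin:x:2:2:bin:/bin:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Numeric sort on field 3 only.
numeric=$(printf '%s\n' "$passwd_sample" | sort -n -t: -k3,3 | cut -d: -f1,3)

# Same key without -n: compared as text, so "1000" lands before "2".
textual=$(printf '%s\n' "$passwd_sample" | sort -t: -k3,3 | cut -d: -f1,3)

printf 'numeric:\n%s\n\ntextual:\n%s\n' "$numeric" "$textual"
```

The numeric sort yields root, daemon, bin, hal, whereas the textual sort puts hal (UID 1000) ahead of bin (UID 2).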

Also, you can reverse the sort order with -r to get descending sorts:

$ ls /etc/rc3.d | sort -r
S99stop-readahead
S99rmnologin
S99rc.local
...

Remember the find command that I used wc -c with to get byte counts for all files under a given directory? Well, you can sort that output numerically in descending order and then filter with head, which prints 10 lines by default, to get a list of the 10 largest files under your chosen directory:

$ find /var/log -type f -exec wc -c {} \; | sort -nr | head
44962814 /var/log/vnetlib
24748291 /var/log/syslog
24708201 /var/log/mail.log
24708201 /var/log/mail.info
10243792 /var/log/ConsoleKit/history
3902994 /var/log/syslog.0
3782642 /var/log/mail.log.0
3782642 /var/log/mail.info.0
1039348 /var/log/vmware/hostd-7.log
804391 /var/log/installer/partman
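If you would rather not read /var/log while experimenting, the same pipeline can be tried in a scratch directory. This is only a sketch: the directory comes from mktemp -d, the file names big, medium, and small are made up, and the final awk simply strips the byte counts so that only the paths remain.

```shell
# Build a throwaway directory with three files of known sizes.
demo_dir=$(mktemp -d)
printf 'aaaaaaaaaa' > "$demo_dir/big"      # 10 bytes
printf 'aaaaa'      > "$demo_dir/medium"   # 5 bytes
printf 'a'          > "$demo_dir/small"    # 1 byte

# Same pipeline as above, but head -n 2 keeps only the two largest
# files, and awk prints just the path (the second field).
largest=$(find "$demo_dir" -type f -exec wc -c {} \; |
    sort -nr | head -n 2 | awk '{print $2}')

printf '%s\n' "$largest"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

The output lists the path of big first and medium second. Note that the awk step assumes the paths contain no white space.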
