Text processing and filtering
cut and awk
head and tail are useful for selecting particular sets of lines from your input, but sometimes you want to extract particular fields from each input line. The cut command is useful when your input has regular delimiters, such as the colons in /etc/passwd:
$ cut -d: -f1,6 /etc/passwd
root:/root
daemon:/usr/sbin
bin:/bin
...
The -d option specifies the delimiter that separates the fields on each line, and -f specifies which fields to extract. In this case, I'm pulling out the username and home directory for each user. cut also lets you pull out specific ranges of characters by using -c instead of -f. Here's an example that filters the output of ls -l so that you see just the permission flags and the file name (the column numbers depend on how wide your ls -l output is, so adjust 52 to suit):
$ ls -l | cut -c2-10,52-
otal 1540
rwxr-xr-x acpi
rw-r--r-- adduser.conf
rw-r--r-- adjtime
...
Darn! The output contains the header line from ls -l. Happily, tail will help with this:
$ ls -l | tail -n +2 | cut -c2-10,52-
rwxr-xr-x acpi
rw-r--r-- adduser.conf
rw-r--r-- adjtime
...
That looks better! Notice the syntax with tail here. The -n option is the alternative (POSIX-ly correct) way of specifying the number of lines tail should output. So, tail -10 and tail -n 10 are equivalent. If you prefix the number of lines with +, as in the example above, it means start with the specified line. So, here I'm telling tail to display all lines from the second line onward. Use the + form only after -n; the older tail +2 spelling is obsolete and is not accepted everywhere.
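If you want to experiment without touching real system files, here is a minimal sketch of the same cut and tail options run against a few made-up passwd-style lines. The sample records and the variable names (sample, fields, chars, rest) are my own assumptions for illustration, not anything from a real system.

```shell
# Made-up records in /etc/passwd format, used only for illustration.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
hal:x:1000:1000:Hal:/home/hal:/bin/bash'

# -d: sets the delimiter, -f1,6 keeps the first and sixth fields.
fields=$(printf '%s\n' "$sample" | cut -d: -f1,6)

# -c1-4 keeps characters 1 through 4 of every line, delimiters or not.
chars=$(printf '%s\n' "$sample" | cut -c1-4)

# tail -n +2 starts output at line 2, which drops the first line.
rest=$(printf '%s\n' "$sample" | tail -n +2 | cut -d: -f1)

printf '%s\n' "$fields" "$chars" "$rest"
```

Because -c counts raw characters, the third line comes back as hal: with the colon included, which is a handy reminder that -c knows nothing about fields.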
cut is wonderful for lots of tasks, but the output of many commands is separated by white space and often irregular. The awk command is best for dealing with this kind of input:
$ ps -ef | awk '{print $1 "\t" $2 "\t" $8}'
UID     PID     CMD
root    1       /sbin/init
root    2       [kthreadd]
root    3       [migration/0]
...
awk automatically breaks up each input line on white space and assigns each field to variables named $1, $2, and so on. awk is a fully functional scripting language with many different capabilities, but at its simplest, you can just use the print command to output particular input fields as I'm doing here.
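To see the field splitting in isolation, here is a small sketch that feeds awk a few invented ps-style lines with uneven spacing. The sample text and the variable names are assumptions of mine; the only awk features used are print, the numbered field variables, and the built-in NF (the number of fields on the current line).

```shell
# Invented ps-style lines padded with uneven runs of spaces.
sample='UID   PID  PPID CMD
root    1     0 /sbin/init
hal  7445     1 /usr/bin/gnome-keyring-daemon'

# awk treats any run of blanks as one separator, so $1 and $4 are
# the first and fourth words no matter how the columns are padded.
picked=$(printf '%s\n' "$sample" | awk '{print $1 "\t" $4}')

# $NF is the last field on each line, however many fields there are.
last=$(printf '%s\n' "$sample" | awk '{print $NF}')

printf '%s\n' "$picked" "$last"
```

Trying the same thing with cut -d' ' would produce mostly empty fields, because cut treats every single space as a separator.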
awk also allows you to select specific lines from your input with the use of pattern matching or other conditional operators, which saves you from first having to filter your input with grep or some other tool. For example, suppose I wanted the filtered ps output above, but only for my own processes:
$ ps -ef | awk '/^hal / {print $1 "\t" $2 "\t" $8}'
hal     7445    /usr/bin/gnome-keyring-daemon
hal     7460    x-session-manager
hal     7566    /usr/bin/dbus-launch
...
Here, I use the pattern match operator (/…/) to produce output only for lines that start with hal<space>. The command ps -ef | awk '($1 == "hal") …' would accomplish the same thing.
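The trailing space in the pattern matters more than it might seem. The sketch below compares the two selection styles on invented input that includes a second, hypothetical user named halvard; all names and PIDs here are made up.

```shell
# Invented process lines; "halvard" is a hypothetical second user.
sample='hal 7445 /usr/bin/gnome-keyring-daemon
halvard 8001 /bin/bash
root 1 /sbin/init
hal 7460 x-session-manager'

# Regular expression anchored at line start, with the trailing space.
by_regex=$(printf '%s\n' "$sample" | awk '/^hal / {print $2}')

# Exact string comparison against the first field.
by_field=$(printf '%s\n' "$sample" | awk '$1 == "hal" {print $2}')

# Without the trailing space, the pattern also catches "halvard".
loose=$(printf '%s\n' "$sample" | awk '/^hal/ {print $2}')

printf 'regex: %s\n' $by_regex
printf 'field: %s\n' $by_field
printf 'loose: %s\n' $loose
```

The first two commands select the same PIDs, while the loose pattern picks up PID 8001 as well. Comparing $1 directly is usually the safer choice when you mean "exactly this username."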
You can use the -F option with awk to specify a delimiter other than white space. This lets you use awk in places where you might normally use cut, but where you want to use awk's conditional operators to match specific input lines. Suppose you want to output usernames and home directories as in the first cut example, but only for users with directories under /home:
$ awk -F: '($6 ~ /^\/home\//) { print $1 ":" $6 }' /etc/passwd
sabayon:/home/sabayon
hal:/home/hal
laura:/home/laura
Rather than matching against the entire line, the command here uses the ~ operator to pattern match against a specific field only.
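Here is a rough sketch of why the field-only match is worth the extra typing. The records are invented, including a hypothetical backup account whose comment field happens to mention /home/. The !~ operator in the last command is the negated form of ~.

```shell
# Invented records; the "backup" comment field mentions /home/ on purpose.
sample='root:x:0:0:root:/root:/bin/bash
backup:x:34:34:see /home/hal:/var/backups:/bin/sh
hal:x:1000:1000:Hal:/home/hal:/bin/bash'

# Whole-line match: any line containing /home/ anywhere is selected.
whole=$(printf '%s\n' "$sample" | awk -F: '/\/home\// {print $1}')

# Field match: only field 6 (the home directory) is tested.
field=$(printf '%s\n' "$sample" | awk -F: '$6 ~ /^\/home\// {print $1 ":" $6}')

# Negated field match: accounts whose home is NOT under /home.
others=$(printf '%s\n' "$sample" | awk -F: '$6 !~ /^\/home\// {print $1}')

printf '%s\n' "$whole" "$field" "$others"
```

The whole-line version reports both backup and hal, whereas the field version reports only hal.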
sort
Sorting your output is often useful:
$ awk -F: '($6 ~ /^\/home\//) { print $1 ":" $6 }' /etc/passwd | sort
hal:/home/hal
laura:/home/laura
sabayon:/home/sabayon
By default, sort simply sorts alphabetically from the beginning of each line of input. Sometimes numeric sorting is what you want, and sometimes you want to sort on a specific field in each input line.
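The difference between the default ordering and a numeric sort is easiest to see on a handful of numbers. This is just a sketch with made-up values; I set LC_ALL=C on the first command so that the character-by-character ordering does not depend on your locale.

```shell
# A few made-up numbers, one per line.
nums='10
9
2
100'

# Default sort compares character by character, so "10" precedes "2".
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort)

# -n compares the values as numbers.
numeric=$(printf '%s\n' "$nums" | sort -n)

printf 'lexical: %s\n' $lexical
printf 'numeric: %s\n' $numeric
```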
Here's a classic example that shows how to sort your password file by the user ID field (useful for spotting duplicate UIDs and when somebody has added illicit UID 0 accounts):
$ sort -n -t: -k3 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
...
The -n option indicates a numeric sort, -t specifies the field delimiter (like cut -d or awk -F), and -k specifies the field(s) to sort on (clearly they were running out of option letters). Strictly speaking, -k3 means "from field 3 to the end of the line"; -k3,3 restricts the sort key to the third field alone.
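As a sketch of the duplicate-UID check mentioned above, the following runs against invented records that include a hypothetical second UID 0 account named toor. It uses -k3,3n to limit the key to field 3 and sort it numerically, and it leans on uniq -d (which prints one copy of each repeated adjacent line) to flag the duplicates; uniq is my addition here rather than part of the example above.

```shell
# Invented records; "toor" is a hypothetical extra UID 0 account.
sample='hal:x:1000:1000:Hal:/home/hal:/bin/bash
root:x:0:0:root:/root:/bin/bash
toor:x:0:0:second root:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Sort on field 3 only, numerically, then show just name and UID.
by_uid=$(printf '%s\n' "$sample" | sort -t: -k3,3n | cut -d: -f1,3)

# Extract the UIDs, sort them, and print each value that repeats.
dups=$(printf '%s\n' "$sample" | cut -d: -f3 | sort -n | uniq -d)

printf '%s\n' "$by_uid"
printf 'duplicate UID: %s\n' $dups
```

Any UID printed by the last command is shared by more than one account and deserves a closer look.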
Also, you can reverse the sort order with -r to get descending sorts:
$ ls /etc/rc3.d | sort -r
S99stop-readahead
S99rmnologin
S99rc.local
...
Remember the find command I used with wc -c to get byte counts for all files under a given directory? You can sort that output numerically and then filter it with head to list the 10 largest files under your chosen directory:
$ find /var/log -type f -exec wc -c {} \; | sort -nr | head
44962814 /var/log/vnetlib
24748291 /var/log/syslog
24708201 /var/log/mail.log
24708201 /var/log/mail.info
10243792 /var/log/ConsoleKit/history
3902994 /var/log/syslog.0
3782642 /var/log/mail.log.0
3782642 /var/log/mail.info.0
1039348 /var/log/vmware/hostd-7.log
804391 /var/log/installer/partman
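If you would like to try this pipeline somewhere harmless, here is a minimal sketch that builds a scratch directory with three files of known size and then asks for the two largest. The directory comes from mktemp -d, and the file names are made up. I pass -n 2 to head explicitly instead of relying on its default of 10 lines.

```shell
# Build a throwaway directory with files of known sizes.
dir=$(mktemp -d)
printf 'aaaaaaaaaa' > "$dir/big.log"     # 10 bytes
printf 'aaaaa'      > "$dir/medium.log"  # 5 bytes
printf 'a'          > "$dir/small.log"   # 1 byte

# Byte count per file, largest first, top two only.
top=$(find "$dir" -type f -exec wc -c {} \; | sort -nr | head -n 2)
printf '%s\n' "$top"

# Split on "/" and keep the last piece to get bare file names.
names=$(printf '%s\n' "$top" | awk -F/ '{print $NF}')
printf '%s\n' "$names"

# Clean up the scratch directory.
rm -rf "$dir"
```

The final awk step reuses the -F trick from earlier: with / as the delimiter, $NF is the file name without its directory.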