Text processing and filtering

Lead Image © redrockerz, 123RF.com

Tackling Text

Enjoy a crash course on some of the text-processing and -filtering capabilities found in Linux.

Unix-like operating systems have historically been very much about text processing. Indeed, the Unix design religion is: Make simple tools whose output can be manipulated by other tools through pipes and other forms of output redirection. In this article, I'll look at the wealth of Linux command-line tools for combining, selecting, extracting, and otherwise manipulating text.

wc

The wc (word count) command is a simple filter that you can use to count the number of lines, bytes, and, yes, even the number of words in a file. (The -c option counts bytes, which match characters only in single-byte encodings; wc -m counts characters.) Although counting lines and bytes tends to be useful, I rarely find myself using wc to count words.

You can count lines in a file with wc -l:

$ wc -l kern.log
1026 kern.log

If you don't specify a file name, wc reads from standard input instead. You can exploit this feature with the following idiom, which counts the entries (files and subdirectories, excluding hidden ones) in a directory:

$ ls | wc -l
138
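Because ls lists subdirectories along with files and skips hidden entries, the count above is really a count of visible directory entries. The following is a minimal sketch of two helper functions that make the difference explicit; the names count_entries and count_files are hypothetical, and -maxdepth is an extension found in GNU and BSD find rather than part of POSIX:

```shell
# count_entries DIR: count visible entries, exactly as `ls | wc -l` does.
count_entries() {
    ls "${1:-.}" | wc -l
}

# count_files DIR: count only regular files directly inside DIR,
# including hidden ones. Assumes a find that supports -maxdepth.
count_files() {
    find "${1:-.}" -maxdepth 1 -type f | wc -l
}
```

Both functions default to the current directory when called without an argument. File names that contain newline characters would inflate either count.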

To count the number of bytes in a file, use wc -c:

$ wc -c kern.log
106932 kern.log
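When wc reads from standard input, it has no file name to print, which is convenient in scripts that only want the number. As a rough sketch (the function name file_stats is my own invention), you could combine both counts like this:

```shell
# file_stats FILE: print the line and byte counts of FILE on one line.
file_stats() {
    # Redirecting the file into wc suppresses the file name in the output.
    # The arithmetic expansion strips any padding some wc versions add.
    lines=$(( $(wc -l < "$1") ))
    bytes=$(( $(wc -c < "$1") ))
    printf '%s: %d lines, %d bytes\n' "$1" "$lines" "$bytes"
}
```

Calling file_stats kern.log would then report both figures for the logfile in a single line.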

On a single file, wc -c isn't necessarily that interesting because you could see the same information in the output of ls -l. However, if you combine wc with the find command, you get byte counts for all files in an entire directory tree:

$ find /var/log -type f -exec wc -c {} \;
79666 /var/log/kern.log.6.gz
3781 /var/log/dpkg.log.4.gz
106932 /var/log/kern.log
...
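The \; terminator runs wc once per file. A variation worth knowing, sketched here with a hypothetical tree_bytes wrapper, ends the -exec clause with + instead, so that find hands many file names to a single wc invocation:

```shell
# tree_bytes DIR: byte counts for every regular file below DIR.
tree_bytes() {
    # With "+", find batches file names into as few wc runs as possible.
    # wc prints a "total" line whenever it receives more than one file.
    find "${1:-.}" -type f -exec wc -c {} +
}
```

For very large trees, find may still split the work across several wc runs, in which case each run prints its own total line; with only one file, no total appears at all.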

After I examine a few more shell tricks in the sections that follow, I'll return to this example.

head and tail

Another pair of simple text-processing filters is head and tail, which by default extract the first 10 or the last 10 lines of their input, respectively. You can also specify a larger or smaller number of lines, either with -n N or with the shorter -N form used in this article. For example, to obtain the name of the most recently modified file in a directory, use:

$ ls -t | head -1
kern.log

Then, to see the last few lines of that file, use:

$ tail -3 kern.log
Nov 21 09:00:19 elk kernel: [11936.090452] [UFW BLOCK INPUT]: IN=eth0 OUT=...
Nov 21 09:00:21 elk kernel: [11938.083655] [UFW BLOCK INPUT]: IN=eth0 OUT=...
Nov 21 09:00:25 elk kernel: [11942.134431] [UFW BLOCK INPUT]: IN=eth0 OUT=...
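The two steps can be chained so that you never have to type the file name. What follows is only a sketch: the function name tail_newest is hypothetical, and parsing ls output like this breaks on file names that contain newlines or when the newest entry is a directory:

```shell
# tail_newest DIR [COUNT]: show the last COUNT lines (default 10)
# of the most recently modified entry in DIR.
tail_newest() {
    dir=${1:-.}
    count=${2:-10}
    newest=$(ls -t "$dir" | head -n 1)
    # Give up quietly if the directory is empty.
    [ -n "$newest" ] || return 1
    tail -n "$count" "$dir/$newest"
}
```

Run as tail_newest /var/log 3, it would print the last three lines of whichever file in /var/log changed most recently.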

Here's a trick for extracting a particular line from a file by piping head into tail:

$ head -13 /etc/passwd | tail -1
www-data:x:33:33:www-data:/var/www:/bin/sh

In this case, I am extracting the 13th line of /etc/passwd, but you could easily select any line just by changing the numeric argument that is passed to the head command.
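If you use this trick often, you could wrap it in a small function. The sketch below uses hypothetical names (nth_line and nth_line_sed) and also shows an equivalent sed one-liner for comparison. The two differ when the file is shorter than the requested line number: the head and tail version returns the last line of the file, whereas sed prints nothing.

```shell
# nth_line N FILE: print line N of FILE by piping head into tail.
nth_line() {
    head -n "$1" "$2" | tail -n 1
}

# nth_line_sed N FILE: the same result using sed, which prints only
# line N and stays silent if the file has fewer lines than that.
nth_line_sed() {
    sed -n "${1}p" "$2"
}
```

For example, nth_line 13 /etc/passwd reproduces the pipeline shown above.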

Another useful feature of the tail command is the -f option, which displays the last 10 lines of the file as usual but then keeps the file open and displays any new lines that are appended to it. This technique is particularly useful for keeping an eye on logfiles; for example, tail -f kern.log.
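Because tail -f writes to standard output like any other filter, its output can be piped onward. The following sketch watches a file for a limited time and shows only the new lines that match a pattern. The function name follow_matching is hypothetical, and the sketch assumes the timeout command from GNU coreutils and a grep that supports --line-buffered (GNU and BSD grep both do):

```shell
# follow_matching SECONDS PATTERN FILE: for SECONDS seconds, print
# lines newly appended to FILE that match PATTERN.
follow_matching() {
    # -n 0 skips the existing contents, so only new lines are shown.
    # --line-buffered makes grep emit each match immediately.
    timeout "$1" tail -n 0 -f "$3" | grep --line-buffered -- "$2"
}
```

For instance, follow_matching 60 'UFW BLOCK' kern.log would show firewall block messages that arrive during the next minute. Without timeout, tail -f runs until you interrupt it with Ctrl+C.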
