Text processing and filtering

split

Joining files together is all well and good, but sometimes you want to split them up. For example, I might split my password-cracking dictionary into smaller chunks so that I can farm out the processing across multiple systems:

$ split -d -l 1000 dictionary dictionary.
$ wc -l *
  98569 dictionary
   1000 dictionary.00
   1000 dictionary.01
   1000 dictionary.02
...

Here, I'm splitting the file called dictionary into 1000-line chunks (-l 1000, which is actually the default) and giving dictionary. (note the trailing dot) as the prefix for the names of the resulting files. I also tell split to use numeric suffixes (-d) rather than letters. Finally, I use wc -l to count the number of lines in each file and confirm that I got what I wanted.

Note that you can also specify a dash (-), meaning standard input, instead of a file name. This approach can be useful when you want to split the output of a very verbose command into manageable chunks (e.g., tcpdump | split -d -l 100000 - packet-info).
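To check that nothing is lost along the way, the chunks can be glued back together with cat and compared against the original. What follows is only a sketch under a few assumptions: it uses a throwaway 2500-line file generated with seq as a stand-in for the real dictionary, the scratch directory and the names rebuilt and numbers. are made up for illustration, and the -d option assumes a split that supports numeric suffixes (GNU coreutils does).

```shell
# Scratch directory so the chunk files are easy to clean up afterwards.
workdir=$(mktemp -d)

# Stand-in for the dictionary file: 2500 numbered lines.
seq 1 2500 > "$workdir/dictionary"

# Same options as in the article: 1000-line chunks, numeric suffixes.
split -d -l 1000 "$workdir/dictionary" "$workdir/dictionary."

# Count the chunks: two full ones plus one holding the 500 leftover lines.
chunks=$(ls "$workdir"/dictionary.* | wc -l)

# The shell expands dictionary.00, dictionary.01, ... in sorted order,
# so cat rebuilds the original. cmp is silent when the files match.
cat "$workdir"/dictionary.[0-9]* > "$workdir/rebuilt"
cmp "$workdir/dictionary" "$workdir/rebuilt" && roundtrip=ok

# Reading from standard input works the same way, using a dash.
seq 1 2500 | split -d -l 1000 - "$workdir/numbers."
stdin_chunks=$(ls "$workdir"/numbers.* | wc -l)
```

Because the numeric suffixes are zero-padded, a plain shell glob returns the chunks in the right order, which is what makes the cat step safe.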

tr

The tr command allows you to transform one set of characters into another. The classic example is mapping uppercase letters to lowercase. For this example, to transform the capitals file I used previously, I'll use:

$ tr A-Z a-z < capitals
a
b
c
...

But this is a rather silly example. A more useful task for tr is this little hack for looking at data under /proc:

$ cd /proc/self
$ cat environ
GNOME_KEYRING_SOCKET=/tmp/keyring-lFz8t4/socketLOGNAME=halGDMSESSION=default...
$ tr \\000 \\n <environ
GNOME_KEYRING_SOCKET=/tmp/keyring-lFz8t4/socket
LOGNAME=hal
GDMSESSION=default
...

Some files under /proc, such as environ and cmdline, are delimited with nulls (ASCII zero), so when you dump one of them to the terminal, everything just runs together, as shown in the output of the cat command above.

By converting the nulls (\000) to newlines (\n), everything becomes much more readable. The extra backwhacks (\) in the tr command here are necessary because the shell normally interprets the backslash as a special character. Doubling them up tells the shell to pass a literal backslash through to tr.
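If the doubled backslashes look hard to read, single quotes are another way to keep the shell from touching them. The sketch below compares the two spellings on a small null-delimited sample built with printf, so it does not depend on /proc being available. The temporary file is an assumption made for illustration, and the sample values are borrowed from the output shown earlier.

```shell
# Build a small NUL-delimited sample, similar in shape to /proc/self/environ.
sample=$(mktemp)
printf 'LOGNAME=hal\000GDMSESSION=default\000' > "$sample"

# Doubled backslashes: the shell reduces \\000 to \000 before tr sees it.
doubled=$(tr \\000 \\n < "$sample")

# Single quotes: the shell hands \000 and \n to tr untouched.
quoted=$(tr '\000' '\n' < "$sample")

# Both spellings give tr the same arguments, so the results are identical.
printf '%s\n' "$quoted"
rm -f "$sample"
```

Either form works. Single quotes tend to be easier to get right when the command ends up inside a script or an alias.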

Instead of converting one set of characters to another, you can use the -d option simply to delete a particular set of characters from your input. For example, if you don't happen to have a copy of the dos2unix command handy, you can always use tr to remove those annoying carriage returns:

$ tr -d \\r <dos.txt >unix.txt
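A quick way to convince yourself that the carriage returns are really gone is to compare byte counts before and after. This is a minimal sketch that uses two temporary files and a two-line sample invented for the purpose, rather than a real DOS file.

```shell
# Two scratch files standing in for dos.txt and unix.txt.
dos=$(mktemp)
unix=$(mktemp)

# Two lines with DOS-style CRLF line endings (25 bytes in total).
printf 'first line\r\nsecond line\r\n' > "$dos"

# Delete every carriage return, as in the command above.
tr -d '\r' < "$dos" > "$unix"

# One byte per line should have disappeared.
before=$(wc -c < "$dos")
after=$(wc -c < "$unix")
echo "before: $before bytes, after: $after bytes"
```

Note that tr removes every carriage return in the input, not just the ones at the ends of lines. For ordinary text files that makes no difference.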

Or, for a sillier example, here's a way for all you fans of The Matrix to get a spew of random characters in your terminal:

$ tr -d -c '[:print:]' </dev/urandom

Here I'm using [:print:] to specify the set of printable characters, but I'm also employing the -c (complement) option, which means all characters not in this set. Thus, I end up deleting everything except the printable characters. (The single quotes keep the shell from treating [:print:] as a file name pattern.)
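The same trick can do something slightly more practical if you cut the stream off after a fixed number of characters, for example to produce a throwaway random string. This is a sketch, not a vetted password generator: the variable name token and the length of 32 are arbitrary choices, and head -c is assumed to be available (it is in GNU coreutils and on the BSDs).

```shell
# LC_ALL=C makes tr work on plain bytes, so [:print:] means the ASCII
# printable characters, from space through tilde.
# head -c 32 stops reading after 32 bytes, which ends the pipeline.
token=$(LC_ALL=C tr -d -c '[:print:]' < /dev/urandom | head -c 32)

printf '%s\n' "$token"
```

Because [:print:] includes the space character, the string may contain spaces. Swapping in '[:graph:]' or '[:alnum:]' narrows the set if that is a problem.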
