Text processing and filtering

uniq

When you're extracting fields with cut and awk, you sometimes want to output just the unique values. There's a uniq primitive for this, but uniq only suppresses duplicate lines that follow one right after the other. Therefore, you must typically sort the output before handing it off.
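The adjacency behavior is easy to see with a quick throwaway demo; here I'm feeding uniq a few lines with printf instead of a real file:

```shell
# uniq alone only collapses *adjacent* duplicates, so the
# second "a" survives because a "b" sits between the copies:
printf 'a\nb\na\n' | uniq           # prints: a, b, a

# Sorting first groups the duplicates together, so uniq can merge them:
printf 'a\nb\na\n' | sort | uniq    # prints: a, b
```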

For example, to get a list of all users with processes running on the current system, use the following command:

$ ps -ef | awk '{print $1}' | sort | uniq
apache
dbus
dovecot
...

sort | uniq is such a common idiom that the sort command has a -u flag that does the same thing. Thus, you could rewrite the above example as

$ ps -ef | awk '{print $1}' | sort -u

The uniq program has lots of useful options. For example, uniq -c prefixes each line of output with a count of how many times that line occurred, and you could use this to report the number of processes running as each user, as in the following command:

$ ps -ef | awk '{print $1}' | sort | uniq -c
         8 apache
         1 dbus
         8 dovecot
...

And, with the use of another sort command, you could sort that output by the number of processes:

$ ps -ef | awk '{print $1}' | sort | uniq -c | sort -nr
      121 root
       11 hal
        8 dovecot
        8 apache
...

Another useful trick is uniq -d, which only shows lines that are repeated (duplicated) and doesn't show unique lines. For example, if you want to detect duplicate UIDs in your password file, enter:

$ cut -d: -f3 /etc/passwd | sort -n | uniq -d

In this case, I didn't get any output – no duplicate UIDs – which is exactly what I want to see.
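To see what a hit would look like, you can fake a clash with a throwaway stream of UIDs (the numbers here are made up for the demo, not from a real password file):

```shell
# Two entries share UID 1001, so uniq -d reports that value
# exactly once, no matter how many copies there are:
printf '0\n1001\n1001\n1002\n' | sort -n | uniq -d    # prints: 1001
```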

By the way, a uniq -u command will output only the unique (non-duplicated) lines in your output, but I don't find myself using this option often.
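For completeness, here's the same sort of throwaway demo for uniq -u, which is the mirror image of uniq -d:

```shell
# uniq -u keeps only the lines that appear exactly once in the
# (already sorted) input; the duplicated "a" and "c" are dropped:
printf 'a\na\nb\nc\nc\n' | uniq -u    # prints: b
```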

paste and join

Sometimes you want to glue multiple input files together. The paste command simply combines two files on a line-by-line basis, with tab as the delimiter by default. For example, suppose you had a file, capitals, containing capital letters and another file, lowers, containing the letters in lower case. To paste these files together, use:

$ paste capitals lowers
A       a
B       b
C       c
...

Or, if you wanted to use something other than tab as the delimiter, use:

$ paste -d, capitals lowers
A,a
B,b
C,c
...

But it's not really that common to want to glue files together on a line-by-line basis. More often you want to match up lines on some particular field, which is what the join command is for. The join command can get pretty complicated, so I'll provide a simple example that uses files of letters.

To put line numbers at the beginning of each line in the files, use the nl program:

$ nl capitals
     1  A
     2  B
     3  C
...

The join command could then stitch together the resulting files by using the line numbers as the common field:

$ join <(nl capitals) <(nl lowers)
1 A a
2 B b
3 C c
...

Notice the <(…) Bash syntax, known as process substitution, which substitutes the output of a command in a place where a file name would normally be used. For some reason, when I'm using join, life is never this easy. Some crazy combination of fields and delimiters always seems to be the result. For example, suppose I had one CSV file that listed the top 20 most populous countries along with their populations:

1,China,1330044544
2,India,1147995904
3,United States,303824640
...

And suppose my other file listed the capital cities of all the countries in the world:

Afghanistan,Kabul
Albania,Tirane
Algeria,Algiers
...

What if my task were to connect the capital city information with each of the 20 most populous countries? In other words, I want to glue the information in the two files together with the use of field 2 from the first file and field 1 from the second file.

The complicated thing about join is that it only works if both files are sorted in the same order on the fields you're going to be joining the files on. Normally, I end up doing some presorting on the input files before giving them to join:

$ join -t, -1 2 -2 1 <(sort -t, -k2 most-populous) <(sort cities)
Bangladesh,7,153546896,Dhaka
Brazil,5,196342592,Brasilia
China,1,1330044544,Beijing
...

The options to the join command specify the delimiter I'm using (-t,) and the fields that control the join for the first (-1 2) and second (-2 1) files. Once again, I'm using the <(…) Bash syntax, this time to sort the two input files appropriately before processing them with join.

The output isn't very pretty. join outputs the joined field first (the country name), followed by the remaining fields from the first file (the ranking and the population), followed by the remaining fields from the second file (the capital city). The cut and sort commands can pretty things up a little bit:

$ join -t, -1 2 -2 1 <(sort -t, -k2 most-populous) <(sort cities) | cut -d, -f1,3,4 | sort -nr -t, -k2
China,1330044544,Beijing
India,1147995904,New Delhi
United States,303824640,Washington D.C.
...

Examples like this are where you really start to get a sense of just how powerful the text-processing capabilities of the operating system are.
