Thursday, January 17, 2008

Command line tools for dummies: Stringing them together

In the last blog entry we introduced a few basic UNIX command line tools. Now we'll start stringing them together to make them really useful.

We'll use the following 3-column, tab-delimited file, named file.txt:
line1 A Mike
line2 B Andy
line3 D Chang
line4 A Tom
line5 B Mike
line6 C Brad

Most UNIX command line tools read from stdin (standard in) and write to stdout (standard out) by default. This allows us to string commands together so that the output of one command becomes the input of the next. On both UNIX and Windows, you use the pipe character (aka vertical bar, aka |) to string commands together.

So "grep Mike file.txt | wc -l" will first find the lines containing Mike in the file named file.txt and then send those lines (2 of them in our case) on to the wc command, which in turn counts them and outputs the result (2 in our case). Let's try a few quick examples.
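
If you want to follow along, the sketch below recreates the sample file and runs that first pipeline. It assumes a POSIX shell; building the file with printf and the variable name mike_lines are just choices made for this illustration.

```shell
# Recreate the sample file. printf reuses the format for every group of
# three arguments, and each \t becomes a real tab character.
printf '%s\t%s\t%s\n' \
  line1 A Mike \
  line2 B Andy \
  line3 D Chang \
  line4 A Tom \
  line5 B Mike \
  line6 C Brad > file.txt

# grep writes the matching lines to stdout; the pipe hands them to wc -l on stdin.
mike_lines=$(grep Mike file.txt | wc -l)
echo "Lines containing Mike: $mike_lines"
```

Running grep Mike file.txt on its own first is a handy way to see exactly which lines get passed down the pipe before they are counted.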

Question:
How many different grades (column 2) did we hand out?
Answer:
cut -f 2 file.txt | sort -u | wc -l
4
Explained:
Cut field 2 | sort them uniquely (-u removes duplicates) | count how many. An alternative solution is of course "cut -f 2 file.txt | sort | uniq | wc -l", but "sort -u" is shorter than "sort | uniq".
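
As a quick check that the two variants really agree, here is a small sketch that runs both on the sample file. The variable names with_sort_u and with_uniq are made up for this example.

```shell
# Recreate the sample file (tab-delimited, three columns).
printf '%s\t%s\t%s\n' \
  line1 A Mike \
  line2 B Andy \
  line3 D Chang \
  line4 A Tom \
  line5 B Mike \
  line6 C Brad > file.txt

# Variant 1: let sort drop the duplicates itself.
with_sort_u=$(cut -f 2 file.txt | sort -u | wc -l)

# Variant 2: sort first, then let uniq drop adjacent duplicates.
with_uniq=$(cut -f 2 file.txt | sort | uniq | wc -l)

echo "sort -u:     $with_sort_u"
echo "sort | uniq: $with_uniq"
```

On the sample data both variants should report 4 distinct grades.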

Question:
What was the distribution of grades (column 2)?
Answer:
cut -f 2 file.txt | sort | uniq -c
2 A
2 B
1 C
1 D
Explained:
Cut field 2 | sort them | unique them (remove duplicates) and prefix each value with its count (-c). Note that uniq only collapses duplicates that sit next to each other, which is why it expects sorted input.
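
To see why the sort matters, the sketch below counts how many rows uniq -c produces with and without sorting first. The variable names are only for illustration.

```shell
# Recreate the sample file (tab-delimited, three columns).
printf '%s\t%s\t%s\n' \
  line1 A Mike \
  line2 B Andy \
  line3 D Chang \
  line4 A Tom \
  line5 B Mike \
  line6 C Brad > file.txt

# Without sort: column 2 is A B D A B C, so no two equal grades are adjacent
# and uniq treats every line as its own group.
rows_unsorted=$(cut -f 2 file.txt | uniq -c | wc -l)

# With sort: equal grades end up next to each other and get merged.
rows_sorted=$(cut -f 2 file.txt | sort | uniq -c | wc -l)

echo "rows without sort: $rows_unsorted"
echo "rows with sort:    $rows_sorted"
```

Without the sort you get six rows, each with a count of 1. With it you get the four-row distribution.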


Question:
Same question as the above, but if you have a rather large file with many values, then your distribution will get very long. Often you may just want to see the top 2.
Answer:
cut -f 2 file.txt | sort | uniq -c | sort -rn | head -2
2 A
2 B

Explained:
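Cut field 2 | sort them | unique them with a count (-c) | sort the results numerically (-n, since we want to sort on the added count) and in reverse (-r, largest first) | show the first 2. A and B are tied at 2 here, so which of the two is listed first can depend on your version of sort.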

Question:
What grades (column 2) did Mike get?
Answer:
cut -f 2,3 file.txt | grep Mike | cut -f 1
A
B
Explained:
Cut fields 2 and 3 (otherwise we could potentially find Mike elsewhere on the line) | find only the lines containing Mike | cut out the grade (now field 1 of the 2 fields we have left).
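
To illustrate the "find Mike elsewhere" problem, here is a sketch using a made-up two-line file, pitfall.txt, where the text Mike also shows up in column 1. Both the file and its contents are hypothetical and not part of the sample data above.

```shell
# Hypothetical file: column 1 of the first row happens to contain "Mike",
# but that row's grade belongs to Tom.
printf '%s\t%s\t%s\n' \
  Mike-notes A Tom \
  line2 B Mike > pitfall.txt

# grep on the whole line matches both rows, so Tom's A sneaks in.
grep_first=$(grep Mike pitfall.txt | cut -f 2)

# Cutting columns 2 and 3 first means grep never sees column 1.
cut_first=$(cut -f 2,3 pitfall.txt | grep Mike | cut -f 1)

echo "grep first:"
echo "$grep_first"
echo "cut first:"
echo "$cut_first"
```

Keep in mind that grep still matches substrings, so a name like Mikey in column 3 would be picked up by either version.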


In the next blog entry, we'll look at handling multiple input files. But for now, practice with the above.

Nifty.
