Saturday, August 2, 2008

Grep'ing for a column value

A few days ago I needed to get those lines from a file that only contained a certain value in a certain field. To simplify the example, let's say that I have a file containing id, first name, last name, state, and phone number, and that I am only interested in those entries containing a state of California (e.g. ca or CA).

Here is the sample file named in.txt:
line1 joe smith ca 4085551212
line2 joe carlson az 3334445555
line3 carl smith ny 2049998888
line4 joe smith or 5035551234
line5 mike erwin ca 4159876543
line6 mike erwin CA 4159876543

I can not simply do a "grep -i ca in.txt" as I would get all lines containing "ca" even if "ca" occurred in the name fields (e.g. like in carl or carlson). In the above case I would get all lines, except line 4, which is of course incorrect.

If I did "cut -f4 in.txt | grep -i ca" I would get the correct number of results (i.e. 3), but I would only get the 3 "ca" values and not the whole line as I cut everything else away.

So one way to solve the problem (there may be other more clever ones I'm not aware of), is to use the "awk" command instead of the "grep" command.
cat in.txt | awk '{if ($4=="ca") print $0;}'

I think the above is self explanatory even for non-awk folks, the only trick is that $0 shows the entire line. If you only wanted to get certain columns in the output, you could decide to only print $2,$3,$5 for example to get first name, last name and phone number.

If you want to make the solution case insensitive, simply use the awk command "tolower()" in your if statement:
cat in.txt | awk '{if (tolower($4)=="ca") print $0;}'

Nifty.