Monday, February 25, 2013

Grep for value in a column

When doing data analysis, you often need to find records with a given value in a given column, but you want to return the whole line. The grep command will be able to do some of that but has a number of drawbacks. See older post about that at
http://niftypctricks.blogspot.com/2008/08/greping-for-column-value.html

A much simpler (and cleaner) way is to use the awk or gawk command from the command line. Note that the awk/gawk command line may look slightly different depending on whether you are on Linux, Cygwin, MKS toolkit, DOS, etc.

Assuming we have a sample tab delimited text file named in.txt:
line1 joe smith ca 4085551212
line2 joe carlson az 3334445555
line3 carl smith ny 2049998888
line4 joe smith or 5035551234
line5 mike erwin ca 4159876543
line6 mike erwin CA 4159876543

To find all records when column 5 (phone number) contained the value 555, simply use
cat in.txt | gawk -v FS='\t' 'match($5,"555")' > out.txt

The awk / gawk command "match" returns true if pattern1 ($5, meaning column 5) contains pattern2 (555), and by default that prints the whole line, so no need to explicitly put "print $0" in there. So the code sample above will return line 1, 2, and 4.

Nifty #NerdTip,
/T



No comments: