Monday, February 25, 2013

Grep for value in a column

When doing data analysis, you often need to find records with a given value in a given column, but you want to return the whole line. The grep command will be able to do some of that but has a number of drawbacks. See older post about that at
http://niftypctricks.blogspot.com/2008/08/greping-for-column-value.html

A much simpler (and cleaner) way is to use the awk or gawk command from the command line. Note that the awk/gawk command line may look slightly different depending on whether you are on Linux, Cygwin, MKS toolkit, DOS, etc.

Assuming we have a sample tab delimited text file named in.txt:
line1 joe smith ca 4085551212
line2 joe carlson az 3334445555
line3 carl smith ny 2049998888
line4 joe smith or 5035551234
line5 mike erwin ca 4159876543
line6 mike erwin CA 4159876543

To find all records when column 5 (phone number) contained the value 555, simply use
cat in.txt | gawk -v FS='\t' 'match($5,"555")' > out.txt

The awk / gawk command "match" returns true if pattern1 ($5, meaning column 5) contains pattern2 (555), and by default that prints the whole line, so no need to explicitly put "print $0" in there. So the code sample above will return line 1, 2, and 4.

Nifty #NerdTip,
/T



Monday, February 11, 2013

Pipe to tab

Converting a pipe delimited file to a tab delimited file should be extremely simple. And many times it is if you run under a pure UNIX environment. But if you are under Windows / DOS (using Cygwin) it can be a little tricky, but there is a simple solution.

To convert a pipe delimited file, simply type:
cat FILE | tr "|" '\t' 

The trick here is that the first argument to the tr command has to use double-quotes. The standard single quotes will not work under DOS if the character you are converting is a pipe.

Nifty,
/T