Thursday, January 17, 2008

Command line tools for dummies: Stringing them together

In the last blog entry we introduced a few basic UNIX command line tools. Now we'll start stringing them together to make them really useful.

We'll use the following 3-column tab delimited file containing:
line1 A Mike
line2 B Andy
line3 D Chang
line4 A Tom
line5 B Mike
line6 C Brad

Most UNIX command line tools read from stdin (standard in) when no file is given and write to stdout (standard out) by default. This allows us to string commands together so that the output of one command becomes the input of the next. On both UNIX and Windows, you use the pipe character (aka vertical bar, aka |) to string commands together.

So "grep Mike file.txt | wc -l" will first find lines with Mike in the file named file.txt and then send those lines (2 of them in our case) on to the wc command, which in turn will count them and output the result (2 in our case). Let's try a few quick examples.
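If you want to try this from scratch, here is a minimal sketch. The scratch directory and the printf line that builds the file are my own assumptions for illustration; any way of creating a tab delimited file.txt will do.

```shell
# Build the 6-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tB\tAndy\nline3\tD\tChang\nline4\tA\tTom\nline5\tB\tMike\nline6\tC\tBrad\n' > "$tmp/file.txt"

# grep writes the matching lines to stdout; the pipe hands them to wc as stdin.
count=$(grep Mike "$tmp/file.txt" | wc -l)
echo "$count"
```

Note that wc never sees the file itself here, only the lines grep lets through.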

Question:
How many different grades (column 2) did we hand out?
Answer:
cut -f 2 file.txt | sort -u | wc -l
4
Explained:
Cut field 2 | sort them uniquely (-u removes duplicates) | count how many. An alternative solution is of course:
"cut -f 2 file.txt | sort | uniq | wc -l", but "sort -u" is shorter than "sort | uniq".

Question:
What was the distribution of grades (column 2)?
Answer:
cut -f 2 file.txt | sort | uniq -c
2 A
2 B
1 C
1 D
Explained:
Cut field 2 | sort them | unique them (remove duplicates) and add a count with -c (note that uniq expects sorted input).
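To see why the sort matters, here is a small sketch that runs uniq -c with and without it. The scratch directory and variable names are assumptions for illustration only.

```shell
# Build the 6-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tB\tAndy\nline3\tD\tChang\nline4\tA\tTom\nline5\tB\tMike\nline6\tC\tBrad\n' > "$tmp/file.txt"

# Without sort: uniq only merges adjacent equal lines, and no two equal
# grades sit next to each other in this file, so nothing is merged.
unsorted=$(cut -f 2 "$tmp/file.txt" | uniq -c | wc -l)

# With sort: equal grades end up adjacent, so uniq can count them.
sorted=$(cut -f 2 "$tmp/file.txt" | sort | uniq -c | wc -l)

echo "without sort: $unsorted lines, with sort: $sorted lines"
```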


Question:
Same question as the above, but if you have a rather large file with many values, then your distribution will get very long. Often you may just want to see the top 2.
Answer:
cut -f 2 file.txt | sort | uniq -c | sort -rn | head -2
2 A
2 B

Explained:
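Cut field 2 | sort them | unique them (remove duplicates) and add a count with -c | sort results in reverse and treat as numeric (since we want to sort on the added count) | show first 2. Values with the same count (like A and B here) may come out in either order, depending on your version of sort.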
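If you find yourself typing this pipeline a lot, you could wrap it in a small shell function. This is just a sketch: top_values is a hypothetical name I made up, and it assumes a tab delimited file like ours.

```shell
# top_values FILE COLUMN N
# Hypothetical helper: print the N most frequent values in a column,
# each prefixed with its count.
top_values() {
  cut -f "$2" "$1" | sort | uniq -c | sort -rn | head -n "$3"
}

# Build the 6-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tB\tAndy\nline3\tD\tChang\nline4\tA\tTom\nline5\tB\tMike\nline6\tC\tBrad\n' > "$tmp/file.txt"

top_values "$tmp/file.txt" 2 2   # top 2 grades
top_values "$tmp/file.txt" 3 1   # most frequent name
```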

Question:
What grades (column 2) did Mike get?
Answer:
cut -f 2,3 file.txt | grep Mike | cut -f 1
A
B
Explained:
Cut field 2 and 3 (otherwise we could potentially find Mike elsewhere) | find only lines containing Mike | cut out the grade (now in field 1 of the 2 fields we have left).
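One caveat with the grep approach is that it would also match a name like Mikey. As a hedged alternative, here is a sketch using awk (one of the tools mentioned in the first entry) to compare the whole name field. The scratch directory and variable names are assumptions for illustration.

```shell
# Build the 6-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tB\tAndy\nline3\tD\tChang\nline4\tA\tTom\nline5\tB\tMike\nline6\tC\tBrad\n' > "$tmp/file.txt"

# -F '\t' splits each line on tabs; print field 2 only when
# field 3 is exactly "Mike".
grades=$(awk -F '\t' '$3 == "Mike" { print $2 }' "$tmp/file.txt")
printf '%s\n' "$grades"
```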


In the next blog, we'll look at handling multiple input files. But for now, practice with the above.

Nifty.

Wednesday, January 16, 2008

Command line tools for dummies

Let me start by saying that I receive no kickbacks or any other compensation from any of the products that I mention. I would certainly like to, but I don't :-) So the products mentioned are simply products that I have come to love and simply can't live without in my daily life.

Back in the days at the school of engineering at Santa Clara University, I took a great course in UNIX command line tools (grep, awk, sed, etc.), and ever since I have not been able to live without them. I mean, how do you quickly view the last few lines of a very large log file without a cool command line tool like tail, or how do you quickly find which XSLT file contains the XML tag named recordIdentifier without using a tool like grep?



So when my career took a turn from a UNIX environment to a Windows environment, I was scrambling to find some cool command line tools for Windows. I found my comfort in the MKS Toolkit, which is all the beloved UNIX command line tools compiled for Windows. It is not the cheapest tool ($479), but it is worth every penny if you, for example, do a lot of text file handling like me. It comes with hundreds of UNIX command line tools, shells, etc.

So the next few blog entries will be dedicated to learning a few nifty tricks on how to use UNIX command line tools. It is not going to be extremely advanced, but it will solve a lot of the everyday problems you encounter. Warning: Once you learn these little tricks, your co-workers will regard you as a complete nerd or even a hack :-) If you are working on a UNIX platform, the tools are of course readily available at the prompt. If you are on Windows, go purchase a set of command line tools (like the MKS Toolkit I mentioned). If you work on a Mac, you probably haven't heard the term command line before :-)

Anyway, let's start out with the very basics by first introducing a few simple tools. The power of command line tools is in stringing them together, but before we get there, we need to understand what each tool does. Please note that you can always access the online help for each tool by typing "man toolname". These online help pages (aka man pages) contain descriptions, available options, and examples, and they are a great source for exploring the tools even further.

For all the examples below, assume that you have a very simple tab delimited text file named file.txt with 3 columns. An easy way to create it is to create it in Excel and then paste it into your favorite text editor. The file content is shown below:
line1 A Mike
line2 C Chang
line3 A Tom
line4 B Mike
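If you already have a shell at hand, you can skip Excel and create the file directly. This is a minimal sketch; the scratch directory and the sanity check are my own assumptions for illustration.

```shell
# Build the 4-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tC\tChang\nline3\tA\tTom\nline4\tB\tMike\n' > "$tmp/file.txt"

# Sanity check: list the distinct number of tab-separated fields per line.
# A single "3" means every line has exactly 3 columns.
fields=$(awk -F '\t' '{ print NF }' "$tmp/file.txt" | sort -u)
echo "$fields"
```

If the check prints anything other than a single 3, the tabs probably got turned into spaces along the way.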

OK, here we go with the top 7 command line tools I use the most. Each tool has a large number of options, but each example shows just one of them.

grep: Find lines containing or not containing a text string

-i: option to ignore case (i.e. case insensitive)
C:\>grep -i mike file.txt
line1 A Mike
line4 B Mike
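The "not containing" part is handled by the -v option, which inverts the match. A small sketch, assuming the same 4-line file built in a scratch directory (the directory and variable names are mine):

```shell
# Build the 4-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tC\tChang\nline3\tA\tTom\nline4\tB\tMike\n' > "$tmp/file.txt"

# -v prints the lines that do NOT contain the string; -i still ignores case.
not_mike=$(grep -v -i mike "$tmp/file.txt")
printf '%s\n' "$not_mike"
```

This should leave only the Chang and Tom lines.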

wc (word count): Show number of lines, words, etc.
-l: option to get only the line count
C:\>wc -l file.txt
4 file.txt

tail: Show only last few lines.
head: Show only first few lines.
-10: you can specify how many lines you want
C:\>tail -2 file.txt
line3 A Tom
line4 B Mike
C:\>head -1 file.txt
line1 A Mike

cut: Cut file vertically by characters or fields
-f: option to specify fields to cut
C:\>cut -f1,3 file.txt
line1 Mike
line2 Chang
line3 Tom
line4 Mike

uniq: show unique or repeated lines
-c: option to count number of times a line occurs
C:\>uniq -c file.txt
1 line1 A Mike
1 line2 C Chang
1 line3 A Tom
1 line4 B Mike

sort: Sort input lines
-k: specify which fields to sort on
C:\>sort -k2 file.txt
line1 A Mike
line3 A Tom
line4 B Mike
line2 C Chang
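One detail worth knowing: -k2 means "from field 2 to the end of the line", which is why "A Mike" lands before "A Tom" above. To sort on exactly one field, give a start and an end, as in -k2,2. The sketch below sorts by name first and grade second; the scratch directory, the tab variable, and the choice of keys are assumptions for illustration.

```shell
# Build the 4-line sample file in a scratch directory (\t is a tab).
tmp=$(mktemp -d)
printf 'line1\tA\tMike\nline2\tC\tChang\nline3\tA\tTom\nline4\tB\tMike\n' > "$tmp/file.txt"

# Use a literal tab as the field separator so fields match our columns.
tab=$(printf '\t')

# -k3,3 sorts on the name only; -k2,2 breaks ties on the grade.
by_name=$(sort -t "$tab" -k3,3 -k2,2 "$tmp/file.txt")
printf '%s\n' "$by_name"
```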

Ok, that was the very very basic stuff. Not terribly useful when you only use one command. The real nifty stuff we'll look at in the next blog entry.

Nifty

Wednesday, January 2, 2008

Copy your AutoCorrect list

As you may have read in the latest blog, AutoCorrect is a very useful feature in most Microsoft Office applications.

Once you have created your own personal list of favorite words to expand, you probably want to bring a copy of that AutoCorrect list with you to all the other computers that you use. This can easily be accomplished in a few different ways. There are several comprehensive tools out there, including some Microsoft tools, but I mostly only need to copy the AutoCorrect list file, so I do it the old-fashioned way. I copy the physical file.

The AutoCorrect list file has an extension of ACL (AutoCorrect List). The file is named MSOxxxx.acl where xxxx designates the language that the file is for. There is an ACL file for each language you have used. For example, MSO1033.acl is the ACL file for English (US). Lastly, the file is typically located in your application data folders, so you'll likely find it in "C:\Documents and Settings\NN.NN\Application Data\Microsoft\Office", where NN.NN is your Windows user name.

So simply copy the file (e.g. "C:\Documents and Settings\NN.NN\Application Data\Microsoft\Office\MSO1033.acl") to the same location on your other PC and you now have the same AutoCorrect list on both.

As you will now start growing your very own AutoCorrect list, you can even share it between home, work, client assignments, etc. I personally have a USB stick (aka Pen Drive) that I always carry with me, and in my tools folder on the USB drive, I always have the latest version of my ACL file.

Nifty.