Saturday, February 2, 2008

What's the difference?

If you work with plain text files, you often end up with two almost identical files and need to know what differs between them. This is quite easy to do with the UNIX command line tool diff. It shows you every line that differs and tells you which file each line comes from. The diff command has a lot of cool options, so don’t forget to do a "man diff" first to see them all. I personally use -w and -b the most: -w ignores all white space, while -b ignores changes in the amount of white space, including white space at the end of a line.

Question:
What is the difference between two text files, ignoring all white space?
Answer:
diff -w 1.txt 2.txt
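
To get a feel for how -w and -b differ, here is a small sketch using throwaway files. The file names a.txt, b.txt and c.txt and the helper function same are made up for this illustration; only diff and its -b and -w options come from the text above.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# b.txt differs from a.txt only in the amount of white space.
# c.txt has white space in a place where a.txt has none.
printf 'foo bar\n'    > a.txt
printf 'foo   bar \n' > b.txt
printf 'f oo bar\n'   > c.txt

# Hypothetical helper: prints "same" when diff exits with status 0
# (no differences found), and "differ" for any other exit status.
same() {
    if diff "$@" > /dev/null; then echo same; else echo differ; fi
}

b_ab=$(same -b a.txt b.txt)   # only the amount of white space changed
b_ac=$(same -b a.txt c.txt)   # white space added where there was none
w_ac=$(same -w a.txt c.txt)   # all white space ignored

echo "-b a.txt b.txt: $b_ab"
echo "-b a.txt c.txt: $b_ac"
echo "-w a.txt c.txt: $w_ac"
```

With -b the first pair should compare as equal and the second should not, while -w should treat the second pair as equal too.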

However, sometimes you do not want all the differences, but merely want to know which text strings occur in one file and not in the other.

Question:
You have two text files and want to see which text strings in column 1 occur in only one of the two files.
Answer:
cut -f1 1.txt | sort -u > 1u.txt
cut -f1 2.txt | sort -u > 2u.txt
cat 1u.txt 2u.txt | sort | uniq -u

Explained:
Cut column 1 from file 1 | sort and remove duplicates > save in a temporary file. Repeat for file 2. Then concatenate the two temporary files | sort the content | show only the lines that occur once, since a duplicate means the line was present in both files. (cut splits columns on tabs by default; use -d to choose another delimiter.)

If you want to show which lines are in both files instead, simply change the option for uniq to -d (to show only the duplicates):
cat 1u.txt 2u.txt | sort | uniq -d
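
A quick way to convince yourself that both variants work is to run them on tiny sample files. The sketch below assumes tab-separated input, and the fruit names and numbers are invented test data; the commands themselves are the ones from the answer above.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Invented, tab-separated sample data (column 1 is the string of interest).
printf 'apple\t1\nbanana\t2\napple\t3\n' > 1.txt
printf 'banana\t9\ncherry\t8\n'          > 2.txt

# Column 1 of each file, sorted, with duplicates removed.
cut -f1 1.txt | sort -u > 1u.txt   # apple, banana
cut -f1 2.txt | sort -u > 2u.txt   # banana, cherry

# uniq -u keeps lines that occur once, uniq -d keeps lines that repeat.
only_one=$(cat 1u.txt 2u.txt | sort | uniq -u)
in_both=$(cat 1u.txt 2u.txt | sort | uniq -d)

echo "in one file only:"; echo "$only_one"
echo "in both files:";    echo "$in_both"
```

Here apple and cherry each occur in just one file, and banana occurs in both.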

Note that this solution does not show you which file the unique string came from. Also, cat is like the DOS command type, though much cooler. It concatenates the content of all files given as arguments and sends the result to stdout (as most UNIX commands do).
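
If you do need to know which file a unique string came from, one possible alternative is the comm command. This is a different technique from the sort | uniq approach above, not a variation of it. comm expects sorted input, which the sort -u step already provides. The sample data is again invented.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Invented, tab-separated sample data.
printf 'apple\t1\nbanana\t2\n'  > 1.txt
printf 'banana\t9\ncherry\t8\n' > 2.txt

# comm needs sorted input; sort -u gives that and removes duplicates.
cut -f1 1.txt | sort -u > 1u.txt
cut -f1 2.txt | sort -u > 2u.txt

# comm prints three columns: lines only in file 1, lines only in
# file 2, and lines in both. The digits name the columns to hide.
only_in_1=$(comm -23 1u.txt 2u.txt)   # hide columns 2 and 3
only_in_2=$(comm -13 1u.txt 2u.txt)   # hide columns 1 and 3

echo "only in 1.txt: $only_in_1"
echo "only in 2.txt: $only_in_2"
```

In the same way, comm -12 would list the strings present in both files.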

Nifty.
