Monday, February 18, 2008

Nifty Regular Expressions (RegExp)

I personally use TextPad as my default text editor. There are probably some much fancier ones out there, but TextPad has all the features that I demand, like great regExp support, vertical cut/paste, and a few other nifty features. It is free to download, and only something like $15 to buy. Well worth it. After all, that is not much more than that venti Caramel Macchiato that you normally order. And TextPad has great help on regular expressions. I strongly encourage you to read their regExp help. And by the way, if you try TextPad, then I would suggest that you first change these 4 settings:
1) Under Preference/General, set Context Menu, so you can quickly send any file to TextPad
2) Under Preference/Editor, set Microsoft compatible.
3) Under Preference/View, set line numbers.
4) Under Preference/Assoc Files, add any file type you wish to open with TextPad (e.g. .txt)

OK, back to our scheduled programming.
Below are a few tricks I've come to appreciate with regular expressions. I do expect the reader to have basic familiarity with regexp (^ means beginning of line, $ means end of line, . means any character, etc.).

Problem: Remove all lines containing a certain text string.
Find: ^.*FINDME.*\n
Replace:
Explained: Locate the string you want, and select the entire line (including the newline character). Simply replace with nothing. Note that if you chose $ instead of \n, then you would end up with a lot of blank lines instead, but assuming that you want to completely remove these lines, you need to use \n at the end.
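
If you are not in TextPad, the same clean-up can be sketched on the command line with grep. This is only a minimal sketch: remove_matching_lines is a made-up helper name, and the sample text is invented for the illustration. The -v option inverts the match, so only the lines that do NOT contain the string are printed.

```shell
# remove_matching_lines is a hypothetical helper name used only in this sketch.
# It reads text on stdin and prints every line that does not contain $1.
remove_matching_lines() {
    # -v inverts the match: keep the lines that do NOT match the pattern
    grep -v "$1"
}

# Invented sample input: the middle line contains FINDME and is dropped.
printf 'keep one\ndrop FINDME here\nkeep two\n' | remove_matching_lines FINDME
```

This prints the two "keep" lines only, with no blank line left behind, which is the same effect as matching the \n in the TextPad version.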

Problem: Insert line numbers in front of all lines
Find: ^
Replace: \i(100,10)\t
Explained: Find beginning of line, replace with a numeric counter starting at 100 and incrementing by 10 followed by a tab. So now you'll have a first column with numbers 100,110,120, etc. in it. By the way, \i by default starts at 1 and increments by 1.
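
The \i counter is a TextPad feature, but the same numbering can be sketched with awk on the command line. A minimal sketch, assuming you want the same start value (100) and step (10) as above; number_lines is a made-up helper name.

```shell
# number_lines is a hypothetical helper name used only in this sketch.
# It prefixes every input line with a counter (100, 110, 120, ...) and a tab.
number_lines() {
    # NR is awk's 1-based line number, so line 1 gets 100, line 2 gets 110, etc.
    awk '{ printf "%d\t%s\n", 100 + (NR - 1) * 10, $0 }'
}

# Invented sample input with three lines.
printf 'alpha\nbeta\ngamma\n' | number_lines
```

The three sample lines come out prefixed with 100, 110 and 120, each followed by a tab.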

Problem: You have a text file with dates in a DDMMYYYY (day, month, year) format in column 1 and you would like to quickly convert them over to a YYYYMMDD format. This is quickly done using regular expressions.
Find: ^\(..\)\(..\)\(....\)
Replace: \3\2\1
Explained: Create 3 match sets of 2, 2, and 4 characters respectively from the beginning of the line. Simply put the 3 match sets in the desired order. If you are unfamiliar with match sets, then \( and \) define each set and \1 refers to the first set, \2 to the second, etc. Please note that if you use POSIX style regexp then you do not need to escape the parentheses (i.e. use (..) instead of \(..\) ) to create the match sets.
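
The very same Find/Replace pair also works in sed, which uses the escaped \( \) style by default. A minimal sketch, with a made-up helper name (ddmmyyyy_to_yyyymmdd) and invented sample rows:

```shell
# ddmmyyyy_to_yyyymmdd is a hypothetical helper name used only in this sketch.
# It rewrites the first 8 characters of every line from DDMMYYYY to YYYYMMDD.
ddmmyyyy_to_yyyymmdd() {
    # \1 = day (2 chars), \2 = month (2 chars), \3 = year (4 chars)
    sed 's/^\(..\)\(..\)\(....\)/\3\2\1/'
}

# Invented sample rows: a date in column 1, then a tab, then some text.
printf '18022008\tfirst row\n02022008\tsecond row\n' | ddmmyyyy_to_yyyymmdd
```

The dates in column 1 come out as 20080218 and 20080202, and the rest of each line is left alone. Like the TextPad version, this blindly reorders the first 8 characters, so it assumes every line really starts with a date.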

Nifty.

Saturday, February 2, 2008

Searching XSLT files

I work a lot with XML and XSLT files, and hence I often need to grep files (i.e. search for text strings in files) for certain terms. Windows Explorer does a great job for most file types, but unfortunately the Windows Explorer search function does not include any XSLT files when searching. Just try to create a dummy file called a.xslt with a dummy text string FINDME, and try to locate it using Windows Explorer search. You will not find it. Sad, but true. If someone knows of a simple tweak (e.g. registry change) to change this behavior, then please post it in the comment section of this entry. I'm not sure why XSLT files are excluded. After all, JavaScript (.js) and C# (.cs) files are included. Maybe XSLT just made the Microsoft black list since it is not a Microsoft standard.

Assuming for now that we cannot change the Windows Explorer search behavior, I wanted to show you how to search all files, including XSLT files, using UNIX command tools. If you just need to search the current folder, you can of course use
grep FINDME *.xslt

And if you have a fixed 3 level folder structure you may do something like this
grep FINDME *\*\*.xslt
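but note that this will ONLY look at files that sit at exactly that level of the folder structure (e.g. it will not search the file named a.xslt in the root folder). Also note that in a UNIX shell the path separator is a forward slash, so there the pattern would be */*/*.xslt.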

So we need a much more flexible solution that can search files located at any directory level. But let me first introduce yet another UNIX command line tool: find. The find command recursively lists all files and folders matching a given criterion (e.g. name, create date, etc.) and may execute a command on each file found. Again, do a "man find" to get an overview of the options.

List all files recursively:

find . -type f


List all files or directories with the string xslt in them:

find . | grep -i xslt


Find any file that contains the text FINDME (the -l option makes grep print the name of each matching file rather than the matching lines):
find . -type f -exec grep -l FINDME "{}" ";"

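Note that when using -exec you also need to add those special "{}" ";" arguments at the end: {} is replaced with the name of each file found, and ; marks the end of the command (both are quoted so that the shell passes them on to find untouched). Don't want to go into great detail here, but you always need to add them. Otherwise you can basically specify any command you want after -exec, like we did with grep in the above case.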
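
To tie this back to the original problem, here is a minimal sketch that only looks inside XSLT files, at any folder depth. The helper name search_xslt and the little demo folder are made up for the illustration. It ends the -exec command with + instead of ";", which hands many file names to a single grep instead of starting one grep per file.

```shell
# search_xslt is a hypothetical helper name used only in this sketch.
# $1 = text to look for, $2 = folder to start from (defaults to the current folder).
search_xslt() {
    # -name '*.xslt' limits the search to XSLT files at any depth;
    # grep -l prints only the names of the files that contain the text.
    find "${2:-.}" -type f -name '*.xslt' -exec grep -l "$1" '{}' +
}

# Invented demo folder: two XSLT files contain FINDME, one does not.
demo=$(mktemp -d)
mkdir -p "$demo/one/two"
printf 'FINDME\n' > "$demo/a.xslt"
printf 'FINDME\n' > "$demo/one/two/b.xslt"
printf 'nothing here\n' > "$demo/one/c.xslt"

search_xslt FINDME "$demo" | sort

rm -rf "$demo"
```

This lists a.xslt in the root of the demo folder as well as b.xslt two folders down, and skips c.xslt.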

Nifty.

What's the difference?

If you work with plain text files, you often have 2 almost identical text files and you need to know what the difference between them is. This is quite easy to do with the UNIX command line tool diff. It will show you all the lines that are different and tell you in which file the additional text is. The diff command has a lot of cool options, so don’t forget to do a "man diff" on it first to see all the options. I personally use -w and -b the most: -w ignores all white-spaces, and -b ignores changes in the amount of white-space (including white-spaces at end of line).

Question:
What is the difference between 2 text files, ignoring all white-spaces?
Answer:
diff -w 1.txt 2.txt
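
To get a feel for how -w and -b differ, here is a minimal sketch with three tiny invented files. The helper name same_with is made up. It relies on diff's exit status, which is 0 when no differences are found.

```shell
# same_with is a hypothetical helper name used only in this sketch.
# $1 = a diff option (e.g. -w or -b), $2 and $3 = the two files to compare.
same_with() {
    # diff exits with 0 when it finds no differences; anything else
    # (differences found, or an error) is reported as "different" here.
    if diff "$1" "$2" "$3" > /dev/null; then
        echo same
    else
        echo different
    fi
}

# Invented demo files.
demo=$(mktemp -d)
printf 'hello world\n'   > "$demo/1.txt"
printf 'hello   world\n' > "$demo/2.txt"   # extra spaces between the words
printf 'helloworld\n'    > "$demo/3.txt"   # no space at all

same_with -b "$demo/1.txt" "$demo/2.txt"   # only the amount of white-space changed
same_with -b "$demo/1.txt" "$demo/3.txt"   # white-space removed entirely
same_with -w "$demo/1.txt" "$demo/3.txt"   # -w ignores all white-space

rm -rf "$demo"
```

The three calls print same, different, same: -b forgives a change in the amount of white-space but not its complete removal, while -w forgives both.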

However, sometimes you do not want to get all the differences, but merely know which text string occurs in one file that does not occur in the other.

Question:
You have 2 text files, and want to see which text strings are unique in column 1 in either file.
Answer:
cut -f1 1.txt | sort -u > 1u.txt
cut -f1 2.txt | sort -u > 2u.txt
cat 1u.txt 2u.txt | sort | uniq -u

Explained:
Cut column 1 from file 1 | remove duplicates | save in temporary file. Repeat for file 2. Then concatenate the 2 temp files with the unique entries | sort the content | show only unique entries as duplicates indicate that the line was present in both files.

If you wanted to show which lines were in both files instead, simply change the option for uniq to -d (to show only the duplicates):
cat 1u.txt 2u.txt | sort | uniq -d

Note that this solution does not show you which file the unique string came from. Also, cat is like the DOS command type, though much cooler. It concatenates the content of all files given as arguments and sends that to stdout (as most UNIX commands do).
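
If you do need to know which file each unique string came from, one option is the comm tool, which compares two sorted files line by line. This is a swapped-in technique, not the uniq approach above, and only a minimal sketch: unique_by_file is a made-up helper name and the sample files are invented.

```shell
# unique_by_file is a hypothetical helper name used only in this sketch.
# $1 and $2 = two tab-separated text files; column 1 of each is compared.
unique_by_file() {
    tmp1=$(mktemp)
    tmp2=$(mktemp)
    # Same first two steps as before: cut column 1 and remove duplicates.
    # LC_ALL=C makes sort and comm agree on the sort order.
    cut -f1 "$1" | LC_ALL=C sort -u > "$tmp1"
    cut -f1 "$2" | LC_ALL=C sort -u > "$tmp2"
    # comm -3 hides the lines found in both files. Lines only in the first
    # file start at the left margin; lines only in the second start with a tab.
    LC_ALL=C comm -3 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}

# Invented demo files: banana is in both, apple and cherry are unique.
demo=$(mktemp -d)
printf 'apple\tred\nbanana\tyellow\n' > "$demo/1.txt"
printf 'banana\tyellow\ncherry\tred\n' > "$demo/2.txt"

unique_by_file "$demo/1.txt" "$demo/2.txt"

rm -rf "$demo"
```

Here apple is printed at the left margin (only in 1.txt) and cherry is printed indented by a tab (only in 2.txt).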

Nifty.