Friday, August 20, 2010

Editing text files with sed -- adding a line in particular places

On Unix and Unix-like systems there is a stream editor called "sed" (which obviously stands for stream editor) which is very useful for editing large text files. It's a stream editor because it takes a text file one line at a time, edits it according to your wishes, and outputs that edited line -- sorta' like a stream. I frequently use sed to make wholesale changes to text files. One could use an editor like Emacs or Kedit or others to -- maybe -- make the same changes. However, the powerful and beautiful thing about sed is that it accepts regular expressions.

Suppose you have a file that contains dates like 1-Jan-2000, 15-Jul-1998, and so on, and you want to replace the years with their two-digit equivalents (2000 with just 00, 1998 with 98, ...). That would be a somewhat tedious task with the typical search and replace operation available in editors or word-processing programs. However, with regular expressions one can craft a symbolic expression that means "find me a number with one or two digits, followed by a dash, followed by three letters the first of which is upper case, then a dash, and finally four digits". A stream editor can then be used to replace the four-digit year with just its last two digits.
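As a rough sketch of that idea, the substitution below captures the day-and-month part and the last two digits of the year, then drops the century. The function name shorten_years is made up for illustration, and the pattern assumes three-letter month abbreviations like the ones in the examples above.

```shell
# Hypothetical helper: reads text on stdin, writes the edited text to stdout.
shorten_years() {
  # \(...\) marks a captured group:
  #   group 1 = one or two digits, a dash, the month, a dash  (e.g. "15-Jul-")
  #   group 2 = the last two digits of the year               (e.g. "98")
  # The two digits in between (the century) are matched but not kept.
  sed 's/\([0-9]\{1,2\}-[A-Z][a-z]\{2\}-\)[0-9]\{2\}\([0-9]\{2\}\)/\1\2/g'
}

printf '%s\n' 'Born 1-Jan-2000, moved 15-Jul-1998.' | shorten_years
# prints: Born 1-Jan-00, moved 15-Jul-98.
```

The trailing g makes sed replace every date on a line rather than only the first one.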

Although I've been using sed for a while and consider myself quite proficient with it, I ran into an interesting problem: if a line begins with a date, I want to insert a blank line before it. While my previous experience with sed was confined to acting on each line individually, this was an attempt to insert a new line into the stream. Here's the command that did the trick:

sed '/^[0-9]\{1,2\}-/i\
\
' inputfile.txt

The regular expression begins with the first forward slash and ends with the second forward slash, and it means "find me all lines that begin with one or two digits followed by a dash" -- in the file that I was editing, this was sufficient to find all the relevant lines. The key to inserting a line is everything after the second forward slash, up to the closing '. The i is sed's insert command, and here it essentially says "insert a blank line before whatever line matched the regular expression." Note: one has to literally type the backslash at the end of the line, then hit the [Enter] key on the keyboard, then another backslash and [Enter], then the closing ' and the name of the input file.
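To see the effect on a small sample, here is a sketch that gets the same result with a different technique: sed's hold space rather than the insert command, which avoids the backslash-and-[Enter] typing. The function name and the sample lines are invented for illustration.

```shell
# Hypothetical helper: reads text on stdin, writes the edited text to stdout.
blank_before_dates() {
  # For lines that start with one or two digits and a dash:
  #   x  swap the line with the hold space (which is empty here)
  #   p  print that empty line, i.e. a blank line
  #   x  swap the original line back; sed then prints it as usual
  sed '/^[0-9]\{1,2\}-/{x;p;x;}'
}

printf '%s\n' '1-Jan-2000 first entry' 'some notes' '15-Jul-1998 second entry' |
  blank_before_dates
# prints:
#   (blank line)
#   1-Jan-2000 first entry
#   some notes
#   (blank line)
#   15-Jul-1998 second entry
```

Either way, sed writes the result to standard output and leaves the input file untouched, so redirect the output to a new file if you want to keep it.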

If you regularly edit large text files making wholesale changes to them, I highly recommend sed -- it's very powerful and fast. There is a steep learning curve but once you start using it, you'll be doing all kinds of powerful edits that previously took you hours in literally seconds.

Review of "The Girl with the Dragon Tattoo" (Stieg Larsson)

Somehow ended up watching the movie on Netflix's instant streaming service before reading the book. Thoroughly enjoyed the movie. Didn't like the book. The movie doesn't leave you enough time to think about whether something makes sense or not. The book does. This is the problem. I find most mysteries to have little credibility because when you think about the progression of events, they don't quite make sense. In fact, I'm going to stick my neck out and say non-fiction mysteries, i.e., real-world murder cases or scientific mysteries, are way more interesting and credible than the average Agatha Christie type mystery.

Skip the book and watch the movie. And, oh, there are some really sick people in the world.