Friday, January 15, 2010

Fun with grep and sed

At work we have several Java files whose Javadocs contain links that are not hyperlinked, so I wanted to convert the links to hyperlinks. We wanted to convert only the links on lines that start with "Automates ", followed by one or more links that end with a number. Example: "Automates http://something/12345 and http://something/67890 but not http://something/54321". I wanted to do the conversion with a single bash command line (trying to avoid writing a bash script). While tackling the problem I learnt a few things that I want to share and record here, so I can look back at them when I forget.

To start with, I needed to find all the files containing "Automates http://". I just wanted the names of the files containing that string, so grep comes to the rescue, with the -l switch to list just the filenames instead of all the lines that match.

grep -R -l "Automates http://" *
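To see what -l does without touching real sources, here is a minimal sketch in a throwaway directory. The directory layout and the file names A.java and B.java are made up for the illustration.

```shell
# Hypothetical sandbox: two sample files, only one contains the marker text.
demo=$(mktemp -d)
mkdir -p "$demo/src"
printf '/** Automates http://something/12345 */\n' > "$demo/src/A.java"
printf '/** See http://something/54321 */\n' > "$demo/src/B.java"

# -R recurses into directories, -l prints only the names of matching files.
matches=$(cd "$demo" && grep -R -l "Automates http://" *)
echo "$matches"   # prints: src/A.java
```

B.java is not listed because its link is not preceded by "Automates ".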

Then it is time to replace the links with <a href=link>link</a>, but only on the lines containing "Automates http://". For this I want the line number of every matching line in every file. Getting the filename and line number is easy with grep: use the -H switch to get the filename and the -n switch to get the line number. Here is an example

grep -R -H -n "Automates http://" *

The output of the above command looks something like this

/home/chandanp/temp/  /** Automates http://something/353571 */
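As a small illustration of the filename:line_number:text shape that -H and -n produce, here is a sketch on a made-up file (A.java and its contents are assumptions, not my real sources).

```shell
# Hypothetical sample file with the marker on its second line.
demo=$(mktemp -d)
printf 'line one\n/** Automates http://something/12345 */\n' > "$demo/A.java"

# -H prefixes each match with the filename, -n with the line number.
out=$(cd "$demo" && grep -R -H -n "Automates http://" *)
echo "$out"   # prints: A.java:2:/** Automates http://something/12345 */
```

Note that the three parts are separated by colons, and that the matched text itself contains another colon in http://.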

To replace the link with a hyperlink we can use sed. All we need is the filename, the line number and the string to replace. Then use sed like so

sed -i '936s|\(http.*[0-9]\)|<a href="\1">\1</a>|g'

Where 936 is the line number I want to change; the name of the file I want to edit goes at the end of the command. The -i option edits the file in place. The more complicated part is the regex matching. Anything that matches the regex inside \( and \) is stored in a buffer, and the buffers are numbered in the order their \( appears. So in the example above, the first buffer holds the string that matches \(http.*[0-9]\), which is basically any link that ends with a number. To recall the buffer we use \1, which means: use the value matched by the first parenthesis pair. So the replacement string will be <a href="link">link</a>, where link is the string that matches \(http.*[0-9]\). Here is an example of the change

/** Automates <a href="http://something/353571">http://something/353571</a> */
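The line address and the \1 back-reference can be tried safely on standard input instead of a real file. This is only a sketch: the two input lines are invented, and -i is left out so nothing gets edited.

```shell
# Two sample lines; the address "2" restricts the substitution to line 2.
input='/** See http://something/111 */
/** Automates http://something/353571 */'

# \( \) captures the link, \1 reuses it twice in the replacement.
result=$(printf '%s\n' "$input" | sed '2s|\(http.*[0-9]\)|<a href="\1">\1</a>|g')
echo "$result"
# prints:
# /** See http://something/111 */
# /** Automates <a href="http://something/353571">http://something/353571</a> */
```

The first line also contains a link, but it is left alone because the address only selects line 2.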

Notice another thing about the way I used sed's substitute command above: I used s|match|replace| instead of the usual s/match/replace/. What many people don't know is that one can use almost any character after s instead of the usual /. So you could even use s#match#replace# if you want. I used the pipe symbol.
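A quick sketch of why the alternative delimiter is handy when the pattern is a URL. The replacement host https://example is just a placeholder for the demonstration.

```shell
url='http://something/12345'

# With / as the delimiter every slash in the URL has to be escaped.
with_slash=$(echo "$url" | sed 's/http:\/\/something/https:\/\/example/')

# With | or # as the delimiter the slashes can stay as they are.
with_pipe=$(echo "$url" | sed 's|http://something|https://example|')
with_hash=$(echo "$url" | sed 's#http://something#https://example#')

echo "$with_slash"   # prints: https://example/12345
echo "$with_pipe"    # prints: https://example/12345
echo "$with_hash"    # prints: https://example/12345
```

All three commands do the same thing; only the readability differs.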

Now that we can replace each individual line of each file, we somehow have to combine the previous grep output with this sed command. That was tricky. First we need to break up the output of the grep command into individual filenames and line numbers and then give those to sed. Well, xargs, cut and sed to the rescue. We use the fact that the filename and line number are delimited by : and play some tricks

grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/"

The first grep prints the name of every file containing "Automates http://". Its output is piped to xargs, which hands each filename to another grep that prints filename:line_number:matched_string. That is piped to cut, which keeps the first two tokens delimited by :. We need the cut because the matched string also contains a : (in http://), and we don't want sed to see that part when matching. Finally the output of cut is piped to another sed that prints the filename and line number. Here is the output after the various pipes

$ grep -R -l "Automates http://" *

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {}
  /** Automates http://something/353571 */
  /** Automates http://something/336439 and http://something/336438 */

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d:

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/"
filename is and line number is 73
filename is and line number is 936
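The cut and sed steps can be checked in isolation on a single line of grep-style output. The path src/A.java below is a made-up stand-in, not one of my real files.

```shell
# One hypothetical line in grep -H -n format: filename:line_number:text
sample='src/A.java:73:/** Automates http://something/353571 */'

# Keep only the first two colon-delimited fields (filename and line number).
pair=$(echo "$sample" | cut -f-2 -d:)
echo "$pair"   # prints: src/A.java:73

# Split the pair into two capture groups and print them in a sentence.
msg=$(echo "$pair" | sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/")
echo "$msg"    # prints: filename is src/A.java and line number is 73
```

Without the cut, the extra colon in http:// would end up inside the first capture group, because .* grabs as much as it can.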

The final piece of the puzzle is to turn the output of the last sed into a command and then run it. So instead of an output like "filename is and line number is 73", we just need sed -i '73s|\(http.*[0-9]\)|<a href="\1">\1</a>|g'. So here is the command to do just that (very complicated, with lots of backslashes and quotes, but I did not know any better :).

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/sed -i \\\'\2s|\"\\\(\"http.*[0-9]\"\\\)\"|\<a href=\"\\\1\"\>\"\\\1\"\<\/a\>|\\\' \1/"
sed -i \'73s|"\("http.*[0-9]"\)"|<a href="\1">"\1"</a>|\'
sed -i \'936s|"\("http.*[0-9]"\)"|<a href="\1">"\1"</a>|\'

Then we need to execute that command using bash. Like so

grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/sed -i \\\'\2s|\"\\\(\"http.*[0-9]\"\\\)\"|\<a href=\"\\\1\"\>\"\\\1\"\<\/a\>|\\\' \1/" | xargs -I{} bash -v -c "{}"
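For comparison, here is a sketch of a different way to feed the filename and line number to sed: a while read loop instead of generating command strings. This is not the one-liner above, just an alternative that avoids most of the escaping. It assumes filenames contain no colons or newlines, uses a made-up sample file, and writes -i.bak (in-place edit with a backup copy), which as far as I know is accepted by both GNU and BSD sed.

```shell
# Hypothetical sandbox with one file whose second line has the marker.
demo=$(mktemp -d)
printf 'first line\n/** Automates http://something/353571 */\n' > "$demo/A.java"

# Collect filename:line_number pairs first, then edit each line in place.
targets=$(grep -R -H -n "Automates http://" "$demo" | cut -f-2 -d:)
printf '%s\n' "$targets" | while IFS=: read -r file lineno; do
    # The line number becomes the sed address; a .bak backup is kept.
    sed -i.bak "${lineno}s|\(http.*[0-9]\)|<a href=\"\1\">\1</a>|g" "$file"
done

cat "$demo/A.java"
# prints:
# first line
# /** Automates <a href="http://something/353571">http://something/353571</a> */
```

The loop trades the single pipeline for readability: the sed script is written once, with only the double quotes inside it escaped.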

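Ah, finally. There is one problem, however. When there are multiple links on the same line, the greedy .* in the regex matches everything from the first http to the last digit on the line, so sed treats all of the links as one and creates weird output like this: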

Automates <a href=http://something/336439 and http://something/336438>http://something/336439 and http://something/336438</a>

I still don't have a good solution for that. Since I have just a few of these lines, I fixed them quickly using tkdiff. But does anyone know how to solve it?
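One possible approach, offered as a sketch rather than a tested fix for the real files: replace .* with [^ ]* so that a match cannot run across a space. This assumes the links themselves contain no spaces, and it converts every such link on the line, which may be broader than intended if some links on an "Automates" line should stay as plain text.

```shell
line='Automates http://something/336439 and http://something/336438'

# Greedy .* swallows everything between the first http and the last digit.
greedy=$(echo "$line" | sed 's|\(http.*[0-9]\)|<a href="\1">\1</a>|g')

# [^ ]* stops at the first space, so each link is matched separately.
fixed=$(echo "$line" | sed 's|\(http[^ ]*[0-9]\)|<a href="\1">\1</a>|g')

echo "$fixed"
# prints:
# Automates <a href="http://something/336439">http://something/336439</a> and <a href="http://something/336438">http://something/336438</a>
```

The g flag is what makes sed go on to the second link after the first one has been replaced.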