Skip to main content

Fun with grep and sed

At work we have several java files that have javadocs with links that are not hyperlinked with . So I wanted to covert the links to hyperlinks. We wanted to convert only links that start with "Automates ", followed by one or more links that ends with a number. Example "Automates http://something/12345 and http://something/67890 but not http://something/54321". I wanted to do the conversion with one line of a bash command (trying to avoid writing the bash script). While tackling the problem I learnt a few things that I want to share and record here for myself to look back again in future when I forget.

To start with I needed to find all the files containing "Automates http://". I just wanted the filenames containing that string. And so comes grep to the rescue. With -l switch to list just the filenames instead of all the lines that match.

grep -R -l "Automates http://" *

Then it is time to replace the links with <a href=link>link</a> only for those lines containing "Automated http://". For this I want the line number of every line of every file that contained it. Getting the file number and line number is easy with grep. To get the filename use -H switch and to get the line number use -n switch. Here is an example

grep -R -H -n "Automates http://" *

The output of the above command looks something like this

/home/chandanp/temp/temp.java:73:  /** Automates http://something/353571 */

To replace the link with hyperlink we can used sed. All we need is the filename, the line number and the string to replace. And use sed like so

sed -i '936s|\(http.*[0-9]\)|<a href="\1">\1</a>|g' Filename.java

Where 936 is the line number I want to change and Filename.java is the filename I want to edit. The -i option edits the file in place. The more complicated part is the regex matching. Basically anything that matches the regex inside a \( and \) will be stored in a buffer. The buffer number is the number of the matching \(\). So in the example above, the first buffer is the string that matches \(http.*[0-9]\). Which is basically any link that ends with a number. To recall the buffer we use \1. Which means: use the value that matches the first parenthesis pair. So in the sed the replaced string will be <a href="link">link</a>, where link is the string that matches \(http.*[0-9]\). Here is an example of the change


/** Automates <a href="http://something/353571">http://something/353571</a> */

Notice another thing with the way I used sed's replace command above. I used s|match|replace instead of the usual s/match/replace. What many people don't know is that once can use any character after s instead of the usual /. So you could even do s#match#replace too if you want. I used the pipe symbol.

Now that we can replace each individual line of each file we somehow have combine the previous grep output with this sed command. That was tricky. First we need to break up the output of the grep command to individual filename and line numbers and then give that to sed. Well xargs, cut and sed to the rescue. We use the fact that the filename and line number are delimited by : and play some tricks

grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/"

Basically all it says is that take the output from the first grep which prints out the filename containing "Automates http://" and pipe it to xargs which takes the filename and gives it to another grep that prints filename:line_number:matched_sting and pipe that information to cut which prints the first 2 tokens that are delimited by :. We need to do the cut because the matched string also has : which means we don't want sed to use that part of information in the matching. Then we pipe the information from the cut to another sed to print the filename and line number. Here is the output after various pipes

$ grep -R -l "Automates http://" *
temp.java

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {}
temp.java:73:  /** Automates http://something/353571 */
temp.java:936:  /** Automates http://something/336439 and http://something/336438 */

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d:
temp.java:73
temp.java:936

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/"
filename is temp.java and line number is 73
filename is temp.java and line number is 936

The final piece of puzzle is to make output from the last sed into a command and then run it. So instead an output like filename is temp.java and line number is 73, we just need sed -i '73s|\(http.*[0-9]\)|<a href="\1">\1</a>|g' temp.java. So here is the command to do just that (very complicated with lots of backslashes and quotes but I did not know any better :).

$ grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/sed -i \\\'\2s|\"\\\(\"http.*[0-9]\"\\\)\"|\<a href=\"\\\1\"\>\"\\\1\"\<\/a\>|\\\' \1/"
sed -i \'73s|"\("http.*[0-9]"\)"|<a href="\1">"\1"</a>|\' temp.java
sed -i \'936s|"\("http.*[0-9]"\)"|<a href="\1">"\1"</a>|\' temp.java

Then we need to execute that command using bash. Like so

grep -R -l "Automates http://" * | xargs -I{} grep -H -n "Automates http://" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/sed -i \\\'\2s|\"\\\(\"http.*[0-9]\"\\\)\"|\<a href=\"\\\1\"\>\"\\\1\"\<\/a\>|\\\' \1/" | xargs -I{} bash -v -c "{}"

Ah finally. But there is one problem however. When there are multiple links in the same line, sed matches all of the links and creates a weird output like this:

Automates <a href=http://something/336439 and http://something/336438>http://something/336439 and http://something/336438</a>

I still don't have good solution for that. Since I have just a few of these lines I fixed them quickly using tkdiff. But anyone know how to solve it?

Comments

Johan said…
Hello!

This was just what I was looking for. I could of course have done it "manually", by opening each of my 10 files and changing the line there... but I find this solution to be far more elegant.

Tnx.

BTW... to be able to let bash execute something you can echo, I usually pipe it to sh.

This is what I ended up with:

grep -R -l "load(\[wdir 'parameters" * | xargs -I{} grep -H -n "load(\[wdir 'parameters" {} | cut -f-2 -d: | sed "s/\(.*\):\(.*\)/sed -i \'\2s|load(\\\[wdir\\\ parameters|parameters=load(\\\[wdir\\\ parameters|\' \1/"|sh
Re-ynd said…
Good tip! Thank you. Will remember that for the next time :)

Popular posts from this blog

Attesting General Power of Attorney in SF

Recently I had to go through the motions of getting a General Power of Attorney (GPA) document attested in San Francisco. I am an Indian by birth. My parents were trying to buy a house back in India for me. Since I did not want to travel to India they needed a GPA so that they can act on my behalf to sign all the documents required to buy the house. The problem however is that they needed it urgently because the seller lives in UK and wants to get all the things done quickly so he can go back. My parents send me a GPA document that they obtained from a lawyer. This is a document that will give the power to my parents to buy the said property in the document on my behalf. The lawyer said that I will have to get the document attested at an Indian Consulate in USA. The closest one for me is in SF and I can drive there in about an hour from where I live. So I though it will be like a day's work to get all the things done. I looked up at their  website  for the procedure to att

XBMC / Boxee remote control android app

I have been writing a few android apps over weekends at home and during 20% time at Google. However I never actually released any of them in the android market mainly because they were quick and dirty apps that fit my needs but perhaps would not be appealing to the general public. One such app that I quickly wrote over a couple of weekends is a XBMC remote. The media center that I use at home is XBMC and I have always wanted to have more control and faster access to my media. Using my remote to navigate through the menus is not as fast. Especially when I wanted to queue a lot of music it is very slow. So I wrote this nice little app called "XBMC remote" for my android phone to control XBMC from anywhere :). Give it a try. Search for "xbmc" in android market and install it if you use XBMC as your media center. When you first launch the app you will start with this screen. You will have to setup your web server address, username and password (if required) by

gtkdocize not found

If you are ever configuring an app and see the message "gtkdocize not found" in Gentoo, then you need to emerge gtk-doc. I had some hard time figuring this out so I am writing it in my blog for the next time. When I saw that error message I did an "emerge -s gtkdocize". Usually it is that simple in Gentoo. But not this time. The emerge command returned no results at all. Then I searched for gtkdoc and still no luck. After searching in Google, I still did not have a solution. After thinking for a while I decided to try to search for gtk-doc. Bingo! That worked! Interestingly, this is my first post from my Virtual machine :-)