Regex in VIM.. simple

There are more than a gazillion ways to use regexes, and I am sure each is very useful for its own subset of problems. The sheer variety can be confusing and scary for a lot of people though, and you only need a few approaches to accomplish most text-editing tasks.

Here is a simple method for using regexes in the powerful text editor VIM that will work well for most common uses.

Method

We are going to take the “search and delete a word” problem as an example. We want to delete all instances of the singular noun “needle” in a text file. Let’s assume there are no instances of the pluralisation “needles” in our document.

  1. Debug on.. turn some VIM options on
    :set hlsearch
    :set wrapscan
    – the first line makes your regexes easy to debug by highlighting everything they match in your document; the second makes all searches wrap around the end of the file instead of stopping there, so you can check the whole document from wherever your cursor happens to be.
  2. Develop and Test.. your regex attempts by using a simple search. Here we see three attempts at solving the problem:
    :/needl
    :/needle
    :/\<needle\>
    – our third try is correct, and highlights only whole words that spell “needle” (the first two would also match “needle” buried inside longer words). The \< and \> markers anchor the pattern to the beginning and the end of a word. Play with different regexes using the simple search, watching what gets highlighted, until you discover one that works for you.

  3. Run… your regex
    :%s/\<needle\>//g
    – once you’ve figured out a regex, run it on your document. This example searches for the word “needle” and deletes every occurrence. If you wanted to substitute another word for “needle”, you would put that word between the second and third / marks; as you can see, there is nothing between them in this example, so every instance of “needle” is replaced with nothing, which deletes it.
  4. Check things are OK… with your document
    :/\<needle\>
    :/needle
    :/needl
    – has the regex done what you want? Use the search function again to find out. The searches above look for variations that might remain; anything they highlight points to a problem. You can use n (next search result) and N (previous search result) to move through any matches that are found. Remember to also manually look through the document and see what the regex has changed, to make sure there aren’t any unwanted surprises!
  5. Recover… from any mistakes u – just press the u key (lower case, no Caps Lock or Shift). This will undo the very last change made to the document.
  6. Redo… any work that you need to <ctrl>-r – use the redo function; press the Control and r keys together (no Caps Lock or Shift). This will redo the last change you undid.
  7. Finish up and Write… to file :w – write your work on the document out to the file. Even after you have written out to file, you can usually still use the undo function to get back to where you were, but it’s best practice not to rely on this, and only write once you’re done.
  8. Debug off.. turn some options off
    :set nohlsearch
    :set nowrapscan
    – the first line turns off the search highlighting, the second turns off the wraparound searching. You can leave either or both options on if you want, they’re often useful. Up to you.

Use a combination of these wonderful commands to test and improve your regex development skills in VIM.

Examples

Here I use the shorthand “#…” to denote comments on what I’m doing… if you want to copy and paste the example as written, then you will have to remove those comments.

1. Remove ancient uppercase <BLINK> tags from a document.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BLINK> # try 1.. bingo! first time.. selected all tags I want
:%s/<BLINK>//g # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<BLINK> # check 3.. yep looks ok... the problem tags are gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

2. Oh no! We missed some lower and mixedcase <bLiNK> tags that some sneaky person slipped in. Let’s take them out.

:set wrapscan # debug on
:set hlsearch # debug on
:/<blink> # try 1.. hm.. worked for many, but didn't match BlInK or blINK mixedcase
:/\c<blink> # try 2.. much better.. the \c makes the search case-insensitive.. seems to have worked!
:%s/<blink>//gi # lets execute my regex remove (the i flag ignores case)
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<blink> # check 3.. yep thats fine.
:/\c<blink> # check 4.. looks good... problem solved
# ...manual scroll through the document.. looks much better!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

3. Replacing uppercase or mixedcase <BR> tags with the more modern <br>.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BR> # try 1.. hmm.. just uppercase.. not gonna work..
:/<br> # try 2.. hmm.. just lowercase..
:/\c<BR> # try 3.. ahh.. \c ignores case.. that'll be it then
:%s/<BR>/<br>/gi # lets execute my regex substitution
:/BR # check 1.. testing things are OK in my file by searching through..
:/br # check 2.. yep thats ok..
:/bR # check 3 ..yup..
:/\c<BR> # check 4.. yep looks ok... the problem tags seem to be gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

For More..

Regexes are the gift that just keeps on giving; it is well worth reading up on regexes in general, and on regexes in VIM in particular.

VirtualHosts on CentOS

A common task when setting up an Apache webserver under Linux is writing an httpd.conf file, which is Apache’s main configuration file. One of the main reasons to edit httpd.conf is to set up virtual hosts in Apache. A virtual host configuration allows several different domains to be run off a single instance of Apache, on a single IP. Each host is a ‘virtual host’, and typically has its own web root, log files, and any number of subdomains aliased to it. The virtual hosts are configured in parts of the httpd.conf file that look like this:


<VirtualHost *:80>
    ServerName myserver.co.uk
    ServerAlias www.myserver.co.uk
    ServerAdmin me@myserver.co.uk
    DocumentRoot /var/www/html/myserver.co.uk
    ErrorLog logs/myserver.co.uk-error_log
    CustomLog logs/myserver.co.uk-access_log common
</VirtualHost>

Now, on Ubuntu, virtual hosts are made easy. The httpd.conf is split into several files, and each virtual host has its own file in /etc/apache2/sites-available. When you want to activate a particular virtual host, you create a symbolic link at /etc/apache2/sites-enabled/mysite pointing to /etc/apache2/sites-available/mysite (if you wanted to call your site configuration file ‘mysite’). When Apache boots up, it loads all the files it can find in /etc/apache2/sites-enabled/*, and that determines which virtual hosts it loads. If there is no link from /etc/apache2/sites-enabled/ to your virtual host file, it won’t be loaded. So you can remove links from /etc/apache2/sites-enabled without deleting the actual virtual host files, and easily toggle which virtual hosts get loaded, as sketched below.
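Just to make that mechanism concrete, here is the enable/disable dance written out as a couple of lines of Python. In practice you would simply use ln -s and rm (or Ubuntu’s a2ensite and a2dissite helpers); this sketch only illustrates what toggling a site actually does on disk.

import os

SITES_AVAILABLE = '/etc/apache2/sites-available'
SITES_ENABLED = '/etc/apache2/sites-enabled'

def enable_site(name):
  # equivalent of: ln -s /etc/apache2/sites-available/name /etc/apache2/sites-enabled/name
  os.symlink(os.path.join(SITES_AVAILABLE, name),
             os.path.join(SITES_ENABLED, name))

def disable_site(name):
  # removes only the link; the real config file in sites-available is untouched
  os.remove(os.path.join(SITES_ENABLED, name))

enable_site('mysite')  # Apache picks this up the next time it is reloaded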

CentOS uses a different structure. Everything is lumped into a single httpd.conf (under /etc/httpd/conf/), so there is no easy way to toggle virtual hosts on and off, and everything is a bit more chaotic. I’ve just had to set up a new CentOS webserver, and I struggled for a bit after being used to Ubuntu Server. Here’s a format you can use if you’re in the same boat and have to set up httpd.conf files for CentOS:

NameVirtualHost *:80 # this is essential for name-based switching

# an example of a simple VirtualHost that serves data from
# /var/www/html/myserver.co.uk to anyone who types
# www.myserver.co.uk into the browser
<VirtualHost *:80>
    ServerName myserver.co.uk
    ServerAlias www.myserver.co.uk
    ServerAdmin me@myserver.co.uk
    DocumentRoot /var/www/html/myserver.co.uk
    ErrorLog logs/myserver.co.uk-error_log
    CustomLog logs/myserver.co.uk-access_log common
</VirtualHost>


# an example of a VirtualHost with Apache overrides allowed; this means you can use
# .htaccess files in the server's web root to change your config dynamically
<VirtualHost *:80>
    ServerName bobserver.co.uk
    ServerAlias www.bobserver.co.uk
    ServerAdmin me@bobserver.co.uk
    DocumentRoot /var/www/html/bobserver.co.uk
    ErrorLog logs/bobserver.co.uk-error_log
    CustomLog logs/bobserver.co.uk-access_log common

    # allow .htaccess files to override anything in the web root
    <Directory /var/www/html/bobserver.co.uk>
        AllowOverride All
    </Directory>

    # only allow auth-related overrides here (adjust this path to whatever
    # sub-directory you want to protect)
    <Directory /var/www/html/bobserver.co.uk/protected>
        AllowOverride AuthConfig
        Order Allow,Deny
        Allow from All
    </Directory>
</VirtualHost>

# an example of a VirtualHost with Apache overrides allowed, and two subdomains
# (mail and www) that both point to the same web root
<VirtualHost *:80>
    ServerName fredserver.co.uk
    ServerAlias www.fredserver.co.uk
    ServerAlias mail.fredserver.co.uk
    ServerAdmin me@fredserver.co.uk
    DocumentRoot /var/www/html/fredserver.co.uk
    ErrorLog logs/fredserver.co.uk-error_log
    CustomLog logs/fredserver.co.uk-access_log common

    # allow .htaccess files to override anything in the web root
    <Directory /var/www/html/fredserver.co.uk>
        AllowOverride All
    </Directory>

    # only allow auth-related overrides here (adjust this path to whatever
    # sub-directory you want to protect)
    <Directory /var/www/html/fredserver.co.uk/protected>
        AllowOverride AuthConfig
        Order Allow,Deny
        Allow from All
    </Directory>
</VirtualHost>


# .. etc

With the above structure, you can add as many VirtualHosts to your configuration as you have memory to support (typically dozens). Apache will decide which one to serve based on the ServerName (and ServerAlias) values specified in each VirtualHost section. Just remember to add that all-important NameVirtualHost *:80 directive at the beginning.

Once you’ve got your httpd.conf file the way you like it, be sure to test it before you restart Apache. If you restart Apache and your httpd.conf file has errors in it, Apache will abort the load process, and all the websites on your webserver will fail to load. I always use apachectl -t (or apache2ctl -t) before I restart; that parses the httpd.conf file and checks the syntax. Once that’s OK, you can issue /etc/init.d/httpd restart to restart Apache.

Scraping Wikipedia Information for music artists, Part 2

I’ve abandoned the previous Wikipedia scraping approach for Brightonsound.com, as it was unreliable and didn’t pinpoint the right Wikipedia entry – i.e. a band called ‘Horses’ would pull up the Wikipedia bio of the animal – which doesn’t look very professional. So instead, I have used the Musicbrainz API to retrieve some information on the artist: the homepage URL, the correct Wikipedia entry, and any genres/terms the artist has been tagged with.

It would be simple to extend this to fetch the actual bio from a site like DBpedia.org (which provides XML-tagged Wikipedia data), now that you always have the correct Wikipedia page reference to fetch the data from.
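For instance, something along these lines ought to do it, reusing the DBpedia /data/ lookup and the amara xpath that the older scraping code further down this page already uses. This is only a rough sketch: the function name and the assumption that the page key is the last part of the Wikipedia URL are mine.

import urllib2
import amara

def bio_from_wikipedia_url(wikiurl):
  """Fetch the English abstract for a Wikipedia page from DBpedia,
  given the full Wikipedia URL returned by Musicbrainz."""
  # e.g. http://en.wikipedia.org/wiki/Blur_(band) -> Blur_(band)
  key = wikiurl.rstrip('/').split('/')[-1]
  try:
    f = urllib2.urlopen('http://dbpedia.org/data/' + key).read()
  except Exception, e:
    return ''
  doc = amara.parse(f)
  abstracts = doc.xml_xpath("//p:abstract[@xml:lang='en']")
  if abstracts:
    return str(abstracts[0])
  return ''

You would call it with the Wikipedia URL that the getbio() method below digs out of Musicbrainz.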

(You will need to download the musicbrainz2 Python library to use the following code):

import time
import sys
import logging
from musicbrainz2.webservice import Query, ArtistFilter, WebServiceError
import musicbrainz2.webservice as ws
import musicbrainz2.model as m

class scrapewiki2(object):

  def __init__(self):
    pass

  def getbio(self,artist):

    time.sleep(2)
    art = artist
    logging.basicConfig()
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    q = Query()

    try:
      # Search for all artists matching the given name. Limit the results
      # to the 5 best matches. The offset parameter could be used to page
      # through the results.
      #
      f = ArtistFilter(name=art, limit=1)
      artistResults = q.getArtists(f)
    except WebServiceError, e:
      print 'Error:', e
      sys.exit(1)


    # No error occurred, so display the results of the search. It consists of
    # ArtistResult objects, where each contains an artist.
    #

    if not artistResults:
      print "WIKI SCRAPE - Couldn't find a single match!"
      return ''

    for result in artistResults:
      artist = result.artist
      print "Score     :", result.score
      print "Id        :", artist.id
      try:
        print "Name      :", artist.name.encode('ascii')
      except Exception, e:
        print 'Error:', e
        sys.exit(1)

    print "Id         :", artist.id
    print "Name       :", artist.name
    print

    #
    # Get the artist's relations to URLs (m.Relation.TO_URL) having the relation
    # type 'http://musicbrainz.org/ns/rel-1.0#Wikipedia'. Note that there could
    # be more than one relation per type. We just print the first one.
    #
    wiki = ''
    urls = artist.getRelationTargets(m.Relation.TO_URL, m.NS_REL_1+'Wikipedia')
    if len(urls) > 0:
      print 'Wikipedia:', urls[0]
      wiki = urls[0]

    #
    # List discography pages for an artist.
    #
    disco = ''
    for rel in artist.getRelations(m.Relation.TO_URL, m.NS_REL_1+'Discography'):
      disco = rel.targetId
      print disco

    try:
      # The result should include all official albums.
      #
      inc = ws.ArtistIncludes(
        releases=(m.Release.TYPE_OFFICIAL, m.Release.TYPE_ALBUM),
        tags=True)
      artist = q.getArtistById(artist.id, inc)
    except ws.WebServiceError, e:
      print 'Error:', e
      sys.exit(1)

    tags = artist.tags

    # build a small block of HTML links and a tag list to return
    toret = ''
    if(wiki):
      toret = '<a href="'+wiki+'">'+art+' Wikipedia Article</a>\n'
    if(disco):
      toret = toret + '<a href="'+disco+'">'+art+' Main Site</a>\n'
    if(tags):
      toret = toret + '\nTags: '+(','.join(t.value for t in tags))+'\n'
    return toret


sw2 = scrapewiki2() # unit test
print sw2.getbio('Blur')
print sw2.getbio('fatboy slim')

PS:
Apologies to the person that left several comments on the previous Wikipedia scraping post; I have disabled comments temporarily due to heavy amounts of spam, but you can contact me using the following address: david@paul@craddock@googlemail.com (substitute the first two @s for ‘.’s). I also hope this post answers your question.

Scraping artist bios off of Wikipedia

I’ve been hacking away at BrightonSound.com, looking for a way of automatically sourcing biographical information about artists, so that visitors are presented with more information on each event.

The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab an artist bio, event listings, YouTube videos and Flickr pictures of the currently playing artist. I was reading through the mashTape code, and then found a posting by its developer which helpfully provided the exact method I needed.

I then hacked up two versions of the code. A PHP version using SimpleXML:

<?php
function scrapewiki($band){
    // search Yahoo for the band name, restricted to wikipedia.org, to find
    // the band's Wikipedia page (this mirrors the Python version below)
    $query = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=%22"
             .urlencode($band)."%22&site=wikipedia.org";
    $s = new SimpleXMLElement(file_get_contents($query));
    $ar = explode('/', (string)$s->Result->Url);
    if($ar[2] == 'en.wikipedia.org'){
      $wikikey = $ar[4]; // more than likely to be the wikipedia page
    }else{
      return ""; // nothing on wikipedia
    }
    $url = "http://dbpedia.org/data/$wikikey";
    $x = file_get_contents($url);
    $s = new SimpleXMLElement($x);
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
    return (string)$b[0];
}
?>

and a Python version using the amara XML library (which has to be installed separately):

import amara
import urllib2
from urllib import urlencode

def getwikikey(band):
  url = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=%22"+band+"%22&site=wikipedia.org";
  print url
  c=urllib2.urlopen(url)
  f=c.read()
  doc = amara.parse(f)
  url = str(doc.ResultSet.Result[0].Url)
  return url.split('/')[4]

def uurlencode(text):
   """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
   blah = urlencode({'u':text})
   blah = blah[2:]
   return blah

def getwikibio(key):
  url = "http://dbpedia.org/data/"+str(key);
  print url
  try:
    c=urllib2.urlopen(url)
    f=c.read()
  except Exception, e:
    return ''
  doc = amara.parse(f)
  b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
  try:
    r = str(b[0])
  except Exception, e:
    return ''
  return r

def scrapewiki(band):
  try:
    key = getwikikey(uurlencode(band))
  except Exception, e:
    return ''
  return getwikibio(key)

# unit test
#print scrapewiki('guns n bombs')
#print scrapewiki('diana ross')

There we go: artist bio scraping from Wikipedia.

A poor man’s VMWare Workstation: VMWare Server under Ubuntu 7.10 + VMWare Player under Windows XP

I finally set up my Dell Latitude D630 laptop the way I wanted it last night, and thought I’d do a quick writeup about it. Here is the partition table:

  1. A 40GB Windows XP partition, with VMWare Player installed, which I will be using for Windows applications that don’t play well in virtualised mode (e.g. media applications). I will also be using it as the main platform for running VMs.
  2. A basic Ubuntu 7.10 Server partition (5GB root + 1.4GB swap), with VMWare Server installed (for creating, advanced editing and network testing of VMs). I followed some VMWare Server on Ubuntu 7.10 tutorials to get it installed.
  3. A 36GB NTFS partition for storing VMs
  4. A 26GB NTFS media partition for media I want to share between VMs and the two operating systems on the disc.

We use VMWare servers at work to host our infrastructure, so this setup will be very useful for me. I can now:

  1. Take images off the servers at work and bring them up, edit them and test their network interactions under my local VMWare Server running on my Linux install.
  2. From within my Windows install, I can bring up a Linux VM and use Windows and Linux side by side.

My Nabaztag

Nabaztag (Armenian for “rabbit”) is a Wi-Fi enabled rabbit, manufactured by Violet. The Nabaztag is a “smart object”; it can connect to the Internet (for example to download weather forecasts, read its owner’s email, etc.). It is also fully customizable and programmable. (Wikipedia.org)

Here is our Nabaztag – Francois Xavier:

Meet Francois

Of course, I’ve been messing around with poor old Francois’s programming..

With the help of OpenNab, a proxy server that masquerades as an official Nabaztag server, you can make your Nabaztag do all kinds of tricks. At the moment I’m getting him to read out what’s currently showing on TV when someone presses his button.

Francois in the Surgery

Here are the technical details:

Whenever the button is pressed on Francois, he sends a message destined for the official Nabaztag server, which is caught by the OpenNab proxy server. The proxy then executes a PHP script that grabs the current TV listings from an RSS feed and composes them into a readable list. That list is sent on to the official Nabaztag server, which converts it into audio files using a text-to-speech synthesis program and streams those audio files to Francois.
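The script that does this for Francois is PHP running behind OpenNab, but the listings step is simple enough to sketch. Here is a rough Python equivalent of the “grab the RSS feed and compose a readable list” part; the feed URL is only a placeholder, and a real plugin would hand the text back to OpenNab for the rabbit rather than print it:

import urllib2
from xml.dom import minidom

# placeholder feed; substitute whatever TV listings RSS feed you actually use
FEED_URL = 'http://example.com/tv-listings.rss'

def now_showing(feed_url=FEED_URL, limit=5):
  """Fetch an RSS feed and join the first few item titles into one
  readable sentence for the rabbit to speak."""
  xml = urllib2.urlopen(feed_url).read()
  doc = minidom.parseString(xml)
  titles = []
  for item in doc.getElementsByTagName('item')[:limit]:
    title = item.getElementsByTagName('title')[0]
    titles.append(title.firstChild.data)
  return 'Now showing: ' + '. '.join(titles)

print now_showing()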

The result is a rabbit that reads the TV listings. A useful addition to our TV room.

Coming soon: More technical information on how you can do this yourself.