Books that have changed the way I see the practice of Software Engineering

The Mythical Man Month and other EssaysFred Brooks – Read it as part of my university degree. A lot of good sense. I can’t believe there are people out there in charge of software projects that haven’t read this book, or think it somehow doesn’t apply anymore, but unfortunately, it seems quite common. –
[Read reviews on Good Reads]

The Cathedral and the BazaarEric S Raymound – One of the first books I read on software development process. Really blew my mind. –
[Read reviews on Good Reads]

The Pragmatic Programmer – Such an eye-opener. Such an awesome book. So many good points, even today. –
[Read reviews on Good Reads]

The Phoenix Project – A really inspiring story. Made me think that organisational positive change to a more healthy collaborative style of working is both possible and achievable. Even though I’ve never personally witnessed a change of any remotely similar magnitude happen at an organisation I’ve ever worked at –
[Read reviews on Good Reads]

Clean CodeRobert C Martin – A very important book that changed the industry, and still has a large effect today. –
[Read reviews on Good Reads]

Clean CoderRobert C Martin – A book that made me take my profession more seriously, and try and become more of a responsible professional that deeply cares about software development, instead of just a fly-by-night money maker. –
[Read reviews on Good Reads]

The Software Craftsman: Professionalism, Pragmatism, Pride by Sandro Mancuso – key to putting exactly into words my nagging doubt around Agile project management that I couldn’t quite put my finger on before. –
[Read reviews on Good Reads]

Clean AgileRobert C Martin – shows how very far away from the original intention of the Agile manifesto 99% of companies implementation of Agile really is, almost to a farcical level. Even though I don’t agree with some of its points, such as the emphasis on story points, I don’t have a problem with a lot of the original intentions behind the idea. Just the overwhelmingly poor examples of its implementation that I’ve experienced. –
[Read reviews on Good Reads]

Agile Testing Condensed – answers the question so many people were asking me when I worked in QA engineering, namely, “in a continuous integration world, where do test specialists fit in?”. Well, suffice to say, they never bothered to read this book.
[Read reviews on Good Reads]

A Philosophy of Software Design – key to putting my finger on and voicing some of the concerns I had over the years in organisations that were implementing Clean Code concepts well, but still were very lacking in their code and engineering quality. Also the most well-written technical book I’ve ever read. –
[Read reviews on Good Reads]

Video Tools I use for my Media Collection

I use and recommend the following software for Windows. They’re all quite easy to use. The paid software all have lifetime subscriptions, which are worth it because you need the latest updates in this area.

Aiseesoft Blu-ray Player: (for playing Blu-Rays) (paid)

Easyfab Lossless Copy: (for ripping Blu-Rays, also works for DVDs) (paid)

VLC: (for playing ripped copies) (free)

Handbrake: (for converting between different video formats and compressing video) (free)

Plex Media Server: (for hosting video files on your network and playing them) (free)

Movavi Video Editor Plus: (for editing video files, similar interface to iMovie) (paid)

FFMPEG: (for splitting ripped video files, for TV series episodes) (free)

For using FFMPEG to split a ripped bluray track into episode segments, I look up the series on which has fairly accurate epsiode runtimes, and then for example:

 ffmpeg.exe -i '.\Samurai Jack Season 1_001.mp4' -acodec copy -vcodec copy -ss 01:11:43 -t 00:23:00 samurai_jack_s01e04.mp4
  • .\Samurai Jack Season 1_001.mp4 is the source file
  • 01:11:43 is the start time
  • 00:23:00 is the duration
  • samurai_jack_s01e04.mp4 – is the destination file. I use Plex Media Server prefered titling for TV series, e.g. this is Series 1 Episode 4. This makes it easy to import with Plex.

I then test the episode in VLC by looking at the start and end of the file to make sure the split is correct, and adjusting the start/end times and re-running as appropriate.

Using VIM as a Word Processor

It may surprise some people to learn that I don’t use a word processor to write pure text any more, and haven’t for several months. I still have a subscription to Microsoft Office 365, and probably will as long as recruiters and offices around the world still pay the MS tax.

I have created my own custom writing environment in my favourite text editor, VIM. This may sound like an absolutely foolish thing to do to anyone who knows VIM. By default, it is not setup at all to be an effective word processor. It is a programmers text editor, and a very complex and difficult to learn one at that.

Over the past 12 years or so, I have been slowly customizing it to exactly my needs. The great thing about VIM is that it is highly customizable, and you can change almost every major element of it to suit your tastes.

I use the following plugins when writing text or markdown files.

Vim Pencil

This changes the wrapping mode and a number of other VIM options to make VIM suitable as a word processor instead of just a programmers code editor.


This is a distraction free way of writing. Because of the way I have set it up, there is a lot going on in the default interface of VIM. This strips everything away to just what you see in this screenshot:

I toggle the ‘GOYO’ mode by pressing <F12>.


VIM’s default spellchecker. I use it in British English mode (of course). As soon as I type a misspelled word, it highlights it. Then I can use the arrow keys to skip between each misspelling and fix it with some keyboard shortcuts.


This picks up words which are not helpful. For example, I have a bad habit of using the word ‘actually’ too many times. Wordy will highlight this. It has a number of dictionaries of ‘useless’ words which it will check for. It is opinionated, but I find it does help improve the quality of my writing.

LanguageTool Checker

This is a Java-based command-line open-source grammar and spell checker. It is very thorough and picks up a lot of things. I don’t know how it compares to MS Word’s checking, but I find it more useful than I remember Word’s grammar checker being.

For more information on my VIM setup, check out my dotfiles repository at

Use Siri, Apple Earphones and Apple Music Together

This requires you to have a subscription to Apple Music and a 3G/4G/wireless data connection, but it is so useful! Using this tip, you can be walking along with your iPhone in your pocket and your Apple earphones in, and then change music just by pressing a button on the earphones remote control and saying ‘Play (your favourite band)’.

blog post image

1. Activate Siri and subscribe to Apple Music. I used to subscribe to Spotify so I switched to Apple Music when I realised the advantages of the integration on my phone, and the wider selection of music. You will also need to set Siri to stream music over your cellular network, this can be done in the ‘Music’ section of iOS settings.

2. Put the iPhone in your pocket with the standard Apple Earphones plugged in. With the standard Apple Earphones, there is a remote control on the cable with one button on it. To activate Siri, hold that button down for a couple of seconds until you hear the ‘Siri’ ‘bleep’.

3. Say ‘Play The Prodigy’ if you want to listen to The Prodigy, for example. This may require a couple of tries occasionally, but usually it just works.

4. Siri should look up The Prodigy on Apple Music, find their most popular songs, put them in a playlist, start streaming them and playing them through your earphones. To skip a song, press the remote control button twice in quick succession. To adjust the volume, press the edges of your remote control, the top edge to increase volume, the bottom edge to decrease. To pause all music playback, just press the remote control button once. To resume playback, press the remote control button once again.

A limitation of this is that you have to be in an area with good reception, enough to stream your tracks from Apple Music. You shouldn’t have a problem if you live in a city like Manchester, I usually get 4G across the whole city.

Scraping Gumtree Property Adverts with Python and BeautifulSoup

I am moving to Manchester soon, and so I thought I’d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once in the database, I could do things like find the average monthly price for a 2 bedroom flat in an area, and spot bargains through using standard deviation from the mean on the price through using simple SQL queries via phpMyAdmin.

I really like the Python library BeautifulSoup for writing scrapers, there is also a Java version called JSoup. BeautifulSoup does a really good job of tolerating markup mistakes in the input data, and transforms a page into a tree structure that is easy to work with.

I chose the following layout for the program: – Stores all information about each property advert, with a ‘save’ method that inserts the data into the mysql database – Stores all the information on each listing page, which is broken down into links for specific adverts, and also the link to the next listing page in the sequence (ie: the ‘next page’ link) – When given an advert URL, this creates and populates an advert object – When given a listing URL, this creates and populates a listing object – This walks through a series of listings, calling scrapeListing and scrapeAdvert for all of them, and finishes when there are no more listings in the sequence to scrape

Here is the MySQL table I created for this project (which you will have to setup if you want to run the scraper):

-- Database: `manchester`

-- --------------------------------------------------------

-- Table structure for table `adverts`

  `url` varchar(255) NOT NULL,
  `title` text NOT NULL,
  `pricePW` int(10) unsigned NOT NULL,
  `pricePCM` int(11) NOT NULL,
  `location` text NOT NULL,
  `dateAvailable` date NOT NULL,
  `propertyType` text NOT NULL,
  `bedroomNumber` int(11) NOT NULL,
  `description` text NOT NULL,
  PRIMARY KEY (`url`)

PricePCM is price per calendar month, PricePW is price per week. Usually each advert with have one or the other specified.

import MySQLdb
import chardet
import sys

class advert:

        url = ""
        title = ""
        pricePW = 0
        pricePCM = 0
        location = ""
        dateAvailable = ""
        propertyType = ""
        bedroomNumber = 0
        description = ""

        def save(self):
                # you will need to change the following to match your mysql credentials:

                self.description = unicode(self.description, errors='replace')
                self.description = self.description.encode('ascii','ignore')
                # TODO: might need to convert the other strings in the advert if there are any unicode conversetion errors

                sql = "INSERT INTO adverts (url,title,pricePCM,pricePW,location,dateAvailable,propertyType,bedroomNumber,description) VALUES('"+self.url+"','"+self.title+"',"+str(self.pricePCM)+","+str(self.pricePW)+",'"+self.location+"','"+self.dateAvailable+"','"+self.propertyType+"',"+str(self.bedroomNumber)+",'"+self.description+"' )"


In we convert the unicode output that BeautifulSoup gives us into plain ASCII so that we can put it in the MySQL database without any problems. I could have used Unicode in the database as well, but the chances of really needing Unicode for representing Gumtree ads is quite slim. If you intend to use this code then you will also want to enter the MySQL credentials for your database.

class listing:


        def addAdvertURL(self,url):


from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2
from advert import advert
import time

class scrapeAdvert:

        page = ""
        soup = ""

        def scrape(self,advertURL):

                # give it a bit of time so gumtree doesn't
                # ban us

                url = advertURL
                # print "-- scraping "+url+" --"
                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.anAd = advert()

                self.anAd.url = url
                self.anAd.title = self.extractTitle()
                self.anAd.pricePW = self.extractPricePW()
                self.anAd.pricePCM = self.extractPricePCM()

                self.anAd.location = self.extractLocation()
                self.anAd.dateAvailable = self.extractDateAvailable()
                self.anAd.propertyType = self.extractPropertyType()
                self.anAd.bedroomNumber = self.extractBedroomNumber()
                self.anAd.description = self.extractDescription()

        def extractTitle(self):

                location = self.soup.find('h1')
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractPricePCM(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                        string = location.contents[0]
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pw specified
                        return 0

                stripped = string.replace('£','')
                stripped = stripped.replace('pcm','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'"')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractPricePW(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                        string = location.contents[0]
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pcm specified
                        return 0
                stripped = string.replace('£','')
                stripped = stripped.replace('pw','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'"')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractLocation(self):

                location = self.soup.find('span',attrs={"class" : "location"})
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractDateAvailable(self):

                current_year = '2011'

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                firstP = ul.findAll('p')[0]
                string = firstP.contents[0]
                stripped = ' '.join(string.split())
                date_to_convert = stripped + '/'+current_year
                        date_object = time.strptime(date_to_convert, "%d/%m/%Y")
                except ValueError: # for adverts with no date available
                        return ""

                full_date = time.strftime('%Y-%m-%d %H:%M:%S', date_object)
                # print '|' + full_date + '|'
                return full_date

        def extractPropertyType(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                        secondP = ul.findAll('p')[1]
                except IndexError: # for properties with no type
                        return ""
                string = secondP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractBedroomNumber(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                        thirdP = ul.findAll('p')[2]
                except IndexError: # for properties with no bedroom number
                        return 0
                string = thirdP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractDescription(self):

                div = self.soup.find('div',attrs={"id" : "description"})
                description = div.find('p')
                contents = description.renderContents()
                contents = contents.replace("'",'"')
                # print '|' + contents + '|'
                return contents

In there are a lot of string manipulation statements to pull out any unwanted characters, such as the ‘pw’ characters (short for per week) found in the price string, which we need to remove in order to store the property price per week as an integer.

Using BeautifulSoup to pull out elements is quite easy, for example:

ul = self.soup.find('ul',attrs={"id" : "ad-details"})

That finds all the HTML elements under the tag id=”ad-details”, so all the list elements in that list. More detail can be found in the Beautiful Soup documentation which is very good.

from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2
from listing import listing
import time

class scrapeListing:

        soup = ""
        url = ""
        aListing = ""

        def scrape(self,url):
                # give it a bit of time so gumtree doesn't
                # ban us

                print "scraping url = "+str(url)

                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.aListing = listing()
                self.aListing.url = url
                self.aListing.adverturls = self.extractAdvertURLs()
                self.aListing.nextLink = self.extractNextLink()

        def extractAdvertURLs(self):

                toReturn = []
                h3s = self.soup.findAll("h3")
                for h3 in h3s:
                        links = h3.findAll('a',{"class":"summary"})
                        for link in links:
                                print "|"+link['href']+"|"

                return toReturn

        def extractNextLink(self):

                links = self.soup.findAll("a",{"class":"next"})
                        print ">"+links[0]['href']+">"
                except IndexError: # if there is no 'next' link found..
                        return ""
                return links[0]['href']

The extractNextLink method here extracts the pagination ‘next’ link which will bring up the next listing page from the selection of listing pages to browse. We use it to step through the pagination ‘sequence’ of resultant listing pages.

from scrapeListing import scrapeListing
from scrapeAdvert import scrapeAdvert
from listing import listing
from advert import advert
import MySQLdb
import _mysql_exceptions

# change this to the gumtree page you want to start scraping from
url = ""

while url != None:
        print "scraping URL = "+url
        sl = ""
        sl = scrapeListing()
        for advertURL in sl.aListing.adverturls:
                sa = ""
                sa = scrapeAdvert()
                except _mysql_exceptions.IntegrityError:
                        print "** Advert " + sa.anAd.url + " already saved **"
                sa.onAd = ""

        url = ""
        if sl.aListing.nextLink:
                print "nextLink = "+sl.aListing.nextLink
                url = sl.aListing.nextLink
                print 'all done.'

This is the file you run to kick off the scrape. It uses an MySQL IntegrityError try/except block to pick out when an advert has already been entered into the database, this will throw an error because the URL of the advert is the primary key in the database. So no two records can have the same primary key.

The URL you provide it above gives you the starting page from which to scrape from.

The above code worked well for scraping several hundred Manchester Gumtree ads into a database, from which point I was able to use a combination of phpMyAdmin and OpenOffice Spreadsheet to analyse the data and find out useful statistics about the property market in said area.

Download the scraper source code in a tar.gz archive

Note: Due to the nature of web scraping, if – or more accurately, when – Gumtree changes its user interface, the scraper I have written will need to be tweaked accordingly to find the right data. This is meant to be an informative tutorial, not a finished product.

RESTful Web Services

Hammock with the background of a clear blue sky

REST (Representational State Transfer) is a way of delivering web services. When a web service conforms to REST, it is known as RESTful. The largest RESTful web service is the Hypertext Transfer Protocol (HTTP) which you use every day to send and receive information from web servers while browsing the internet.

To implement RESTful web services, you should implement four methods: GET, PUT, POST and DELETE. Resources on RESTful web services are typically defined as collections of elements. The REST methods can either act on a whole collection, or a specific element in a collection.

A collection is usually logically defined as a hierarchy on the URL, for example take this fictitious layout:


The REST methods you use do different things depending on whether you are interacting with a Collection resource or an Element resource. See below:

On a Collection: ie:
GET – Lists the URLs of the collection’s members.
PUT – Replace the entire collection with another collection.
POST – Create a new element in a collection, returning the new element’s URL.
DELETE – Deletes the entire collection.

On an Element: ie:
GET – Retrieve the addressed element in the appropriate internet media type, ie: music file or image
PUT – Replace the addressed element of the collection, or if it doesn’t exist, create it in the parent collection.
POST – Treat the addressed element of the collection as a new collection, and add an element into it.
DELETE – Delete the addressed element of the collection.

REST is a simple and clear way of implementing the basic methods of data storage; CRUD (Create, Read, Update and Delete), see:,_read,_update_and_delete

‘Weather Forecast’ Calendar Service in PHP

The BBC provide 3 day weather RSS feeds for most locations in the UK. I thought it would be interesting to create a web service to turn the weather feed into calendar feed format, so I could have a constantly updated forecast of the next 3 days of weather mapped on to my iPhone’s calendar. Here it is on my iPhone:

Picture shows weather forecast on an iPhone calendar screenshot


The service is separated into five files:

  • ical.php – this contains the class ical which corresponds to a single calendar feed. A method called ‘addevent’ allows you to add new events to the calendar, and a method called ‘returncal’ redirects the resulting calendar file to the browser so people can subscribe to it using their calendar application.
  • forecast.php – this file contains the class forecast, which has properties for all aspects that we want to record for each day’s forecast, ie: Wind Speed and Humidity. It also contains the forecast set, which is a collection of forecast objects. The set class is serializable, which means each forecast object can be stored in a text file, including the Wind Speed, Humidity and all other things we want to record for each day.
  • scrape-weather.php – this file contains code that scrapes the weather feed, populates the forecast set with all the weather information for the next 3 days, and stores the result in a file called forecasts.ser.
  • forecasts.ser – this is all the data for the three day weather forecast, in serialized format. It is automatically deleted and recreated when the scrape-weather.php script is run.
  • reader.php – this file converts the forecasts.ser file into an iCal calendar, and outputs the iCal formatted result to the calendar application that accesses reader.php page.

It uses two external libraries:

  • MagpieRSS 0.72 – this popular library is used for reading the calendar RSS feed and converting it into a PHP object that is easier to manipulate by scrape-weather.php.
  • iCalcreator 2.8 – this is used for creating the output iCal format of the calendar in ical.php and outputting it to the browser in reader.php.



	function init(){
		$config = array( 'unique_id' => '' );
		  // set Your unique id
		$this->v = new vcalendar( $config );
		  // create a new calendar instance

		$this->v->setProperty( 'method', 'PUBLISH' );
		  // required of some calendar software
		$this->v->setProperty( "x-wr-calname", "Calendar Sample" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-CALDESC", "Calendar Description" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-TIMEZONE", "Europe/London" );
		  // required of some calendar software

	function addevent($start_year,$start_month,$start_day,$start_hour,$start_min,
		$vevent = & $this->v->newComponent( 'vevent' );
		  // create an event calendar component
		$start = array( 'year'=>$start_year, 'month'=>$start_month, 'day'=>$start_day, 'hour'=>$start_hour, 'min'=>$start_min, 'sec'=>0 );
		$vevent->setProperty( 'dtstart', $start );
		$end = array( 'year'=>$finish_year, 'month'=>$finish_month, 'day'=>$finish_day, 'hour'=>$finish_hour, 'min'=>$finish_min, 'sec'=>0 );
		$vevent->setProperty( 'dtend', $end );
		$vevent->setProperty( 'LOCATION', '' );
		  // property name - case independent
		$vevent->setProperty( 'summary', $summary );
		$vevent->setProperty( 'description',$description );
		$vevent->setProperty( 'comment', $comment );
		$vevent->setProperty( 'attendee', '' );

	function returncal(){
		// redirect calendar file to browser
forecasts = new ArrayObject();

	function store(){
		$store_path = $this->store_path;
		file_put_contents($store_path, serialize($this->set));

	function scrapecurrent(){
		$url = $this->feed_url;
		$rss = fetch_rss( $url );
		$message = "";
		if(sizeof($rss->items) != 3){
			die("Problem with BBC weather feed.. dying");
		$set = new forecast_set();
		$curdate = date("Y-m-d");
		echo $curdate;
		foreach ($rss->items as $item) {
			$href = $item['link'];
			$title = $item['title'];
			$description = $item['description'];
			$curyear = date('Y',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curmonth = date('m',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curday = date('d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			preg_match('/Min Temp:.+?-*d*/',$title,$mintemp);
			preg_match('/Max Temp:.+?-*d*/',$title,$maxtemp);
			preg_match('/Wind Speed:.+?-*d*/',$description,$windspeed);
			$summary[0] = str_replace(': ','',$summary[0]);
			$summary[0] = str_replace(',','',$summary[0]);
			$mintemp[0] = str_replace('Min Temp: ','',$mintemp[0]);
			$maxtemp[0] = str_replace('Max Temp: ','',$maxtemp[0]);
			$windspeed[0] = str_replace('Wind Speed: ','',$windspeed[0]);
			$humidity[0] = str_replace('Humidity: ','',$humidity[0]);
			$mins[$i] = (int)$mintemp[0];	
			$maxs[$i] = (int)$maxtemp[0];
			$forecast = new forecast();
			$forecast->low = (int)$mintemp[0];
			$forecast->high = (int)$maxtemp[0];
			$forecast->year = (int)$curyear;
			$forecast->month = (int)$curmonth;
			$forecast->day = (int)$curday;
			$forecast->windspeed = $windspeed[0];
			$forecast->humidity = $humidity[0];
			$forecast->summary = ucwords($summary[0]);
			$curdate = date('Y-m-d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
		$this->set = $set;


$s = new scrape3day();

$c = new ical();
$f = unserialize(file_get_contents('forecasts.ser'));
	$weather_digest = "Max: ".$curforecast->high." Min: ".$curforecast->low." Humidity: ".$curforecast->humidity."% Wind Speed: ".$curforecast->windspeed."mph.";

SVN Version

If you have subversion, you can check out the project from: There are a couple extra files in that directory for my automated freezing weather alerts, but you can safely ignore those.


You will have to add this entry to your crontab to run once per day. You could set the script to run at midnight through adding the following:

0 0 * * *  

For example, in my case:

0 0 * * * /usr/local/bin/php /home/david_craddock/ 

You will then need to edit the contents of the $store_path and $feed_url variables in scrape-weather.php. Store_path should refer to a file path that the web server can create and edit files in, and feed_url should refer to the RSS feed of your local area that you have copied and pasted from the site, don’t use mine because your area is likely different. After that, you’re set to go.

Restoring Ubuntu 10.4’s Bootloader, after a Windows 7 Install

I installed Windows 7 after I had installed Ubuntu 10.4. Windows 7 overwrote the Linux bootloader “grub” on my master boot record. Therefore I had to restore it.

I used the Ubuntu 10.4 LiveCD to start up a live version of Ubuntu. While under the LiveCD, I then restored the Grub bootloader by chrooting into my old install, using the linux command line. This is a fairly complex thing to do, and so I recommend you use this approach only if you’re are confident with the linux command line:

# (as root under Ubuntu's LiveCD)

# prepare chroot directory

mkdir /chroot

# mount my linux partition

mount /dev/sda1 $d   # my linux partition was installed on my first SATA hard disk, on the first parition (hence sdA1).

# mount system directories inside the new chroot directory

mount -o bind /dev $d/dev
mount -o bind /sys $d/sys
mount -o bind /dev/shm $d/dev/shm
mount -o bind /proc $d/proc

# accomplish the chroot

chroot $d

# proceed to update the grub config file to include the option to boot into my new windows 7 install


# install grub with the new configuration options from the config file, to the master boot record on my first hard disk

grub-install /dev/sda

# close down the liveCD instance of linux, and boot from the newly restored grub bootloader


Ripping Movies onto the iPhone

I’m currently watching Persepolis, the 2008 animated film about a tomboy anarchist growing up in Iran. I’m watching this on my new iPhone 3GS, and the picture and audio quality is very good.

Here’s what I used to convert my newly bought Persepolis DVD, for watching on the iPhone.

1x Macbook (but you can use any intel mac)
1x iTunes
1x RipIt – Commercial Mac DVD Ripper (rips up to 10 DVDs on the free trial, $20 after)
1x Handbrake 32 – Freely available transcoder
1x VLC 32 – Freely available media player
1x DVD

* Ripit – rips the video and audio from the DVD, onto your computer
* Handbrake 32 – ‘transcodes’ the ripped video and audio, meaning – it converts it into an iPhone compatible video file.
* VLC 32 – is used by Handbrake 32 to get past any problems with converting the media.

Go to the following sites to fetch the software:

1. Ripit –
2. Handbrake 32 – (get the 32 bit version)
3. VLC 32 – (be sure to get the 32 bit version)

There’s currently a difficulty in getting the VLC 64 bit software for the Mac, and so although the 64 bit version is faster to use, you’re probably better off with 32 bit versions of both for now.

The Process

1) Rip the DVD.

Start RipIt. It will ask for a DVD, insert the DVD.. and point the resultant save location to the desktop. The ripping process takes about 40 minutes on my Macbook, you can check the progress by looking at the icon in the dock – it will be updated with the percentage of progress until completion. You can do other things on your mac while it’s ripping, even though the DVD drive will be occupied. Wait until it’s completed before continuing.

2) Transcode (convert) the ripped video file for use on the iPhone.

Start Handbrake. There are a bunch of transcoding settings called presets – those tell Handbrake what type of media player you want the converted video to work on. In handbrake on the right section of the window, select the iPhone preset. Then go to the file menu, select ‘Open’, and then select the video file that RipIt saved onto your desktop. Then select the destination for the converted video file. Then select the Start (green) button on Handbrake window, and it will start. You can now minimise handbrake and do other things. The transcoding process depends on the film, but takes about an hour on my Macbook. You can check on progress by maximizing the Handbrake window, and checking on the progress bar.

3) Move the converted video file onto your iPhone.

Once that’s done, you will have another media file on your desktop – this is the end result, a video file that will play on your iPhone. Simply connect your iPhone to your Mac, start up iTunes, and drag that file from your desktop into the iPhone icon on your iTunes window. It will take a couple of minutes to transfer, then eject the iPhone as normal

Now you can watch this new movie on your iPhone by going to the ‘Videos’ tab of your iPod app.

Forkbombs and How to Prevent Them

A forkbomb is a program or script that continually creates new copies of itself, that create new copies of themselves. It’s usually a function that calls itself, and each time that function is called, it creates a new process to run the same function.

You end up with thousands of processes, all creating processes themselves, with an exponential growth. Soon it takes up all the resources of your server, and prevents anything else running on it.

Forkbombs are an example of a denial of service attack, because it completely locks up the server it’s run on.

More worryingly, on a lot of Linux distributions, you can run a forkbomb as any user that has an account on that server. So for example, if you give your friend an account on your server, he can crash it/lock it up whenever he wants to, with the following shell script forkbomb:

:(){ :|:& };:

Bad, huh?

Ubuntu server 9.10 is vulnerable to this shell script forkbomb. Run it on your linux server as any user, and it will lock it up.

This is something I wanted to fix right away on all my linux servers. Linux is meant to be multiuser, and it has a secure and structured permissions system allowing dozens of users to log in and do their work, at the same time. However when any one user can lock up the entire server, this is not good for a multiuser environment.

Fortunately, fixing this on ubuntu server 9.10 is quite simple. You limit the maximum number of running processes that any user can create. So the fork bomb runs, but hits this ceiling, and eventually stops without the administrator having to do anything.

As root, edit this file, and add the following line:


*               soft    nproc   35

This sets the maximum process cap for all users, to be 35. The root user isn’t affected by this limit. This limit of 35 should be fine for remote servers that are not offering users gnome, kde, or any other graphical X interface. If you are expecting your users to be able to run applications like that, you may want to increase the limit to 50, and although this will increase the time forkbombs will take to exit, they should still exit without locking up your server.

Alternatively, you can setup an ‘untrusted’ and ‘trusted’ user groups, and assign that 35 limit to the untrusted users, giving trusted users access to the trusted group, which does not have that limit. Use these lines:


@untrusted               soft    nproc   35
@trusted               soft    nproc   50

I’ve tested these nproc limits on 8.10 and 9.10 ubuntu-server installs, but you should really test your own servers install, if possible, by forkbombing it yourself as a standard user, using the bash forkbomb above, once you’ve applied the fix. The fix is effective as soon as you’ve edited that file, but please note that you have to logout, and log back in again as a standard user before the new process cap is applied to your user account.

The Linux Root Directory, Explained

It’s helpful to know the basic filesystem on a Linux machine, to better understand where everything is supposed to go, and where you should start looking if you want to find a certain file.

Everything in Linux is stored in the “root directory”. On a windows machine, that would be equivalent to C:. C: is the main folder where everything is stored. On Linux we call this the “root directory”, or simply “/”. To go up to this root directory, type:

cd /

To list all the folders and files in the root directory, type this:

ls /

Alternatively, if you want to see the folders and files exactly the way I see them below for easy comparison, type this:

ls -lhaFtr --color /

Once you’ve typed in one of the ‘ls’ commands above, you’ll see some information similar to that on the screenshot below.. (please scroll down)..

Ubuntu Linux

Above you can see the files and folders in the root directory of my ubuntu linux server, after I’ve typed ‘ls /’. Ignore everything but the coloured names on the right, those coloured names are the names of the files and folders in this directory. Don’t worry about the shades of different colours either. It’s not really important to explain how they are coloured right now, just to explain the purpose behind each file or folder shown.

So let me explain the purpose behind each of these, in turn. I’ll include the same screenshot multiple times, so you can reference the explanations against it as you scroll down.


– Directory for linux security features, rarely visited by normal users like you or me.


– Traditional directory for the files from removable media, ie USB keys, external hard drives. Not used anymore, it only exists for historical purposes.


– Directory where files and directories end up when they’ve been recovered from a hard disc repair.

 cdrom -> media/cdrom/

– Link the files currently in your CDROM or DVDROM drive.


– New style directory for the files from removable media such as USB keys, external hard drives, etc. This is the new convention, and so you should always use media/ instead of mnt/, above.

vmlinuz.old -> boot/vmlinuz-2.6.31-17-generic

– A backup of your most recent old Linux operating system kernel, ie: your operating system. Don’t delete this =)

initrd.img.old -> boot/initrd.img-2.6.31-17-generic

– Another part of the backup for your most recent old Linux kernel.


– An empty directory reserved for you to put third-party programs and software in.


– Operating system drivers and kernel modules live here. Also contains all system libraries, so when you compile a new program from the source code, it will use the existing code libraries stored here.


– Basic commands that everyone uses, like “ls” and “cd”, live here.


– This is where all user-supplied software should go; ie: software that you install that doesn’t normally come with the operating system. Put all programs here.


– Basic but essential system administration commands that the admin user only uses, ie: reboot, poweroff, etc.

vmlinuz -> boot/vmlinuz-2.6.31-20-generic

– Your actual operating system kernel, ie: the one that is running right now. Don’t delete this.

initrd.img -> boot/initrd.img-2.6.31-20-generic

– Another part of the kernel that is running right now.


– Reserved for Linux kernel files, and other things that need to be loaded on bootup. Don’t touch these.


– Proc is a handy way of accessing critical operating system information, through a bunch of files. Ie: try typing ‘cat /proc/cpuinfo’. That queries the current kernel for the information on your processors (CPUs), and returns the info for you in a text file.


– Like proc/, this is another bunch of files that aren’t files at all, but ‘fake’ files. When you access them, the operating system goes away and finds out information, and offers that information up as a text file to you.


– Device files. In here live the device files for your hard drives, your CD/DVD drives, your soundcard, your network card.. in fact anything you have installed that Linux uses, it has a counterpart in here that is automatically added and removed by the OS. Don’t ever delete, move or rename any of the files here.


– The directory that you’ll use the most. Every user on your Linux machine, except the system administrator, has a folder here. This is where each user is meant to store all their documents. Think of it as the Linux ‘My Documents’ folder.


– This is a catch-all directory for ‘variables’, ie things that the OS has to write to, and vary, as part of its operation. Examples include: email inboxes for all users, cache files, the lock files that are generated and removed as part of normal program execution, and also the /var/www directory. /var/www is a directory you will probably see and use a lot, as it is where all the websites are stored that your linux machine serves when operating as a web server. /var/log is also a very important directory, and contains ‘log’ files which is a kind of “diary” that the linux OS uses to explain exactly what it’s done, as it happens, so you can easily find out what’s been going on by viewing the right log file.


– The space for any and all temporary files. Store files here that you want to throw away quite quickly. Depending on your configuration, all files and folders in the /tmp directory may be deleted on system reboot, or more frequently, perhaps every day.


– This is the system administrators ‘my documents’ folder. Anything that the sysadmin stores, for example: programs that he downloads, are put here. Not accessible to anyone else but the system administrator.


– Configuration files. Any and all program configuration files or information belong here. Think of it like the windows registry, except every registry entry is a text file that you can open up and edit, and also copy, move around, and save. You will typically have to create configuration files yourself sometimes, and put them in this directory. They are almost always simple text files.

And that’s a basic overview of the files and folders in the root directory of your linux machine.

My minimal VIM config

This is the absolute minimum I do when I have to log onto a new server or shell account that I haven’t used before, that I will need to edit text files with.

First I figure out whether VIM is really installed. A lot of installs, especially those based on ubuntu, ship with VI aliased to VIM, but the VIM install is usually not really VIM at all, and behaves exactly like VI but with some minor bugs fixed. This is not what I want.

So first I figure out what distribution of linux I’m using through executing the following command:

cat /etc/issue

Then if it’s ubuntu, which doesn’t ship with the full VIM package on a lot of default installs, then I usually do this, presuming I have admin access. In practice I usually have admin access because people are generous with this when they want you to fix their server =) Anyway, if I have admin access, I install ubuntu’s ‘vim full’ package, which is aliased as ‘vim’:

sudo apt-get install vim

Now I can move onto my config. Occasionally there will be a global system config, but I probably want to override that anyway. So I create a vim configuration file specific to me in my home directory:

set bg=dark
set backspace=2

The first line sets the background to be dark, so I can see what is going on when I use a dark terminal program, such as putty, mac osx’s terminal.. in fact nearly all terminal programs use a dark background, so this setting is almost compulsory.

The second line configures the behaviour of the backspace key, so when I go the the start of a line, and press backspace, it adopts the wordprocessor conventional behaviour of skipping to the above line. Otherwise it uses the default VI behaviour, which is probably not intuitive at all to anyone who didn’t grow up on UNIX mainframes and such.

The very existence of a user-supplied configuration file will also jolt the VIM editor into ‘non compatible mode’, where it figures out automatically that it should be doing all the advanced VIM things, instead of just acting as a VI replacement. This should mean that if you create a config file, syntax highlighting is already turned on, another must for me. Otherwise you can explicitly set it with the line ‘syntax on’, but I never have to do this anymore.

And that’s it.

Using the Linux command ‘Watch’ to test Cron jobs and more

OK, so you have added a cron job that you want to perform a routine task every day at 6am. How do you test it?

You probably don’t want to spend all night waiting for it to execute, and there’s every chance that when it does execute, you won’t be able to find out whether it is executing properly – the task might take 30 minutes to run, for example. So every time you debug it and want to test it again, you have to wait until 6am the following day.

So instead, configure that cron job to run a bit earlier than that, say in 10 minutes, and monitor the execution with a ‘watch’ command, so you can see if it’s doing what you want it to.

‘watch’ is a great command that will run a command at frequent intervals, by default, every 2 seconds. It’s very useful when chained with the ‘ps’ command, like the following:

watch 'ps aux | grep bash'

What that command will do, is continually monitor your server, and maintain an updated list that changes every 2 seconds, of every instance of the bash shell. When someone logs in and spawns a new bash shell, you’ll know about it. When a cron’d command runs that invokes a bash shell before executing a shellscript, you’ll know about it. When someone writes a badly written shell script, and runs it invoking about 100 bash shells by accident, flooding your servers memory, you’ll know about it.

OK so back to the cron example. Suppose I’m testing a cronjob that should invoke a shell script that runs an rsync command. I just set the cron job to run in 5 minutes, then run this command:

watch 'ps aux | grep rsync'

Here is the result.. every single rsync command that is running on my server is displayed, and the list is updated every 2 seconds:

Every 2.0s: ps aux | grep rsync                                              Sat Mar 13 15:59:35 2010

root     16026  0.0  0.0   1752   480 ?        Ss   15:28   0:00 /bin/sh -c /opt/remote/rsync-matt/cr
root     16027  0.0  0.0   1752   488 ?        S    15:28   0:00 /bin/sh /opt/remote/rsync-matt/crond
root     16032  0.0  0.1   3632  1176 ?        S    15:28   0:00 rsync -avvz --remove-source-files -P
root     16033  0.5  0.4   7308  4436 ?        R    15:28   0:09 ssh -l david someotherhost rsync --se
root     16045  0.4  0.1   4152  1244 ?        S    15:28   0:07 rsync -avvz --remove-source-files -P
root     18184  0.0  0.1   3176  1000 pts/2    R+   15:59   0:00 watch ps aux | grep rsync
root     18197  0.0  0.0   3176   296 pts/2    S+   15:59   0:00 watch ps aux | grep rsync
root     18198  0.0  0.0   1752   484 pts/2    S+   15:59   0:00 sh -c ps aux | grep rsync

Now I can see the time ticking away, and when the cron job is run, I can watch in real-time as it invokes rsync, and I can keep monitoring it to make sure all is running smoothly. This proves to be very useful when troubleshooting cron jobs.

You can also run two commands at the same time. You can actually tail a log file and combine it with the process monitoring like so:

watch 'tail /var/log/messages && ps aux | grep rsync'

Try this yourself. It constantly prints out the last ten lines of the standard messages log file every two seconds, while monitoring the number of rsync processes running, and the commands used to invoke them. Tailor it to the cron’d job you wish to test.

Watch can be used to keep an eye on other things also. If you’re running a multi-user server and you want to see who’s logged on at any one time, you can run this command:

watch 'echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never'

This chains 5 commands together. It will keep you updated with the current list of users logged in to your system, and it will also give you a constantly updated list of those users who have ever logged in before, with their last login time.

The following shows the output of that command above on a multi-user server I administrate, and will refresh with current information every 2 seconds until I exit it:

Every 2.0s: echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never                                                             Sat Mar 13 07:48:32 2010

mark     tty1         2010-02-23 11:08
david    pts/2        2010-03-13 07:48 (wherever)
mike     pts/4        2010-02-26 07:53 (wherever)
mike     pts/5        2010-02-26 07:53 (wherever)

Username         Port     From           Latest
mark               pts/6    wherever      Thu Mar 11 23:24:36 -0800 2010
mike               pts/0    wherever      Sat Mar 13 03:54:28 -0800 2010
dan                pts/4    wherever      Fri Jan  1 08:46:29 -0800 2010
sam                pts/1    wherever      Sat Jan 30 08:06:01 -0800 2010
rei                pts/2    wherever      Thu Dec 10 11:45:39 -0800 2009
david              pts/2    wherever      Sat Mar 13 07:48:05 -0800 2010

This shows that mark, david and mike are currently logged on. Mark is logged in on the server’s physical monitor and keyboard(tty1). Everyone else is logged in remotely. Mike currently has two connections, or sessions, on the server. We can also see the list of users that have logged in before – ie: are active users, and when they last logged on. I immediately notice, for example, that rei hasn’t logged in for 4 months and probably isn’t using her account.

(Normally this command will also provide IP addresses and hostnames of where the users have logged on from, but I’ve replaced those with ‘wherever’ for privacy reasons)

So.. you can see that the ‘watch’ command can be a useful window into what is happening, in real-time, on your servers.

Regex in VIM.. simple

There are more than a gazillion ways to use regexs. I am sure they are each very useful for their own subset of problems. The sheer variety can be highly confusing and scary for a lot of people though, and you only need to use a few approaches to accomplish most text-editing tasks.

Here is a simple method for using regex in the powerful text editor VIM that will work well for common use.


We are going to take the “search and delete a word” problem for an example. We want to delete all instances of the singular noun “needle” in a text file. Let’s assume there are no instances of the pluralisation “needles” in our document.

  1. Debug on.. turn some VIM options on
    :set hlsearch
    :set wrapscan
    – this will make all regex expressions possible to debug by visually showing what they match in your document (first line) and make all searches wrap around instead of just search forward from your current position, which is the default. (second line)
  2. Develop and Test.. your regex attempts by using a simple search. Here we see three attempts at solving the problem: :/needl
    – our third try is correct, and highlights all words that spell “needle”. The < and > markers allow you to specify the beginning and the end of a word. Play with different regexs using the simple search and watching what is highlighted, until you discover one that works for you.

  3. Run… your regex:%s/<needle>//g – once you’ve figured out a regex, run the regex on your document. This example will execute a search for the word “needle” and delete every one. If you wanted to substitute needle for another word, you would put the word in between the // marks. As we can see, there is nothing between the marks in this example, so it will replace instances of “needle” with nothing. This means it will serve to delete every instance of the word “needle”.
  4. Check things are OK… with your document :/<needle>
    – has the regex done what you want? Use the search function to see if regex has done what you wanted it to do. The above examples show different searches through the document to see if different variations remain. Any matches of these searches will highlight any problems. You can use the lower-case N(next search result) and lower-case P(previous search result) commands to navigate through any found search results. You must remember to manually look through the document and see what the regex has changed, make sure there aren’t any unwanted surprises!
  5. Recover… from any mistakes u – just press the U key (with no capslock or shift). This will undo the very last last change made to the document.
  6. Redo… any work that you need to <ctrl>-r – use the redo fuction; press the CONTROL and R keys together (with no capslock or shift). This will redo the last change made to the document.
  7. Finish up and Write… to file :w – write your work on the document to file. Even after you have written out to file, you can probably still use the undo function to get back to where you were, but it’s best practice to not rely on this, and only write once you’re done.
  8. Debug off.. turn some options off
    :set nohlsearch
    :set nowrapscan
    – turn off the regular expression highlighting (line 1). turn off the wraparound searching (line 2). You can leave either or both options on if you want, they’re often useful. Up to you.

Use a combination of these wonderful commands to test and improve your regex development skills in VIM.


Here I use the shorthand “#…” to denote comments on what I’m doing… if you want to copy and paste the example as written, then you will have to remove those comments.

1. Remove ancient uppercase <BLINK> tags from a document.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BLINK> # try 1.. bingo! first time.. selected all tags I want
:%s/<BLINK>//g # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<BLINK> # check 3.. yep looks ok... the problem tags are gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

2. Oh no! We missed some lower and mixedcase <bLiNK> tags that some sneaky person slipped in. Let’s take them out.

:set wrapscan # debug on
:set hlsearch # debug on
:/<blink> # try 1.. hm.. worked for many, but didnt match BlInK or blINK mixedcase
:/<blink>/i # try 2.. much better.. seems to have worked!
:%s/<blink>//i # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<blink> # check 3.. yep thats fine.
:/<blink>/i # check 4.. looks good... problem solved
# ...manual scroll through the document.. looks much better!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

3. Replacing uppercase or mixedcase <BR> tags with the more modern <br>.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BR> # try 1.. hmm.. just uppercase.. not gonna work..
:/<br> # try 2.. hmm.. just lowercase..
:/<BR>/i # try 3.. ahh.. that'll be it then
:%s/<BR>/<br>/gi # lets execute my regex substitution
:/BR # check 1.. testing things are OK in my file by searching through..
:/br # check 2.. yep thats ok..
/bR # check 3 ..yup..
:/<BR>/i # check 4.. yep looks ok... the problem tags seem to be gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

For More..

Regexs are the gift that just keeps on giving. Here are some good resources on regexs in general, and regexs in VIM.

VirutalHosts on CentOS

A common task when setting up an Apache webserver under Linux, is writing a httpd.conf file. The httpd.conf file is the main configuration file for Apache. One of the main reasons to edit the httpd.conf file is to setup virtual hosts In Apache. A Virtual host configuration allows several different domains to be run off a single instance of Apache, on a single IP. Each host is a ‘Virtual host’, and typically has a different web root, log file, and any number of subdomains aliased to it. The virtualhosts are configured in parts of the httpd.conf file that look like this:

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common

Now on Ubuntu, virutalhosts are made easy. The httpd.conf is split into several files. Each virutalhost has a different file in /etc/apache2/sites-available. When you want to activate a particular vitualhost, you create a symbolic link from /etc/apache2/sites-enabled/mysiteto /etc/apache2/sites-available/mysite (if you wanted to call your site configuration file ‘mysite’). When apache boots up, it loads all the files it can find in /etc/apache2/sites-available/* and that determines which virutalhosts it loads. If there is not a link from /etc/apache2/sites-available/ to your virutalhost file, it won’t load it. So you can easily remove the links in /etc/apache2/sites-available without deleting the actual virutalhost file. Therefore you can easily toggle which virtualhosts get loaded.

CentOS uses a different structure. Everything is lumped into /etc/apache/httpd.conf. So there is no way to easily toggle virutalhosts on/off, and everything is a bit more chaotic. I’ve just had to setup a new CentOS webserver, and I struggled for a bit after being used to ubuntu-server. Here’s a format you can use if you’re in the same boat, and you have to setup httpd.conf files for CentOS:

NameVirtualHost *:80 # this is eseential for for name-based switching

# an example of a simple VirtualHost that serves data from 
# /var/www/html/ to anyone who types in
# to the browser

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common 

# an example of a VirutalHost with apache overrides allowed, this means you can use
# .htaccess files in the servers web root to change your config dynamically

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common
      AllowOverride All
      AllowOverride AuthConfig
      Order Allow,Deny
      Allow from All

# an example of a VirutalHost with apache overrides allowed, and two subdomains
# (mail and www) that both point to the same web root

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common
      AllowOverride All
      AllowOverride AuthConfig
      Order Allow,Deny
      Allow from All

# .. etc

With the above structure, you can add as many VirutalHosts to your configuration as you have memory to support (typically dozens). Apache will decide on which to choose based on the ‘ServerName’ specified in each VirtualHost section. Just remember to add that all-important NameVirtualHost: *:80 in the beginning.

Once you’ve got your httpd.conf file the way you like it, be sure to test it before you restart apache. If you restart apache and your httpd.conf file has errors in it, Apache will abort the load process. This means that all the websites on your webserver will fail to load. I always use apachectl -t or apache2ctl -t before I restart. That will parse the httpd.conf file and check the syntax. Once that’s OK, then you can issue a /etc/init.d/httpd restart to restart Apache.

Scraping Wikipedia Information for music artists, Part 2

I’ve abandoned the previous Wikipedia scraping approach for, as it was unreliable and didn’t pinpoint the right Wikipedia entry – ie: a band called ‘Horses’ would pull up a Wikipedia bio on the animal – which doesn’t look very professional. So instead, I have used the Musicbrainz API to retrieve some information on the artist; the homepage URL, the correct Wikipedia entry, and any genres/terms the artist has been tagged with.

It would be simple to extend this to fetch the actual bio from a site like (which provides XML-tagged Wikipedia data), now that you always have the correct Wikipedia page reference to fetch the data from.

(You will need to download the Musicbrainz python library to use this code):

import time
import sys
import logging
from musicbrainz2.webservice import Query, ArtistFilter, WebServiceError
import musicbrainz2.webservice as ws
import musicbrainz2.model as m

class scrapewiki2(object):

  def __init__(self):

  def getbio(self,artist):

    art = artist
    logger = logging.getLogger()

    q = Query()

      # Search for all artists matching the given name. Limit the results
      # to the 5 best matches. The offset parameter could be used to page
      # through the results.
      f = ArtistFilter(name=art, limit=1)
      artistResults = q.getArtists(f)
    except WebServiceError, e:
      print 'Error:', e

    # No error occurred, so display the results of the search. It consists of
    # ArtistResult objects, where each contains an artist.

    if not artistResults:
      print "WIKI SCRAPE - Couldn't find a single match!"
      return ''

    for result in artistResults:
      artist = result.artist
      print "Score     :", result.score
      print "Id        :",
        print "Name      :",'ascii')
      except Exception, e:
      print 'Error:', e

    print "Id         :",
    print "Name       :",

    # Get the artist's relations to URLs (m.Relation.TO_URL) having the relation
    # type ''. Note that there could
    # be more than one relation per type. We just print the first one.
    wiki = ''
    urls = artist.getRelationTargets(m.Relation.TO_URL, m.NS_REL_1+'Wikipedia')
    if len(urls) > 0:
      print 'Wikipedia:', urls[0]
      wiki = urls[0]

    # List discography pages for an artist.
    disco = ''
    for rel in artist.getRelations(m.Relation.TO_URL, m.NS_REL_1+'Discography'):
      disco = rel.targetId
      print disco

      # The result should include all official albums.
      inc = ws.ArtistIncludes(
        releases=(m.Release.TYPE_OFFICIAL, m.Release.TYPE_ALBUM),
      artist = q.getArtistById(, inc)
    except ws.WebServiceError, e:
      print 'Error:', e

    tags = artist.tags

    toret = ''
      toret = ''+art+' Wikipedia Articlen'
      toret = toret + ''+art+' Main Siten'
      toret = toret + '
Tags: '+(','.join(t.value for t in tags))+'n' return toret sw2 = scrapewiki2() # unit test print sw2.getbio('Blur') print sw2.getbio('fatboy slim')

Apologies to the person that left several comments on the previous wikipedia scraping post, I have disabled comments temporarily for now due to heavy amounts of spam, but you can contact me using the following address: (subsitute first two @s for ‘.’s ). I also hope this post answers your question.

Scraping artists bios off of Wikipedia

I’ve been hacking away at and I’ve been looking for a way of automatically sourcing biographical information from artists, so that visitors are presented with more information on the event.

The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab artist bio, event listings, youtube vidoes and flickr pictures of the currently playing artist. I was reading through the mashTape code, and then found this posting by its developer, which helpfully provided the exact method I needed.

I then hacked up two versions of the code, a PHP version using simpleXML:

    if($ar[2] == ''){
      $wikikey = $ar[4]; // more than likely to be the wikipedia page
      return ""; // nothing on wikipediea
    $url = "$wikikey";
    $x = file_get_contents($url);
    $s = new SimpleXMLElement($x);
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
     return $b[0];

and a pythonic version using the amara XML library (has to be installed seperately):

import amara
import urllib2
from urllib import urlencode

def getwikikey(band):
  url = ""+band+"%22&";
  print url
  doc = amara.parse(f)
  url = str(doc.ResultSet.Result[0].Url)
  return url.split('/')[4]

def uurlencode(text):
   """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
   blah = urlencode({'u':text})
   blah = blah[2:]
   return blah

def getwikibio(key):
  url = ""+str(key);
  print url
  except Exception, e:
    return ''
  doc = amara.parse(f)
  b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
    r = str(b[0])
  except Exception, e:
    return ''
  return r

def scrapewiki(band):
    key = getwikikey(uurlencode(band))
  except Exception, e:
    return ''
  return getwikibio(key)

  #unit test
  #print scrapewiki('guns n bombs')
  #print scrapewiki('diana ross')

There we go, artist bio scraping from wikipedia.

A poor man’s VMWare Workstation: VMWare Server under Ubuntu 7.10 + VMWare Player under Windows XP

I finally setup my Dell Lattitude D630 laptop the way I wanted it last night, and thought I’d do a quick writeup about it. Here is the parttition table:

  1. A 40GB Windows XP partition, with VMWare Player installed, which I will be using for Windows applications that don’t play well in virtualised mode (eg media applications). I will also be using it as the main platform for running VMs.
  2. A basic 5GB root + 1.4GB swap 7.10 Ubuntu server partition, with VMWare Server installed (for creating, advanced editing and performing network testing on VMs). I used these VMWare server on Ubuntu 7.10 tutorials.
  3. A 36GB NTFS partition for storing VMs
  4. A 26GB NTFS media partition for media I want to share between VMs and the two operating systems on the disc.

We use VMWare servers at work to host our infrastructure, so this setup will be very useful for me. I can now:

  1. Take images off the servers at work and bring them up, edit them and test their network interactions under my local VMWare Server running on my Linux install.
  2. From within my windows install, I can bring up a Linux VM and use Windows and Linux side by side.

My Nabaztag

Nabaztag (Armenian for “rabbit”) is a Wi-Fi enabled rabbit, manufactured by Violet. The Nabaztag is a “smart object”; it can connect to the Internet (for example to download weather forecasts, read its owner’s email, etc). It is also fully customizable and

Here is our Nabaztag – Francois Xavier:

Meet Francois

Of course, I’ve been messing around with poor old Francois’s programming..

With the help of OpenNab, a proxy server that masquerades as an official Nabaztag server, you can make your Nabaztag do all kinds of tricks. At the moment I’m getting him to read out what’s currently showing on TV when someone presses his button.

Francois in the Surgery

Here’s the technical details:

Whenever the button is pressed on Francois, he sends a message destined for the official Nabaztag server, that is caught by the proxy server. The proxy server then executes a PHP script. This PHP script grabs the current TV listings from a RSS feed, composes them into a readable list, and then sends the list to the official Nabaztag server, which converts the list (using a text-to-speech synthesis program) into audio files, and streams those audio files to Francois.

The result is a rabbit that reads the TV listings. A useful addition to our TV room.

Coming soon: More technical information on how you can do this yourself.