2013 Career Retrospective

2015 Update: The “Device Hive” project has since been renamed ‘Hive CI’ and is maintained by a team of developers at the BBC, which I am no longer part of. It is in the process of being open-sourced, see: http://bbc.github.io/hive-ci/

This year has been quite a busy and eventful one for me.

Connected Red Button
At the start of the year, I was working on the Connected Red Button team within the BBC. Connected Red Button is a major ongoing project in the Television and Mobile Platforms department at BBC North. Its aim is to replace the classic Red Button text service (which itself is the successor to Ceefax) with a new updated all-singing all-dancing interactive portal to internet content, available on Smart TVs and modern set top boxes. Currently Connected Red Button is live and accessible by pressing the Red Button on the new Virgin Media TiVo boxes. You can access the latest version of iPlayer, and the BBC News and BBC Sport smart TV apps from within one easy portal.

On CRB, I was working on the Java/Spring services layer, which connects to the various APIs of services like iPlayer that we have at the BBC, and gets all the content ready for the frontend. This data then gets passed to the very nice looking AS2 frontend to display, and that’s how it appears on your TV that is connected to your Virgin TiVo box in your living room.

The next version of CRB is being developed for Smart TVs with HTML browsers (so the frontend is in HTML5/JS instead of AS2). This type of Smart TV includes most of the new smart TVs that have come out recently, and will continue to be released in the future. The BBC (and the wider industry) is really anticipating that most TVs will be smart TVs in 5-10 years, and so the reach of Connected Red Button HTML will increase substantially so that most of the audience can be served by new applications that run on smart TVs.


Smartbridge is the transitional frontend that is displayed to both 1) users that have smart TVs and internet connected STB (Set Top Boxes) capable of running our latest BBC applications such as Connected Red Button and the latest versions of iPlayer, and also 2) our traditional users that still have normal (un-smart) TVs that can only receive Broadcast Red Button (the service you get by pressing the red button on any BBC channel). Smartbridge is not a branded BBC product, it is the behind the scenes magic that helps ensure that we maintain the availability of traditional Red Button services as we simultaneously launch and develop Connected Red Button.

For several months this year I was working on Smartbridge, getting the project released and out the door. That meant Java/Spring/Hibernate work, some MySQL database tinkering, and some broadcast work configuring and testing the TVs that worked with Smartbridge. It was successfully released in October.

Device Hive

Device Hive is the working name for a BBC system that is an Android and iOS emulator and physical device testing platform. A server will run the Device Hive software, and mobile developers will be able to plug their Android mobile or tablet, iPhone or iPad into the server via a USB cable, and choose an application to run on it, such as BBC iPlayer or BBC Radio Player. This application will then automatically be downloaded onto the device, and the automated Cucumber/Calabash test suite will be executed, which will step through every screen of the mobile application, triggering buttons, scrolling up and down and generally exercising every aspect of the mobile application. There will also be an option to install and run applications on Android or iOS emulators, so we could have 10 emulators running at once, each running a different segment of the automated test suite and uploading its test results to a logging server.

Device Hive will mean, in particular, that we can test BBC mobile applications on the plethora of Android devices available. Every make and model that we own, across the different OS/hardware combinations, can be plugged into a Device Hive server, so we can see test runs for BBC iPlayer Android across all the different variations. This will help us target a wider range of Android devices for new BBC iPlayer features, which should ease the anger that some of our audience members feel when their specific Android iPlayer experience is not as good as it is on later models.

In October I moved departments from Television and Mobile Platforms to POD Test, and joined the new Device Hive team as lead developer. I am working in Ruby/Rails/Rspec/Cucumber and using Ubuntu Linux VMs and lots of Android and iOS devices to build up the system.

University Engagement

I have been continuing to work with Manchester University’s Ultimate Programming Society to organise and present talks to the students about working practices in software development that the BBC use. We have covered Behaviour Driven Development, Test Driven Development, Editors and IDEs and Agile Development Practices so far.

Generally I feel that I have worked on some pretty challenging projects this year, and I am very happy with being the lead developer on Device Hive, and look forward to making this project as useful and as powerful as I think it can be.

Android Debug Bridge failing to detect emulators under OSX


I’ve been working on a project at the BBC where we are using the Android command-line tools from the Android Developer Tools, to spin up and terminate series of emulators. I noticed a big problem under OSX where ‘adb devices’ was failing to register emulators occasionally when we started them up, without any error message, even though they were loaded and quite clearly running in a window on OSX. This was a real problem for our project because we needed absolute parity between emulator process being launched and subsequently being detected by adb.

We switched to using adb with emulators in an Ubuntu 12.04 VM running under OSX, and we had no further problems with our setup. Emulators are now programmatically launched and torn down by our monitoring application. We now have an array of emulators which we can deploy to at will, which is very useful.
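The parity check described above, between the emulators that were launched and the emulators adb reports, can be sketched in Python 3. This is a minimal illustration and not code from our monitoring application: it assumes the usual `adb devices` output of a header line followed by one "serial, tab, state" line per device, and the function names and sample output are made up for the example.

```python
def parse_adb_devices(output):
    """Map each serial listed by `adb devices` to its reported state."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header line and adb daemon start-up messages
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def missing_emulators(launched_serials, adb_output):
    """Return the launched serials that adb does not report as 'device'."""
    detected = parse_adb_devices(adb_output)
    return [s for s in launched_serials if detected.get(s) != "device"]


# Hypothetical output: three emulators launched, only one fully online
sample = "List of devices attached\nemulator-5554\tdevice\nemulator-5556\toffline\n"
launched = ["emulator-5554", "emulator-5556", "emulator-5558"]
print(missing_emulators(launched, sample))
```

A monitoring loop could feed the real output of `adb devices` into `missing_emulators` after each launch and restart anything it returns.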

I don’t know what caused this problem. My only hunch is that the Android toolkit was probably developed in a very Linux-heavy environment, and so adb on Linux was probably its first testing platform. All I can say is that, in our experience, Linux has been much more stable than OSX for Android emulation, even when running as a VM.

The Haiku Machine


I found this awesome cut-up poetry generator, which takes the text of famous poets and builds structured poetry out of it. The guy that made it even developed the underlying algorithm as a research project. I have put a version of it on a free Amazon EC2 instance, written a little Twitter bot in node.js, and wired the poetry generator up to the bot, and now I have this: https://twitter.com/haikumachine (a Twitter bot that posts a haiku every five minutes, derived from Dylan Thomas’s poetry).

It could be improved, and there are sometimes erroneous tweets where the syllables aren’t counted quite right, or some of the punctuation doesn’t make sense once cut up, but damnit, it’s a bot that writes Haikus.


256 Color VIM on Crunchbang Waldorf

To get 256 colors working in VIM within terminator on Crunchbang Waldorf, I had to do the following:

  1. Add to ~/.bashrc
    export TERM=xterm-256color
  2. Install a 256 color VIM colorscheme, see desert256 for example.
  3. Add the following to ~/.vimrc:
    set t_Co=256
    set t_AB=^[[48;5;%dm
    set t_AF=^[[38;5;%dm

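    ‘t_Co’ specifies exactly how many colours VIM can use. The other two lines set the escape sequences VIM sends to select a background (t_AB) and foreground (t_AF) colour from the 256 colour palette. The ^[ at the start of each is a single literal Escape character, not a caret followed by a bracket.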

  4. If you want 256 color VIM for your root user when you sudo edit, then edit /usr/share/vim/vimrc and copy across your settings from your local ~/.vimrc and ~/.vim to this global environment.
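To check that the terminal really is showing 256 colours before blaming VIM, the same escape sequences from step 3 can be printed directly. This is a small Python 3 sketch; the function names are my own, and it assumes a terminal that understands the xterm-style `48;5;N` and `38;5;N` colour codes.

```python
import sys

ESC = "\x1b"  # the Escape character that shows up as ^[ in the vimrc lines
RESET = ESC + "[0m"  # switches all colour attributes back off


def background(n):
    """Escape sequence selecting background colour n (0-255)."""
    return "%s[48;5;%dm" % (ESC, n)


def foreground(n):
    """Escape sequence selecting foreground colour n (0-255)."""
    return "%s[38;5;%dm" % (ESC, n)


def palette_rows(width=16):
    """Build rows of numbered, coloured cells covering all 256 colours."""
    rows = []
    for start in range(0, 256, width):
        cells = ["%s%4d%s" % (background(n), n, RESET)
                 for n in range(start, start + width)]
        rows.append("".join(cells))
    return rows


if __name__ == "__main__":
    sys.stdout.write("\n".join(palette_rows()) + "\n")
```

If the terminal is set up correctly you should see 16 rows of distinct colours; if many of the cells look identical, the terminal is probably still running in 8 or 16 colour mode.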

Subversion 1.7 on Crunchbang Waldorf

I use the excellent http://www.smartsvn.com/ client from WANdisco. WANdisco have been releasing new open-source versions of SVN to the public with new improved reliability, and the client uses one of these versions, 1.7, to offer better performance.

Unfortunately if you choose to upgrade your entire repository to 1.7, this breaks compatibility with the default commandline SVN client on Waldorf which I like to use as well as Smart SVN, for quick ‘svn up’s and other commandline magic.

This means I have to download the latest commandline SVN client, the 1.7 version of subversion for Linux, available for free on the WANdisco site.

Unfortunately, you can’t install this version on the version of Debian that Crunchbang Waldorf is based on. There are broken dependencies on an old version of libsvn1, which is a requirement for another package that is part of the Debian base install.

Eventually I found this really helpful page, whose instructions worked for me on Waldorf:


iTerm for OSX for a Colourful Terminal Experience

Screen Shot 2013-06-11 at 17.06.36

iTerm is much better than the standard OSX terminal client, not least because it is compatible with xterm-256color terminal emulation. xterm-256color emulation will give your terminal access to 256 colours instead of the usual 16. Much better, not just for looking pretty, but for distinguishing between different types of data in an editor like VIM or even in Cucumber output (see picture above). It’s also free.


Once installed, you will have to go into the preferences and set your ‘Report Terminal Type’ to ‘xterm-256color’. Then things should be more colourful. Then install a 256 color compatible theme in VIM to make use of that extra capacity. You can also edit your prompt and use 256 colour escape sequences, if you wish.

Tailing a log file and Running an Application at the Same Time


A quick tip, but a useful one. You can tail a log file in the background while running a script in the foreground. For example, I frequently do the following:

1. Start the tail in the background, then restart Apache:

tail -f /var/log/httpd.log &
/etc/init.d/apache restart

2. The log file will spool onto the terminal as Apache is restarted.

3. Once you are finished viewing the log file, bring the tail process back to the foreground:

fg

Then terminate the foregrounded log tail with a control-c.

With this technique you can run as many commands as you want, and see the real-time effects on your log file, without having to open a new terminal. You will also see your program output interspersed with your log file output, which can be helpful when tracing down particular problems.

Monitoring a Slow Internet Connection in OSX


I am currently on holiday in Tenerife, and although I really like it here, one thing I do not like is the internet connection we have in our resort. Sometimes networked applications will just hang with no warning and there will be minutes where it’s not clear what is going on. Here are some ways you can find a little bit more about what is happening when an application is slow or seems to hang when you have a poor internet connection. Execute the following commands each in a separate terminal window.

Log Files

tail -f /var/log/*

This will give you an indication of what is happening in OSX. For example, I was installing the XCode Command Line Utils from within XCode. The installation progress information is severely lacking, it just shows a bar which moves from left to right. However I was able to find out what was happening by tailing the log files in /var/log, which provided me with an updated breakdown of the installer progress. You can exit tail by using Control-C which will return you to the shell.

Constant Ping

ping www.google.com

When I have problems with my internet connection, I always keep a ping running in the background in a terminal. The interesting information here is the ICMP RTT time shown as the milliseconds next to the ‘time’ label, and how many packets were dropped shown by the number of ‘request timeout’ messages. Google does not mind you pinging it, just like hundreds of thousands of other people do, and so you can keep this up constantly, monitoring problems with your internet connection. When you get ‘no route to host’ printed, this usually means that your gateway or wireless connection is down, which means you usually have to reestablish a connection manually.
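If you save the ping output to a file, the two numbers mentioned above (round trip time and dropped packets) are easy to summarise. Here is a rough Python 3 sketch; it assumes the line formats printed by the OSX ping command (`time=... ms` on replies, `Request timeout for icmp_seq N` on drops), and the sample lines are invented for illustration.

```python
import re

TIME_RE = re.compile(r"time=([\d.]+) ms")


def summarise_ping(lines):
    """Return (replies, timeouts, average RTT in ms or None) for ping output lines."""
    rtts = []
    timeouts = 0
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            rtts.append(float(match.group(1)))
        elif line.lower().startswith("request timeout"):
            timeouts += 1
    average = sum(rtts) / len(rtts) if rtts else None
    return len(rtts), timeouts, average


# Hypothetical sample in the format printed by ping on OSX
sample = [
    "64 bytes from 173.194.34.80: icmp_seq=0 ttl=54 time=40.0 ms",
    "Request timeout for icmp_seq 1",
    "64 bytes from 173.194.34.80: icmp_seq=2 ttl=54 time=60.0 ms",
]
print(summarise_ping(sample))  # (2, 1, 50.0)
```

A high timeout count with a reasonable average usually points at packet loss rather than a uniformly slow link.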


sudo tcpdump -i en0 -vvv

Do you really want to see what is happening on your computer’s network connection? Open the floodgates then, and use tcpdump. This will output information on each packet that your computer sends and receives, in a slightly Matrix-style torrent of information. If you are downloading something via an application or have a number of active web connections such as AJAX Facebook pages loaded, you would expect to see a lot of traffic. If you don’t have a lot of traffic, and you’re expecting a lot, then something may be wrong. You can use tcpdump to get a general feel for what data is being passed around, and to which IP addresses, which you can then look up later for more clarity. You can also use grep and some basic TCP/IP networking knowledge to find out exactly what is happening at the network level.

Network Connection Status of Each Application

sudo lsof -i tcp

Want to find out which applications are using your internet connection, and the state of each TCP connection? Use lsof. You will be given the name of the application that is using each TCP connection, the IP address it is connected to, and the TCP connection state (established is good, time wait can be a problem sign). Run this regularly to check on the connection state of your programs. This won’t monitor UDP connections, but should cover your web browsers.
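When there are a lot of connections, it can help to count them per application and state rather than read the raw listing. This is a small Python 3 sketch that assumes the usual lsof layout, with the command name in the first column and the TCP state in brackets at the end of the line. The sample listing is invented.

```python
from collections import Counter
import re

STATE_RE = re.compile(r"\((\w+)\)\s*$")


def count_tcp_states(lsof_output):
    """Count connections per (command, TCP state) from `lsof -i tcp` output."""
    counts = Counter()
    for line in lsof_output.splitlines()[1:]:  # the first line is the column header
        match = STATE_RE.search(line)
        if not match:
            continue  # no state shown on this line
        command = line.split()[0]
        counts[(command, match.group(1))] += 1
    return counts


# Hypothetical listing in the lsof column layout
sample = """COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
firefox  312 david  45u  IPv4 0x1234      0t0  TCP 192.168.1.5:51000->93.184.216.34:http (ESTABLISHED)
firefox  312 david  46u  IPv4 0x1235      0t0  TCP 192.168.1.5:51001->93.184.216.34:http (ESTABLISHED)
Dropbox  401 david  20u  IPv4 0x1236      0t0  TCP 192.168.1.5:51002->108.160.162.1:https (CLOSE_WAIT)
"""
for (command, state), total in sorted(count_tcp_states(sample).items()):
    print(command, state, total)
```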

Hopefully this information will give you a bit more insight into what is actually happening on your OSX machine when your internet connection is being unreliable and you want more information about what is going on. Once you have this information, you can use it to inform actions such as toggling the wireless off and on again to reestablish a connection, reloading webpages that have hung, restarting application downloads, or possibly finding a new hotel or resort with a better internet connection 🙂

TDD Talk


Recently, two colleagues from the BBC and I ran a session on Test Driven Development at the Manchester University ‘Ultimate Programming’ society. The society is a gathering where students discuss cool things they have done with programming, and occasionally have guest speakers from industry. I found the society online and thought it would be great to get the BBC more involved in the local university happenings.

It is the first outreach project that I have undertaken, and it required a lot of preparation. Our initial idea was to get students to implement the A* search algorithm in a practical session, using TDD. However, after we had each implemented our own copy of the algorithm and found it had taken several hours apiece, we realised there would not be enough time in the 2 hour slot that we had.

Instead then, we went back to the tried and tested FizzBuzz example, which is how I learned TDD at the BBC. This was nice and simple and relatively straightforward to implement in an hour practical session. The task was to implement FizzBuzz using write-the-tests-first TDD process, and we gave approximately 1 hour for the students to undertake this task. For the rest of the time we were going through our presentation and talking about how we use TDD and other development concepts at the BBC.
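For anyone who has not seen the exercise, here is roughly what the end state of a tests-first FizzBuzz looks like in Python 3 with the standard unittest module. This is my own illustrative sketch, not the material we handed out in the session. In practice each test is written first and seen to fail, and only then is `fizzbuzz` extended just enough to make it pass.

```python
import unittest


def fizzbuzz(n):
    """Return 'Fizz', 'Buzz', 'FizzBuzz' or the number itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


class FizzBuzzTest(unittest.TestCase):
    # The tests are listed in the order they would be written during the exercise

    def test_plain_number_is_returned_as_string(self):
        self.assertEqual(fizzbuzz(1), "1")

    def test_multiple_of_three_is_fizz(self):
        self.assertEqual(fizzbuzz(3), "Fizz")

    def test_multiple_of_five_is_buzz(self):
        self.assertEqual(fizzbuzz(5), "Buzz")

    def test_multiple_of_three_and_five_is_fizzbuzz(self):
        self.assertEqual(fizzbuzz(15), "FizzBuzz")
```

Saved as a file, the tests can be run with `python -m unittest` followed by the module name.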

The session overall went quite well, and although it showed to me how difficult it is to present in front of a group of people for an hour, we had good feedback, and I think we really gave the students a different perspective on how to write code, one that a large section of them would not have been exposed to in a standard CS curriculum. We aim to do other talks, starting with the next one, which will be a session on how we use BDD (Behaviour Driven Development) at the BBC.

Here are the slides we put together for the presentation:

Here is the model solution to the TDD exercise, written by my colleague Jack Palfry:

Converting a single M2V frame into JPEG under OSX

I needed to view a single frame of an m2v file that had been encoded by our designers for playing out on TV. The file was named .mpg, but it was actually a single .m2v frame renamed to .mpg. Windows Media Player classic used to display the frame fine when I opened the file normally under Windows XP. However, now that I have switched to a Mac, I have found that Quicktime and VLC refuse to display the single frame, and I couldn’t find a video player that would open it. So I resorted to the command line version of ffmpeg, which I installed via macports, to convert this single frame to a jpg file to view as normal. This line worked a treat:

ffmpeg -i north.mpg -ss 00:00:00 -t 00:00:1 -s 1024x768 -r 1 -f mjpeg north.jpg

Where ‘north.mpg’ was the m2v file, and ‘north.jpg’ was the output jpeg.

And this:

find . -name '*.mpg' -exec ffmpeg -i {} -ss 00:00:00 -t 00:00:1 -s 1024x768 -r 1 -f mjpeg {}.jpg \;

Will go through all the mpg files in the current directory and below, and create their jpeg single frame equivalents, i.e. for north.mpg it will create north.mpg.jpg.
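If you would rather have north.jpg than north.mpg.jpg, the same batch job can be sketched in Python 3. This is an illustration only: it reuses the ffmpeg options from the command above, the function names are my own, and as written it just prints the commands rather than running them.

```python
import os


def frame_command(mpg_path):
    """Build the ffmpeg argument list that converts one single-frame mpg to a jpg."""
    jpg_path = os.path.splitext(mpg_path)[0] + ".jpg"
    return ["ffmpeg", "-i", mpg_path, "-ss", "00:00:00", "-t", "00:00:1",
            "-s", "1024x768", "-r", "1", "-f", "mjpeg", jpg_path]


def find_mpg_files(root):
    """Yield the path of every .mpg file in root and its subdirectories."""
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".mpg"):
                yield os.path.join(directory, name)


if __name__ == "__main__":
    # Print the commands so they can be checked before anything is converted
    for path in find_mpg_files("."):
        print(" ".join(frame_command(path)))
```

To actually run the conversions, pass each list returned by `frame_command` to `subprocess.call`.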

Java 1.6 on RHEL4

After I wrote a Java application in JDK 1.6, I was stuck for a while when I realised that the target deployment machine was Red Hat Enterprise Linux 4. RHEL4 does not support Java 1.6 in its default configuration.

Luckily I found this article on the CentOS wiki which included instructions on how to install Java 1.6 on CentOS 4. Remembering that RHEL4 and CentOS 4 are almost identical, I tried the method supplied, and it worked. This is the page with the method:


Test Driven Systems Development with Nagios

Nagios can be seen as an automated test tool for systems, just as you would have automated tests for software projects. In test driven development (TDD), you write the tests first, and then use those tests to build up a software project that you can be confident works. We can use the same method to build up systems, or networks of systems. Plan out which services and processes should be running on your new systems, and then implement a Nagios check for every one. You can then follow the progress of your build by checking Nagios. I have been doing this at the BBC. It is a simple idea but one that seems to work.
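To make the idea concrete, here is a minimal Python 3 sketch of the kind of check you might write before the service exists. It relies on the standard Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL. The function name and the injectable `connect` parameter are my own choices for the example, and this is not a check we run at the BBC.

```python
import socket

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_service(host, port, timeout=3.0, connect=socket.create_connection):
    """Return (exit_code, message) describing whether host:port accepts connections."""
    try:
        connection = connect((host, port), timeout)
    except OSError as error:
        # Covers refused connections, timeouts and unresolvable host names
        return CRITICAL, "CRITICAL - %s:%d unreachable (%s)" % (host, port, error)
    connection.close()
    return OK, "OK - %s:%d is accepting connections" % (host, port)
```

A plugin script would print the message and exit with the code. Written before MySQL is installed on a new box, a check like this starts out CRITICAL and turns OK once the build reaches that step, which is the systems equivalent of a failing test going green.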

JSoup Method for Page Scraping


I’m currently in the process of writing a web scraper for the forums on Gaia Online. Previously, I used to use Python to develop web scrapers, with the very handy Python library BeautifulSoup. Java has an equivalent called JSoup.

Here I have written a class which is extended by each class in my project that wants to scrape HTML. This ‘Scraper’ class deals with the fetching of the HTML and converting it into a JSoup tree to be navigated and have the data picked out of. It advertises itself as a ‘web spider’ type of web agent and also adds a 0-7 second random wait before fetching the page to make sure it isn’t used to overload a web server. It also converts the entire page to ASCII, which may not be the best thing to do for multi-language web pages, but certainly has made the scraping of the English language site Gaia Online much easier.

Here it is:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.text.Normalizer;
import java.util.Random;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * Generic scraper object that contains the basic methods required to fetch
 * and parse HTML content. Extended by other classes that need to scrape.
 * (FetchSoupException is a custom exception class, not shown here.)
 * @author David
 */
public class Scraper {

        public String pageHTML = ""; // the HTML for the page
        public Document pageSoup; // the JSoup scraped hierarchy for the page

        public String fetchPageHTML(String URL) throws IOException {

            // this makes sure we don't scrape the same page twice
            // (isEmpty() compares the content; != "" would only compare references)
            if (!this.pageHTML.isEmpty()) {
                return this.pageHTML;
            }

            System.getProperties().setProperty("httpclient.useragent", "spider");

            Random randomGenerator = new Random();
            int sleepTime = randomGenerator.nextInt(7000);
            try {
                Thread.sleep(sleepTime); // sleep for x milliseconds
            } catch (InterruptedException e) {
                // only fires if the thread is interrupted, should never happen
            }

            String pageHTML = "";

            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(URL);

            HttpResponse response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();

            if (entity != null) {
                InputStream instream = entity.getContent();
                String encoding = "UTF-8";

                StringWriter writer = new StringWriter();
                IOUtils.copy(instream, writer, encoding);

                pageHTML = writer.toString();
                // convert entire page scrape to ASCII-safe string
                // (the backslash must be doubled inside a Java string literal)
                pageHTML = Normalizer.normalize(pageHTML, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
            }

            return pageHTML;
        }

        public Document fetchPageSoup(String pageHTML) throws FetchSoupException {
            // this makes sure we don't soupify the same page twice
            if (this.pageSoup != null) {
                return this.pageSoup;
            }

            // nothing to parse if no HTML has been supplied
            if (pageHTML == null || pageHTML.isEmpty()) {
                throw new FetchSoupException("We have no supplied HTML to soupify.");
            }

            Document pageSoup = Jsoup.parse(pageHTML);

            return pageSoup;
        }
}

Then each class subclasses this scraper class, and adds the actual drilling down through the JSoup hierarchy tree to get what is required:

this.pageHTML = this.fetchPageHTML(this.rootURL);
this.pageSoup = this.fetchPageSoup(this.pageHTML);

// get the first section on the page
Element forumPageLinkSection = this.pageSoup.getElementsByAttributeValue("id","forum_hd_topic_pagelinks").first();
// get all the links in the above section
Elements forumPageLinks = forumPageLinkSection.getElementsByAttribute("href");
...

I’ve found that this method provides a simple and effective way of scraping pages and using the resultant JSoup tree to pick out important data.
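For comparison, the two "polite scraper" ideas in the Scraper class, the ASCII conversion and the random wait, look like this in Python 3. This is only a sketch of those two steps with function names of my own, not a port of the whole class.

```python
import random
import unicodedata


def to_ascii(text):
    """Decompose accented characters (NFD) and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def polite_delay(max_seconds=7.0, rng=random):
    """Pick a random wait of up to max_seconds, to be passed to time.sleep()."""
    return rng.uniform(0, max_seconds)


print(to_ascii("caf\u00e9 na\u00efve"))  # cafe naive
```

As in the Java version, accented letters keep their base letter, while characters with no ASCII base are dropped entirely, which is why this is a poor fit for multi-language pages.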

Disabling Control-Enter and Control-B shortcut keys in Outlook 2003

At work, I still have to use Windows XP and Outlook 2003. I don’t particularly mind this, except when I draft an email to someone and accidentally press Control-B instead of Control-V. Control-B will go ahead and send your partially composed email, resulting in some embarrassment as you have to tell everyone to disregard it.

So I wanted to remove the ‘send email’ shortcut keys in Outlook 2003. There are two ways of doing this, one involves editing your group policy, which is something only my IT administration team can do, and I didn’t want to have to involve them. The other way is by making a change to your registry, which I will describe here.

  1. Open up regedit, and browse to the following registry key: HKEY_CURRENT_USER -> Software -> Policies -> Microsoft -> office -> 11.0 -> outlook
  2. Then create a new key called: “DisabledShortcutKeysCheckBoxes”.
  3. Under that key, create two new String Values:
    Name: CtrlB Data: 66,8
    Name: CtrlEnter Data: 13,8
  4. Then restart Outlook and those keys will be disabled.

Click on the thumbnail below to see what the finished edit should look like:

Directory names not visible under ls? Change your colours.

There is a problem I frequently encounter on Redhat/Fedora/CentOS systems with the output of the ls command. Under those distributions, the default setup is to display directories in a very dark colour. If you usually use a white foreground and a black background on your terminal client (such as Putty) then you will struggle to read the names of the directories under Redhat-based distributions.

There are two solutions that I have used:

1. Change the colour settings in Putty

If you use Putty, ticking ‘Use System Colours’ here changes the “white foreground, black background” default into a “white background, black foreground”. This way you can at least read the console properly, good for a quick fix. You can also save these settings in putty to be the default for the host that you are connecting to, or even all hosts.

2. Change the LS_COLORS directive temporarily in the shell.

Alternatively, you can ask the ls command to display directories and other entries in colours that you specify. You could add these lines to the bottom of your .bashrc to make the change permanent, or if you are using a shared machine, just copy and paste the following lines into the terminal and they will change the colours to a reddish, more visible set until you log out:

alias ls='ls --color' # just to make sure we are using coloured ls
export LS_COLORS

(Original source for this particular LS_COLORS combo: http://linux-sxs.org/housekeeping/lscolors.html)

Scraping Gumtree Property Adverts with Python and BeautifulSoup

I am moving to Manchester soon, and so I thought I’d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once in the database, I could do things like find the average monthly price for a 2 bedroom flat in an area, and spot bargains by looking at how many standard deviations a price sits below the mean, all with simple SQL queries via phpMyAdmin.

I really like the Python library BeautifulSoup for writing scrapers (there is also a Java equivalent called JSoup). BeautifulSoup does a really good job of tolerating markup mistakes in the input data, and transforms a page into a tree structure that is easy to work with.

I chose the following layout for the program:

advert.py – Stores all information about each property advert, with a ‘save’ method that inserts the data into the mysql database
listing.py – Stores all the information on each listing page, which is broken down into links for specific adverts, and also the link to the next listing page in the sequence (ie: the ‘next page’ link)
scrapeAdvert.py – When given an advert URL, this creates and populates an advert object
scrapeListing.py – When given a listing URL, this creates and populates a listing object
scrapeSequence.py – This walks through a series of listings, calling scrapeListing and scrapeAdvert for all of them, and finishes when there are no more listings in the sequence to scrape
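The walk that scrapeSequence.py performs can be summarised in a few lines. This is a Python 3 sketch of the control flow only, not the actual file: `scrape_listing` and `scrape_advert` are stand-in callables that I pass in so the loop can be shown (and tested) without touching the network.

```python
def walk_listings(first_url, scrape_listing, scrape_advert):
    """Follow 'next page' links from first_url and scrape each advert found.

    scrape_listing(url) must return (advert_urls, next_url_or_None);
    scrape_advert(url) returns whatever represents one scraped advert.
    """
    adverts = []
    seen_pages = set()
    url = first_url
    # Stop when there is no next page, or if a page links back to one already visited
    while url and url not in seen_pages:
        seen_pages.add(url)
        advert_urls, next_url = scrape_listing(url)
        for advert_url in advert_urls:
            adverts.append(scrape_advert(advert_url))
        url = next_url
    return adverts


# Hypothetical two-page sequence standing in for real listing pages
pages = {"page1": (["ad1", "ad2"], "page2"), "page2": (["ad3"], None)}
print(walk_listings("page1", lambda u: pages[u], lambda u: u.upper()))
```

The `seen_pages` guard is an extra safety net of mine, so that a listing which links back to an earlier page cannot keep the scraper running forever.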

Here is the MySQL table I created for this project (which you will have to setup if you want to run the scraper):

-- Database: `manchester`

-- --------------------------------------------------------

-- Table structure for table `adverts`

CREATE TABLE `adverts` (
  `url` varchar(255) NOT NULL,
  `title` text NOT NULL,
  `pricePW` int(10) unsigned NOT NULL,
  `pricePCM` int(11) NOT NULL,
  `location` text NOT NULL,
  `dateAvailable` date NOT NULL,
  `propertyType` text NOT NULL,
  `bedroomNumber` int(11) NOT NULL,
  `description` text NOT NULL,
  PRIMARY KEY (`url`)
);

PricePCM is price per calendar month, PricePW is price per week. Usually each advert will have one or the other specified.
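The bargain-spotting idea can also be tried outside the database. This Python 3 sketch does the calculation in memory instead of in SQL, which is a swap on my part. The function name, the 1.5 standard deviation threshold and the sample prices are all assumptions for illustration.

```python
import statistics


def find_bargains(prices_by_url, threshold=1.5):
    """Return the URLs priced more than `threshold` standard deviations below the mean."""
    prices = list(prices_by_url.values())
    if len(prices) < 2:
        return []  # not enough data to talk about a spread
    mean = statistics.mean(prices)
    deviation = statistics.pstdev(prices)
    if deviation == 0:
        return []  # every advert costs the same
    cutoff = mean - threshold * deviation
    return sorted(url for url, price in prices_by_url.items() if price < cutoff)


# Hypothetical pricePCM values for 2 bedroom flats in one area
sample = {"ad1": 500, "ad2": 520, "ad3": 480, "ad4": 510, "ad5": 300}
print(find_bargains(sample))  # ['ad5']
```

In practice you would only compare like with like, for example one bedroomNumber and one location at a time, and ignore adverts whose pricePCM is 0 because they only list a weekly price.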


import MySQLdb
import chardet
import sys

class advert:

        url = ""
        title = ""
        pricePW = 0
        pricePCM = 0
        location = ""
        dateAvailable = ""
        propertyType = ""
        bedroomNumber = 0
        description = ""

        def save(self):
                # you will need to change the following to match your mysql credentials
                # (USERNAME and PASSWORD are placeholders):
                db = MySQLdb.connect(host="localhost", user="USERNAME", passwd="PASSWORD", db="manchester")
                cursor = db.cursor()

                self.description = unicode(self.description, errors='replace')
                self.description = self.description.encode('ascii','ignore')
                # TODO: might need to convert the other strings in the advert if there are any unicode conversion errors

                # parameterised query, so quotes in the advert text cannot break the SQL
                sql = "INSERT INTO adverts (url,title,pricePCM,pricePW,location,dateAvailable,propertyType,bedroomNumber,description) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                cursor.execute(sql, (self.url, self.title, self.pricePCM, self.pricePW, self.location, self.dateAvailable, self.propertyType, self.bedroomNumber, self.description))
                db.commit()
                db.close()


In advert.py we convert the unicode output that BeautifulSoup gives us into plain ASCII so that we can put it in the MySQL database without any problems. I could have used Unicode in the database as well, but the chances of really needing Unicode for representing Gumtree ads is quite slim. If you intend to use this code then you will also want to enter the MySQL credentials for your database.


class listing:


        def addAdvertURL(self,url):



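# -*- coding: utf-8 -*-
# (the encoding line is needed because this file contains a literal pound sign)
from BeautifulSoup import BeautifulSoup          # For processing HTML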
import urllib2
from advert import advert
import time

class scrapeAdvert:

        page = ""
        soup = ""

        def scrape(self,advertURL):

                # give it a bit of time so gumtree doesn't
                # ban us

                url = advertURL
                # print "-- scraping "+url+" --"
                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.anAd = advert()

                self.anAd.url = url
                self.anAd.title = self.extractTitle()
                self.anAd.pricePW = self.extractPricePW()
                self.anAd.pricePCM = self.extractPricePCM()

                self.anAd.location = self.extractLocation()
                self.anAd.dateAvailable = self.extractDateAvailable()
                self.anAd.propertyType = self.extractPropertyType()
                self.anAd.bedroomNumber = self.extractBedroomNumber()
                self.anAd.description = self.extractDescription()

        def extractTitle(self):

                location = self.soup.find('h1')
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractPricePCM(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                try:
                        string = location.contents[0]
                        string.index('pcm') # raises ValueError when the price is not per calendar month
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pw specified
                        return 0

                stripped = string.replace(u'£','')
                stripped = stripped.replace('pcm','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'"')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractPricePW(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                try:
                        string = location.contents[0]
                        string.index('pw') # raises ValueError when the price is not per week
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pcm specified
                        return 0
                stripped = string.replace(u'£','')
                stripped = stripped.replace('pw','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'"')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractLocation(self):

                location = self.soup.find('span',attrs={"class" : "location"})
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractDateAvailable(self):

                current_year = '2011'

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                firstP = ul.findAll('p')[0]
                string = firstP.contents[0]
                stripped = ' '.join(string.split())
                date_to_convert = stripped + '/'+current_year
                try:
                        date_object = time.strptime(date_to_convert, "%d/%m/%Y")
                except ValueError: # for adverts with no date available
                        return ""

                full_date = time.strftime('%Y-%m-%d %H:%M:%S', date_object)
                # print '|' + full_date + '|'
                return full_date

        def extractPropertyType(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                try:
                        secondP = ul.findAll('p')[1]
                except IndexError: # for properties with no type
                        return ""
                string = secondP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractBedroomNumber(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                try:
                        thirdP = ul.findAll('p')[2]
                except IndexError: # for properties with no bedroom number
                        return 0
                string = thirdP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'"')
                # print '|' + stripped + '|'
                return stripped

        def extractDescription(self):

                div = self.soup.find('div',attrs={"id" : "description"})
                description = div.find('p')
                contents = description.renderContents()
                contents = contents.replace("'",'"')
                # print '|' + contents + '|'
                return contents

In scrapeAdvert.py there are a lot of string manipulation statements to pull out any unwanted characters, such as the ‘pw’ characters (short for per week) found in the price string, which we need to remove in order to store the property price per week as an integer.
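As a rough sketch of that clean-up step, the function below strips the currency symbol, thousands separator and unit suffix from a price string before converting it to an integer. It is written in Python 3 rather than the Python 2 of the scraper, and parse_price and its arguments are hypothetical names, not part of scrapeAdvert.py:

```python
def parse_price(text, suffix):
    """Return the price in `text` as an int, or 0 if it is not quoted in `suffix` units."""
    cleaned = text.replace('£', '').replace(',', '')
    cleaned = ' '.join(cleaned.split())  # collapse stray whitespace
    if not cleaned.endswith(suffix):
        return 0  # e.g. a 'pcm' price when we asked for 'pw'
    try:
        return int(cleaned[:-len(suffix)].strip())
    except ValueError:
        return 0  # nothing numeric left after stripping


print(parse_price(' £1,250pcm ', 'pcm'))
print(parse_price('£150pw', 'pw'))
print(parse_price('£150pw', 'pcm'))
```

Returning 0 for a price quoted in the other unit mirrors what the two extract methods above are meant to do.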

Using BeautifulSoup to pull out elements is quite easy, for example:

ul = self.soup.find('ul',attrs={"id" : "ad-details"})

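That returns the first ul element whose id is “ad-details”, in other words the list that holds the advert’s details. Calling findAll('p') on the result then returns the paragraph elements inside that list. More detail can be found in the Beautiful Soup documentation, which is very good.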


from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2
from listing import listing
import time

class scrapeListing:

        soup = ""
        url = ""
        aListing = ""

        def scrape(self,url):
                # give it a bit of time so gumtree doesn't
                # ban us

                print "scraping url = "+str(url)

                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.aListing = listing()
                self.aListing.url = url
                self.aListing.adverturls = self.extractAdvertURLs()
                self.aListing.nextLink = self.extractNextLink()

        def extractAdvertURLs(self):

                toReturn = []
                h3s = self.soup.findAll("h3")
                for h3 in h3s:
                        links = h3.findAll('a',{"class":"summary"})
                        for link in links:
                                print "|"+link['href']+"|"
                                toReturn.append(link['href'])

                return toReturn

        def extractNextLink(self):

                links = self.soup.findAll("a",{"class":"next"})
                try:
                        print ">"+links[0]['href']+">"
                except IndexError: # if there is no 'next' link found..
                        return ""
                return links[0]['href']

The extractNextLink method here extracts the pagination ‘next’ link which will bring up the next listing page from the selection of listing pages to browse. We use it to step through the pagination ‘sequence’ of resultant listing pages.
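The stepping logic can be sketched on its own, without any network access. This is a minimal Python 3 illustration, not the scraper's actual code: crawl, fetch_listing and FAKE_PAGES are hypothetical names, and the dictionary stands in for real listing pages:

```python
def crawl(start_url, fetch_listing):
    """Yield advert URLs from every listing page, following 'next' links."""
    url = start_url
    seen = set()
    while url and url not in seen:  # stop on an empty link or a loop
        seen.add(url)
        advert_urls, next_link = fetch_listing(url)
        for advert_url in advert_urls:
            yield advert_url
        url = next_link  # an empty string ends the crawl


# Stand-in for real pages: url -> (advert URLs on the page, next link)
FAKE_PAGES = {
    'page1': (['ad/1', 'ad/2'], 'page2'),
    'page2': (['ad/3'], ''),
}

print(list(crawl('page1', FAKE_PAGES.get)))
```

Keeping a set of visited pages is an extra safeguard against a 'next' link that points back to an earlier page.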


from scrapeListing import scrapeListing
from scrapeAdvert import scrapeAdvert
from listing import listing
from advert import advert
import MySQLdb
import _mysql_exceptions

# change this to the gumtree page you want to start scraping from
url = "http://www.gumtree.com/flats-and-houses-for-rent/salford-quays"

while url:
        print "scraping URL = "+url
        sl = scrapeListing()
        sl.scrape(url)
        for advertURL in sl.aListing.adverturls:
                sa = scrapeAdvert()
                try:
                        # assumes scrapeAdvert.scrape takes the advert URL, as
                        # scrapeListing.scrape does, and stores the advert
                        sa.scrape(advertURL)
                except _mysql_exceptions.IntegrityError:
                        print "** Advert " + sa.anAd.url + " already saved **"
                sa.anAd = ""

        url = ""
        if sl.aListing.nextLink:
                print "nextLink = "+sl.aListing.nextLink
                url = sl.aListing.nextLink

print 'all done.'

This is the file you run to kick off the scrape. It wraps the saving of each advert in a try/except block that catches MySQL’s IntegrityError, which is how it detects that an advert has already been entered into the database: the URL of the advert is the primary key, and no two records can have the same primary key, so inserting a duplicate raises the error.
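The duplicate check relies only on the primary key constraint, so it can be sketched without MySQL. The example below is a stand-in under stated assumptions: it uses Python 3's built-in sqlite3 module instead of MySQLdb, and the adverts table and save_advert function are hypothetical:

```python
import sqlite3


def save_advert(conn, url, title):
    """Insert an advert; return False if its URL is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO adverts (url, title) VALUES (?, ?)',
                         (url, title))
        return True
    except sqlite3.IntegrityError:
        return False  # primary key clash: advert already saved


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE adverts (url TEXT PRIMARY KEY, title TEXT)')
print(save_advert(conn, 'http://example.com/ad/1', 'Two bed flat'))
print(save_advert(conn, 'http://example.com/ad/1', 'Two bed flat'))
```

Letting the database reject the duplicate avoids a separate "does this URL exist?" query before every insert.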

The URL you set at the top of the file is the starting page for the scrape.

The above code worked well for scraping several hundred Manchester Gumtree ads into a database, from which point I was able to use a combination of phpMyAdmin and OpenOffice Spreadsheet to analyse the data and find out useful statistics about the property market in said area.

Download the scraper source code in a tar.gz archive

Note: Due to the nature of web scraping, if – or more accurately, when – Gumtree changes its user interface, the scraper I have written will need to be tweaked accordingly to find the right data. This is meant to be an informative tutorial, not a finished product.

RESTful Web Services

Hammock with the background of a clear blue sky

REST (Representational State Transfer) is an architectural style for delivering web services. When a web service conforms to REST, it is known as RESTful. The largest system built along REST principles is the World Wide Web itself, which uses the Hypertext Transfer Protocol (HTTP) every time you send and receive information from web servers while browsing the internet.

To implement RESTful web services, you should implement four methods: GET, PUT, POST and DELETE. Resources on RESTful web services are typically defined as collections of elements. The REST methods can either act on a whole collection, or a specific element in a collection.

A collection is usually logically defined as a hierarchy on the URL, for example take this fictitious layout:

Collection: http://www.bbc.co.uk/iplayer/programmes/
Element: http://www.bbc.co.uk/iplayer/programmes/24
Element: http://www.bbc.co.uk/iplayer/programmes/25
Element: http://www.bbc.co.uk/iplayer/programmes/26

The REST methods you use do different things depending on whether you are interacting with a Collection resource or an Element resource. See below:

On a Collection, e.g. http://www.bbc.co.uk/iplayer/programmes/
GET – List the URLs of the collection’s members.
PUT – Replace the entire collection with another collection.
POST – Create a new element in the collection, returning the new element’s URL.
DELETE – Delete the entire collection.

On an Element, e.g. http://www.bbc.co.uk/iplayer/programmes/24
GET – Retrieve the addressed element in the appropriate internet media type, e.g. a music file or an image.
PUT – Replace the addressed element of the collection, or if it doesn’t exist, create it in the parent collection.
POST – Treat the addressed element of the collection as a new collection, and add an element into it.
DELETE – Delete the addressed element of the collection.

REST is a simple and clear way of implementing the basic methods of data storage; CRUD (Create, Read, Update and Delete), see: http://en.wikipedia.org/wiki/Create,_read,_update_and_delete
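To make the two lists above concrete, here is a minimal in-memory sketch in Python 3. It is not a web server: the Collection class and its handle method are hypothetical names, integer ids stand in for element URLs, and POST on an element is left out for brevity:

```python
class Collection:
    """In-memory stand-in for a REST collection such as /iplayer/programmes/."""

    def __init__(self):
        self.elements = {}  # element id -> stored representation

    def handle(self, method, element_id=None, body=None):
        if element_id is None:
            return self._on_collection(method, body)
        return self._on_element(method, element_id, body)

    def _on_collection(self, method, body):
        if method == 'GET':       # list the members
            return sorted(self.elements)
        if method == 'PUT':       # replace the entire collection
            self.elements = dict(body)
            return None
        if method == 'POST':      # create a new element, return its id
            new_id = max(self.elements, default=0) + 1
            self.elements[new_id] = body
            return new_id
        if method == 'DELETE':    # delete the entire collection
            self.elements = {}
            return None
        raise ValueError('unsupported method: ' + method)

    def _on_element(self, method, element_id, body):
        if method == 'GET':       # retrieve the addressed element
            return self.elements[element_id]
        if method == 'PUT':       # replace it, or create it if missing
            self.elements[element_id] = body
            return None
        if method == 'DELETE':    # delete the addressed element
            del self.elements[element_id]
            return None
        raise ValueError('unsupported method: ' + method)


programmes = Collection()
programmes.handle('POST', body='programme A')
programmes.handle('PUT', 24, 'programme B')
print(programmes.handle('GET'))
print(programmes.handle('GET', 24))
```

The four branches in each helper line up with Create, Read, Update and Delete, which is why REST maps so naturally onto CRUD storage.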

‘Weather Forecast’ Calendar Service in PHP

The BBC provide 3 day weather RSS feeds for most locations in the UK. I thought it would be interesting to create a web service to turn the weather feed into calendar feed format, so I could have a constantly updated forecast of the next 3 days of weather mapped on to my iPhone’s calendar. Here it is on my iPhone:

Picture shows weather forecast on an iPhone calendar screenshot


The service is separated into five files:

  • ical.php – this contains the class ical which corresponds to a single calendar feed. A method called ‘addevent’ allows you to add new events to the calendar, and a method called ‘returncal’ redirects the resulting calendar file to the browser so people can subscribe to it using their calendar application.
  • forecast.php – this file contains the class forecast, which has properties for every aspect of each day’s forecast that we want to record, e.g. Wind Speed and Humidity. It also contains the forecast set, which is a collection of forecast objects. The set class is serializable, which means the whole set of forecast objects, including the Wind Speed, Humidity and everything else we record for each day, can be stored in a text file.
  • scrape-weather.php – this file contains code that scrapes the weather feed, populates the forecast set with all the weather information for the next 3 days, and stores the result in a file called forecasts.ser.
  • forecasts.ser – this is all the data for the three day weather forecast, in serialized format. It is automatically deleted and recreated when the scrape-weather.php script is run.
  • reader.php – this file converts the forecasts.ser file into an iCal calendar, and outputs the iCal formatted result to the calendar application that accesses the reader.php page.

It uses two external libraries:

  • MagpieRSS 0.72 – this popular library is used for reading the weather RSS feed and converting it into a PHP object that is easier for scrape-weather.php to manipulate.
  • iCalcreator 2.8 – this is used for creating the output iCal format of the calendar in ical.php and outputting it to the browser in reader.php.
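For readers who have not seen the iCal format, here is a rough Python 3 sketch of the kind of text a calendar feed contains for one all-day forecast event. It does not use iCalcreator; forecast_event, build_calendar, the UID domain and the sample forecast are all hypothetical, and escaping of commas and semicolons in the text fields is left out:

```python
from datetime import date, datetime, timedelta


def forecast_event(day, summary, description, stamp):
    """Return the text lines of an all-day VEVENT for one day's forecast."""
    ymd = '%Y%m%d'
    return [
        'BEGIN:VEVENT',
        'UID:forecast-%s@weather.example' % day.strftime(ymd),
        'DTSTAMP:%s' % stamp.strftime('%Y%m%dT%H%M%SZ'),  # stamp assumed UTC
        'DTSTART;VALUE=DATE:%s' % day.strftime(ymd),
        # for all-day events DTEND is the day after the last day covered
        'DTEND;VALUE=DATE:%s' % (day + timedelta(days=1)).strftime(ymd),
        'SUMMARY:%s' % summary,
        'DESCRIPTION:%s' % description,
        'END:VEVENT',
    ]


def build_calendar(events):
    """Wrap a list of VEVENT line lists in a VCALENDAR and join with CRLF."""
    lines = ['BEGIN:VCALENDAR', 'VERSION:2.0',
             'PRODID:-//example//weather sketch//EN']
    for event in events:
        lines.extend(event)
    lines.append('END:VCALENDAR')
    return '\r\n'.join(lines) + '\r\n'


stamp = datetime(2011, 1, 1, 0, 0, 0)
event = forecast_event(date(2011, 1, 2), 'Light Snow', 'Max: 2 Min: -1', stamp)
print(build_calendar([event]))
```

A library such as iCalcreator takes care of the escaping, line folding and time zone details that this sketch skips.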



	function init(){
		$config = array( 'unique_id' => 'weather.davidcraddock.net' );
		  // set Your unique id
		$this->v = new vcalendar( $config );
		  // create a new calendar instance

		$this->v->setProperty( 'method', 'PUBLISH' );
		  // required of some calendar software
		$this->v->setProperty( "x-wr-calname", "Calendar Sample" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-CALDESC", "Calendar Description" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-TIMEZONE", "Europe/London" );
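		  // required of some calendar software
	}

	// the parameters after $start_min are inferred from the names used in the body
	function addevent($start_year,$start_month,$start_day,$start_hour,$start_min,$finish_year,$finish_month,$finish_day,$finish_hour,$finish_min,$summary,$description,$comment){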
		$vevent = & $this->v->newComponent( 'vevent' );
		  // create an event calendar component
		$start = array( 'year'=>$start_year, 'month'=>$start_month, 'day'=>$start_day, 'hour'=>$start_hour, 'min'=>$start_min, 'sec'=>0 );
		$vevent->setProperty( 'dtstart', $start );
		$end = array( 'year'=>$finish_year, 'month'=>$finish_month, 'day'=>$finish_day, 'hour'=>$finish_hour, 'min'=>$finish_min, 'sec'=>0 );
		$vevent->setProperty( 'dtend', $end );
		$vevent->setProperty( 'LOCATION', '' );
		  // property name - case independent
		$vevent->setProperty( 'summary', $summary );
		$vevent->setProperty( 'description',$description );
		$vevent->setProperty( 'comment', $comment );
		$vevent->setProperty( 'attendee', 'contact@davidcraddock.net' );
	}

	function returncal(){
		// redirect calendar file to browser
		$this->v->returnCalendar();
	}

		// (fragment of the forecast set class in forecast.php)
		$this->forecasts = new ArrayObject();

	function store(){
		$store_path = $this->store_path;
		file_put_contents($store_path, serialize($this->set));
	}

	function scrapecurrent(){
		$url = $this->feed_url;
		$rss = fetch_rss( $url );
		$message = "";
		if(sizeof($rss->items) != 3){
			die("Problem with BBC weather feed.. dying");
		}
		$set = new forecast_set();
		$curdate = date("Y-m-d");
		echo $curdate;
		foreach ($rss->items as $item) {
			$href = $item['link'];
			$title = $item['title'];
			$description = $item['description'];
			$curyear = date('Y',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curmonth = date('m',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curday = date('d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			preg_match('/Min Temp:.+?-*\d*/',$title,$mintemp);
			preg_match('/Max Temp:.+?-*\d*/',$title,$maxtemp);
			preg_match('/Wind Speed:.+?-*\d*/',$description,$windspeed);
			$summary[0] = str_replace(': ','',$summary[0]);
			$summary[0] = str_replace(',','',$summary[0]);
			$mintemp[0] = str_replace('Min Temp: ','',$mintemp[0]);
			$maxtemp[0] = str_replace('Max Temp: ','',$maxtemp[0]);
			$windspeed[0] = str_replace('Wind Speed: ','',$windspeed[0]);
			$humidity[0] = str_replace('Humidity: ','',$humidity[0]);
			$mins[$i] = (int)$mintemp[0];	
			$maxs[$i] = (int)$maxtemp[0];
			$forecast = new forecast();
			$forecast->low = (int)$mintemp[0];
			$forecast->high = (int)$maxtemp[0];
			$forecast->year = (int)$curyear;
			$forecast->month = (int)$curmonth;
			$forecast->day = (int)$curday;
			$forecast->windspeed = $windspeed[0];
			$forecast->humidity = $humidity[0];
			$forecast->summary = ucwords($summary[0]);
			$curdate = date('Y-m-d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
		}
		$this->set = $set;
	}


$s = new scrape3day();

$c = new ical();
$f = unserialize(file_get_contents('forecasts.ser'));
	$weather_digest = "Max: ".$curforecast->high." Min: ".$curforecast->low." Humidity: ".$curforecast->humidity."% Wind Speed: ".$curforecast->windspeed."mph.";

SVN Version

If you have subversion, you can check out the project from: http://svn.davidcraddock.net/weather-services/. There are a couple extra files in that directory for my automated freezing weather alerts, but you can safely ignore those.


You will have to add an entry to your crontab so that the scrape script runs once per day. For example, to run it at midnight, add a line of this form:

0 0 * * * /path/to/php /path/to/scrape-weather.php

For example, in my case:

0 0 * * * /usr/local/bin/php /home/david_craddock/work.davidcraddock.net/weather/scrape-weather.php 

You will then need to edit the $store_path and $feed_url variables in scrape-weather.php. $store_path should refer to a file path in which the web server can create and edit files, and $feed_url should be the RSS feed for your local area, copied from the http://news.bbc.co.uk/weather/ site. Don’t use mine, because your area is likely to be different. After that, you’re set to go.

Find large files by using the OSX commandline

To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal:

sudo find / -size +500000 -print

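This will find and print the paths of files larger than 500,000 blocks. With no unit suffix, find counts in 512-byte blocks, so that is roughly 250MB; use +1000000 for files over about 500MB. You can then go through them and delete them individually by typing rm "<file path>", although there is no undelete, so make sure you won’t miss them.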
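A similar search can be sketched in Python 3, where the threshold is given directly in bytes rather than in blocks. The function name find_large_files is hypothetical, and the sketch skips symbolic links and anything it cannot read:

```python
import os


def find_large_files(root, min_bytes):
    """Yield (size, path) for regular files under root of at least min_bytes."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # don't follow or count symbolic links
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            if size >= min_bytes:
                yield size, path


# Example use (can take a while on a full disk):
#   for size, path in sorted(find_large_files('/', 500 * 1000 * 1000), reverse=True):
#       print('%10.1f MB  %s' % (size / 1e6, path))
```

Sorting the results largest first makes it easier to see which files are worth deleting.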

Finding files in Linux modified between two dates

You use the ‘touch’ command to create two blank files with last-modified dates that you specify: one dated at the start of the range you want, and the second dated at the end. Then you refer to those two files in your find command. Note that -newer compares modification times, whereas -cnewer compares each file’s inode change time, which also moves when permissions or ownership change:

touch /tmp/temp -t 200604141130
touch /tmp/ntemp -t 200604261630
find /data/ -newer /tmp/temp -and ! -newer /tmp/ntemp
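The same idea, comparing each file's modification time against the two ends of the range, can be sketched in Python 3 without the temporary marker files. The function name modified_between is hypothetical:

```python
import os
from datetime import datetime


def modified_between(root, start, end):
    """Yield paths under root whose modification time lies in (start, end]."""
    lo, hi = start.timestamp(), end.timestamp()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # unreadable or vanished file
            if lo < mtime <= hi:
                yield path


# Example use, with the same range as the touch commands above:
#   start = datetime(2006, 4, 14, 11, 30)
#   end = datetime(2006, 4, 26, 16, 30)
#   for path in modified_between('/data/', start, end):
#       print(path)
```

The datetimes are interpreted in the local time zone, which matches how touch -t reads its argument.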

Writing simple email alerts in PHP with MagpieRSS

I wrote an email alerter that sends me an email whenever the upcoming temperature may dip below freezing. It uses the Magpie RSS reader to pull down a 3 day weather forecast that is provided for my area in RSS form by the BBC weather site. It then parses this forecast and determines if either today’s or tomorrow’s weather may dip below freezing. If it might, it sends an email to my email address to warn me.

I scheduled this script to run every day by adding it as a daily cron job on my web host. You can set this up for any web hosts that support cron jobs.

        if(sizeof($rss->items) != 3){
                $message .= 'Error: problem parsing BBC weather feed';
        }
        $i = 0;
        foreach ($rss->items as $item) {
                $href = $item['link'];
                $title = $item['title'];
                preg_match('/Min Temp:.+?-*\d*/',$title,$mintemp);
                preg_match('/Max Temp:.+?-*\d*/',$title,$maxtemp);
                $mintemp[0] = str_replace('Min Temp: ','',$mintemp[0]);
                $maxtemp[0] = str_replace('Max Temp: ','',$maxtemp[0]);
                $mins[$i] = (int)$mintemp[0];
                $maxs[$i] = (int)$maxtemp[0];
                $i++;
        }

        // freezing warnings

        if($mins[0] < 0){
                $message .= "Today's temperature in W3 may go below freezing, anything down to ".$mins[0];
        }

You can right click on this link and ‘save as’ to download the script.
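The parsing and threshold check can also be sketched in Python 3. This is an illustration rather than a port of the script: the function names are hypothetical, the sample titles are made up, and the title layout ("Min Temp: -2°C") is an assumption about the feed:

```python
import re


def min_temp(title):
    """Return the minimum temperature quoted in a feed item title, or None."""
    match = re.search(r'Min Temp:\s*(-?\d+)', title)
    return int(match.group(1)) if match else None


def freezing_warnings(titles):
    """Build warning text for today (titles[0]) and tomorrow (titles[1])."""
    message = ''
    for label, title in zip(('Today', 'Tomorrow'), titles):
        low = min_temp(title)
        if low is not None and low < 0:
            message += "%s's temperature may go below freezing, down to %d\n" % (label, low)
    return message


titles = [
    'Saturday: sunny intervals, Max Temp: 4°C (39°F), Min Temp: -2°C (28°F)',
    'Sunday: light rain, Max Temp: 7°C (45°F), Min Temp: 3°C (37°F)',
]
print(freezing_warnings(titles))
```

An empty message means there is nothing to warn about, so no email needs to be sent.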

Reverting back to a previous version in CVS – the magic “undo” feature

If you’ve committed some code into CVS and made a mistake in that commit, you will want to know how to revert to a previously saved version. Here is the command for command-line versions of CVS:

$ cvs update -D '1 week ago'

Run this command in the main directory of your checked out working copy. This will revert your working copy to the version of the code that was checked in ‘1 week ago’ from the present date. You can also use dates like “1 day ago” and “5 days ago”.

Then simply commit the changes with a log message:

$ cvs commit -m "Oops! Made a mistake, had to revert back to the 21/1/2011 version"

Netbeans for simple Java GUI Applications

I’ve been writing some simple Java GUI applications using the Netbeans IDE. It allows you to quickly make event-driven GUI applications, and generates a lot of skeleton code that you’ll need, but don’t necessarily want to type out. It reminds me of the IDE designer of Visual Basic 6, which allowed you to mock up simple GUIs with code in almost no time at all, although the VB language itself often proved difficult. With Netbeans you are using Java, and so you can make some powerful software with little effort.

Converting week numbers to dates

Here is some Python code I adapted from this Stack Overflow post to get the first day of a week specified by a week number. This method includes leap year and summer time differences.

import time
def weeknum(num,year):
	instr = str(year)+" "+str(num-1)+" 1"
	print time.asctime(time.strptime(instr,'%Y %W %w'))

Here is me executing the code in Python’s IDLE shell:

See that the first week of 2009 actually started in 2008, but by the end of that week we are in 2009.
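For comparison, here is a sketch that uses ISO 8601 week numbering instead of the %W convention in the code above, so the two will not always agree. It assumes Python 3.8 or later, where date.fromisocalendar is available, and week_start is a hypothetical name:

```python
from datetime import date


def week_start(year, week):
    """Return the Monday that starts ISO 8601 week `week` of `year`."""
    return date.fromisocalendar(year, week, 1)


print(week_start(2009, 1))  # 2008-12-29
```

Under ISO 8601, week 1 is the week containing the year's first Thursday, which is why week 1 of 2009 begins on a Monday in December 2008.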

MediaMonkey allows you to transfer music from any computer onto your guest iPhone

MediaMonkey is a popular free media player for Windows. It has a great feature that allows you to transfer to and from an iPhone that is not registered with your computer. Normally only one iTunes install can be associated with your iPhone, but MediaMonkey allows you another way to transfer music and audio files with a ‘guest’ iPhone. Check it out, it works:


Applications I Recommend

Software I use on my macbook & PC:

DVDRipper Pro for Mac – DVD ripping, can also rip to ISO
Handbrake for Mac – Transcoding from DVD rip to iPhone-playable file
iMovie for Mac – Video editing
BabasChess for Windows – Best chess client for internet play
Hypercam 2 – Best screencapture utility
Skype for both – For reliable messaging as well as voice and video chat
Virtual Clone Drive for Windows – For mounting ISO images
iTunes for Mac – Best music player, and keeps media synced with iPhone
VLC Player for both – For watching movies
DVD Player for Mac – For watching DVDs

iPhone Apps:

Skype – Best messenger
iBooks – Best ebook reader
London Buses – Best London transport router, can route via tube, bus, cycle path and foot
Tube Status – Displays the status of all lines, with any disruptions summarised
NextBuses – Great app that gives you lots of info on the buses and bus stops in your area.
Apple Remote – Apple remote, allows you to control the music on any wi-fi linked iTunes library
Chess.com’s Chess – Great chess game for vs. computer play
TasteKid – Type in a film, author, tv series.. and it will give you similar recommendations
Google Earth – Brilliant for navigational help, although I use iPhone’s inbuilt Maps first, for most things.
SomaFM – Chilled out relaxing electronica

Recording Game Videos on Windows 7

This is just a quick note to remind myself how I did this.

  • Hypercam2 is a good, free, video recorder that can cope with recording game videos. It’s freely available from http://www.hyperionics.com/hc/downloads.asp – just make sure when you install it you don’t tick on the spyware toolbar installation options.
  • My motherboard has a 5.1 digital soundcard built in. However the only way I can record off the soundcard is to plug in a standard audio cable from the speaker out (green) to the microphone in (orange).
  • The soundcard switches off the headphone output when it detects a speaker attached to the speaker out, so you have to go to the recording options in Windows 7 and right click on the microphone in. It will give you an option to ‘Monitor this input using the headphones’, which will allow you to listen to anything coming into the microphone socket through the headphone socket on the front of my PC.
  • In hypercam, set the sound to record from the default input device, set the frame rate to 10/10
  • Record using the ‘select window to record from’ option, select the game window, and use the F2 button to start and stop the recording.
  • The video will be output in AVI format, but you can transcode or convert it into a quicktime MOV file for editing in iMovie, or you can use windows movie editor, which is free and quite good.

Insights into a modern Indie Music label

I read this remarkable post on a public mailing list I subscribe to. I thought it was such a great insight into running a music label, that I just had to post it here. It discusses issues facing modern music, such as DRM, DMCA, and other ways of making (or losing) money. Fascinating.

Here it is:

I work for a (fairly small) indie label – from witnessing this model in action I feel I have to stick up for the label given that I see the model working (or sometimes not so well) on a daily basis! Where we’ve done deals with artists in the past, they’ve almost always been a 50/50 arrangement – the artist receives 50% of net royalties. Where a label fronts recording costs, these can easily become £6-10,000 for an album session. Even an EP session can be upwards of £1,500 although these figures are a little pessimistic (though not unrealistic). (We actually designed, built and owned studios for ten years until 2001 but the project haemorrhaged money.)

With regards to CD pressing, a 1,000 run will cost around £800 including full colour print in a basic jewel case. The AP1/AP2a MCPS licence costs another amount on top. When getting your CDs pressed, add in other things (Super Jewel cases, slip / O-cards, digipaks or gatefolds with high quality card / fancy posters) and you can easily top the 1k mark, not even counting the artwork design costs. Of course, discount comes with bulk, but almost nobody except the Big Four do >1k discs in a pressing. (To put things in perspective: when SyCo have done the X Factor Finalists CDs, they press up >10,000 of EACH finalist’s recording of the song – and shred the losers’ copies when the winner is announced!)

To put stuff into distro with someone like Universal, you have your line costs simply to have the title listed on their system – monthly recurring, per title – then handling costs, despatch costs, “salesforce” costs (even though really the only people they sell into are HMV now, and from last year they’ve stopped guaranteeing racking in all but the top 6 or so stores in the UK, it’s a joke). You can’t sell your discs through at full retail, you have your wholesale (Dealer) price. We’ve sold albums through at £6.65 and I’ve later seen them in a London HMV for £12.99. Oh, and did I mention that supermarkets and stores like HMV *DEMAND* what they call a “file discount” of up to 40% just to take stock? (which is on a non-negotiable sale or return basis with up to a six month returns period.)

If you end up in a position where you don’t sell stock through into shops, it usually costs less for your distro to SHRED your discs than it does to send it back to you! Ridiculous. The costs are stacked against the labels at all points – incredibly frustrating. And that’s even before you begin to contemplate any plugging, promo, advertising, miscellaneous online, merch, booking agent / gig costs… Or even an advance for the artist! But it gets better…

So, this figure of 63%, which the old techdirt article might quote as truth, may be valid for major labels (who might also own distribution, management, publishing and studios under the same roof), but the model quickly falls apart as soon as you focus on a smaller label. I used to think the whole model was bullshit and the artists got shafted, but if anything it’s level pegging – smaller labels have just as tough a time as artists as the risk to them to fund any new release is proportionally WAY larger. Also, the techdirt article works on the basis of the artist receiving a 20% royalty – this is dismal, and the artist should be smacked for agreeing to such a pitiful rate like the chumps they probably (hypothetically) are.

Take one of our real world iTunes scenarios – from a 79p purchase, iTunes immediately keeps about 32p. For UK and most worldwide sales, this also includes the royalties which the label’s obliged to pay (in the UK, to the MCPS-PRS Alliance). However, the USA requires the selling party to pay the mechanical on each sale (an arse-about-tit form which has arisen from the disconnected Collection Agencies – Harry Fox Agency being the incumbent on Mechanicals and ASCAP, BMI and SESAC on the Performance royalties – which adds yet another level of complication).

From what’s left (47p), you halve the resulting amount on a 50/50 deal. Neither the label nor the artist gets much for their work. On some artists whom we’ve purely done digital distribution for (on a rolling licence agreement), we give the artist 80% of net. As you can imagine, we get virtually nothing – and our income’s directly tied to their success, so we have an interest in seeing them do well. It’s a tough environment to be in.

For receiving US/Canadian/Mexico/European/Australasian payments, we first have to receive the currency and have the bank convert it to GBP. Of course, we can’t get the Interbank rates, nobody but the banks get those – so more money’s immediately lost in conversion. The larger labels will have sweetheart deals with their banks (or almost certainly have accounts in each relevant territory) so this isn’t so much of a big deal, but the amount of administration just scales inordinately. If you deal with managing your artists’ Publishing rights, you can quickly become LITERALLY swamped in paperwork. The amount of time sucked up by adminning the release of music is extraordinary.

So please nobody think all music labels have it easy… I have no doubt that the Big Four have royally shafted artists in the past but they can largely lumber along based on a few artists doing exceptionally well for the rest of their current roster (with their back catalogue from very famous artists helping too). The problem they’re going to have is that almost none of the artists whose catalogue’s been released in the past two decades *really* has the staying power of the classic artists – Dire Straits, Genesis, Pink Floyd, The Who or Fleetwood Mac, just to name five off the top of my head. Don’t even get me started on the epic fail that is streaming revenues from Spotify, mFlow, We7 etc.

Now even with all of this, I still regard sites like YouTube as a promotional tool. Some of our most famous catalogue I’ve held off on issuing DMCA takedowns for, because it’s a genuinely beneficial promotional tool – it’s the pragmatic response. Where do people go first if they want to quickly listen to a track? YouTube! What happens if they only ever wanted to hear it once and never again? You’ve not lost that sale because it almost certainly would never have happened. What happens if they still want to have a copy of that track? They’ll go buy it from one of the easily accessible venues, it’s not expensive to do. The label’s job is to make the catalogue ubiquitous on all of the major (and some of the trendier niche stores) where at all possible. The digital distribution costs are another thing the label has to absorb – monthly, per track, per store usually, if not on an aggregation deal where it’s a percentage on each sale but the label usually ends up worse off. It’s a tough position because the label almost always feels the need to protect their ‘content’ (shudder – hate that word) but issuing takedowns for every instance of a track is more often than not a kneejerk reaction which harms longterm sales. I’m personally torn between leaving them, taking them down or even putting up better mashup/promo mix versions on the label’s official account!

Treat your customers like adults and I think you earn their respect a bit more. This applies to all forms of digital media, including tellybox shows. (thesis: DRM = genuinely unhelpful towards nurturing that unique supportive viewer-provider relationship. Trust your customers, they’ll not disrespect you.) In music, nobody wants to buy a track if they can never audition it, and 30sec samples aren’t really good enough.

Restoring Ubuntu 10.04’s Bootloader, after a Windows 7 Install

I installed Windows 7 after I had installed Ubuntu 10.04. Windows 7 overwrote the Linux bootloader “grub” on my master boot record. Therefore I had to restore it.

I used the Ubuntu 10.04 LiveCD to start up a live version of Ubuntu. While under the LiveCD, I then restored the Grub bootloader by chrooting into my old install, using the linux command line. This is a fairly complex thing to do, and so I recommend you use this approach only if you are confident with the linux command line:

# (as root under Ubuntu's LiveCD)

# prepare chroot directory

mkdir /chroot
d=/chroot

# mount my linux partition

mount /dev/sda1 $d   # my linux partition was installed on my first SATA hard disk, on the first partition (hence sdA1).

# mount system directories inside the new chroot directory

mount -o bind /dev $d/dev
mount -o bind /sys $d/sys
mount -o bind /dev/shm $d/dev/shm
mount -o bind /proc $d/proc

# accomplish the chroot

chroot $d

# proceed to update the grub config file to include the option to boot into my new windows 7 install

update-grub

# install grub with the new configuration options from the config file, to the master boot record on my first hard disk

grub-install /dev/sda

# close down the liveCD instance of linux, and boot from the newly restored grub bootloader


Windows 7 Gaming on my Macbook

I have a 2006/2007 Core 2 Duo 2.6ghz white macbook, that I use regularly for internet, music, watching films, itunes and integration with my iPhone.

I wanted to turn my desktop PC into a ‘work only’ Ubuntu Linux machine, so that I don’t get distracted when I’m supposed to be doing something else.

But I still have a lot of PC games that I wanted to play on the Macbook, so I decided to try and setup a windows environment to play games on using Bootcamp 2.0 to create a dual-boot OSX/Windows 7 configuration.

It turns out it works really well. The Macbook runs Windows 7 64-bit edition fine, and although the integrated graphics card isn’t designed to run modern games very well, you can get a good gaming experience from small indie games and the older type of PC RPGs that I tend to play. My macbook got a 3.5 rating on the windows experience index for graphics, which is sufficient for many PC games.

First you need to partition your macbook’s HD using the Bootcamp assistant, in the OSX utilities section. Make sure you have your first OSX installation DVD to hand, the one that came with your Macbook. I chose to split the hard drive into two equally sized partitions. Then just place your W7 DVD in the drive, and Bootcamp takes care of the rest.

Once W7 is installed, you can access the Bootcamp menu on startup by holding down the option key. This brings up a menu where you can select to boot into OSX or Windows.

When you start W7 for the first time, you can install the windows driver set for your Macbook that Bootcamp provides you. Insert your OSX installation DVD 1, and run the setup.exe that is located in the Bootcamp folder. This will install native windows drivers for your Macbook hardware.

The only change I needed to make for my macbook was to install the latest 64bit Realtek drivers for Vista/Windows 7, which are located on the Realtek website. This will fix any sound problems you might have while playing games.

Now don’t expect to run the latest 3D games, but if you’re happy enough with slightly older, classic, indie or retro games, you can get a good gaming experience on Windows 7 from your macbook. It does well with plenty of the indie games available on Valve’s Steam distribution network.

Ripping Movies onto the iPhone

I’m currently watching Persepolis, the 2008 animated film about a tomboy anarchist growing up in Iran. I’m watching this on my new iPhone 3GS, and the picture and audio quality is very good.

Here’s what I used to convert my newly bought Persepolis DVD, for watching on the iPhone.

1x Macbook (but you can use any intel mac)
1x iTunes
1x RipIt – Commercial Mac DVD Ripper (rips up to 10 DVDs on the free trial, $20 after)
1x Handbrake 32 – Freely available transcoder
1x VLC 32 – Freely available media player
1x DVD

* RipIt – rips the video and audio from the DVD, onto your computer
* Handbrake 32 – ‘transcodes’ the ripped video and audio, meaning – it converts it into an iPhone compatible video file.
* VLC 32 – is used by Handbrake 32 to get past any problems with converting the media.

Go to the following sites to fetch the software:

1. RipIt – http://thelittleappfactory.com/ripit/
2. Handbrake 32 – http://handbrake.fr/downloads.php (get the 32 bit version)
3. VLC 32 – http://www.videolan.org/vlc/download-macosx.html (be sure to get the 32 bit version)

There’s currently a difficulty in getting the VLC 64 bit software for the Mac, and so although the 64 bit version is faster to use, you’re probably better off with 32 bit versions of both for now.

The Process

1) Rip the DVD.

Start RipIt. It will ask for a DVD, so insert the DVD and point the save location to the desktop. The ripping process takes about 40 minutes on my Macbook. You can check the progress by looking at the icon in the dock, which is updated with the percentage completed. You can do other things on your mac while it’s ripping, even though the DVD drive will be occupied. Wait until it’s completed before continuing.

2) Transcode (convert) the ripped video file for use on the iPhone.

Start Handbrake. There are a bunch of transcoding settings called presets – those tell Handbrake what type of media player you want the converted video to work on. In handbrake on the right section of the window, select the iPhone preset. Then go to the file menu, select ‘Open’, and then select the video file that RipIt saved onto your desktop. Then select the destination for the converted video file. Then select the Start (green) button on Handbrake window, and it will start. You can now minimise handbrake and do other things. The transcoding process depends on the film, but takes about an hour on my Macbook. You can check on progress by maximizing the Handbrake window, and checking on the progress bar.

3) Move the converted video file onto your iPhone.

Once that’s done, you will have another media file on your desktop – this is the end result, a video file that will play on your iPhone. Simply connect your iPhone to your Mac, start up iTunes, and drag that file from your desktop onto the iPhone icon in your iTunes window. It will take a couple of minutes to transfer, then eject the iPhone as normal.

Now you can watch this new movie on your iPhone by going to the ‘Videos’ tab of your iPod app.

WordPress HTML edit mode inserts BR tags sometimes when you add a carriage return..

This is something that was quite annoying today, as I was struggling to use WordPress 2.9.2 to align some pictures in the HTML mode of editing a page, on a client’s website.

It turns out that WordPress was adding BR tags sometimes when I hit return.. and sometimes not. The annoying thing was, although the BRs were outputted in the resultant WordPress site, the BRs were not visible in the WordPress HTML edit mode itself.. meaning they were invisible and undetectable until I viewed the resultant website source and finally figured it out.

WordPress does insert some formatting tags now and then, it seems, but I would have thought it would tell you about the tags that would change the page layout! Apparently not. Anyway, something to be aware of for WordPress gurus..


I don’t have time to report this as a bug, but this is the stack I’m using for anyone interested:

Browser: Google Chrome for Mac (5.0.342.9 beta)
TinyMCE Advanced Editor Plugin for WP (3.2.7)
Wordpress 2.9.2

The beta of Google Chrome is a bit unstable, although it may not be the source of the problem.

Forkbombs and How to Prevent Them

A forkbomb is a program or script that continually creates new copies of itself, which in turn create new copies of themselves. It’s usually a function that calls itself, and each time that function is called, it creates a new process to run the same function.

You end up with thousands of processes, all creating processes themselves, with exponential growth. Soon it takes up all the resources of your server, and prevents anything else running on it.

Forkbombs are an example of a denial of service attack, because they completely lock up the server they’re run on.

More worryingly, on a lot of Linux distributions, you can run a forkbomb as any user that has an account on that server. So for example, if you give your friend an account on your server, he can crash it/lock it up whenever he wants to, with the following shell script forkbomb:

:(){ :|:& };:

Bad, huh?

Ubuntu server 9.10 is vulnerable to this shell script forkbomb. Run it on your linux server as any user, and it will lock it up.
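
The one-liner is easier to follow when the function is given a readable name. The sketch below is the same construct with ':' renamed to bomb (a name I've picked purely for illustration). It only defines the function; the call that would actually set it off is left commented out on purpose.

```shell
# Same shape as :(){ :|:& };: but with a readable function name.
bomb() {
    # Each call starts two more copies of the function, connected by a pipe,
    # and puts them in the background so the caller returns immediately.
    bomb | bomb &
}

# bomb    # <- uncommenting this line would start the fork bomb. Don't, unless it's a test box.
```

Each copy starts two more, which is where the exponential growth comes from.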

This is something I wanted to fix right away on all my linux servers. Linux is meant to be multiuser, and it has a secure and structured permissions system allowing dozens of users to log in and do their work, at the same time. However when any one user can lock up the entire server, this is not good for a multiuser environment.

Fortunately, fixing this on ubuntu server 9.10 is quite simple. You limit the maximum number of running processes that any user can create. So the fork bomb runs, but hits this ceiling, and eventually stops without the administrator having to do anything.

As root, edit this file, and add the following line:


*               soft    nproc   35

This caps the number of processes for every user at 35. The root user isn’t affected by this limit. This limit of 35 should be fine for remote servers that are not offering users gnome, kde, or any other graphical X interface. If you are expecting your users to be able to run applications like that, you may want to increase the limit to 50, and although this will increase the time forkbombs will take to exit, they should still exit without locking up your server. Bear in mind that a soft limit can be raised by the user themselves, up to the hard limit, so if you need to stop someone who is deliberately working around it, set a hard limit as well (use ‘hard’ in place of ‘soft’).

Alternatively, you can set up ‘untrusted’ and ‘trusted’ user groups, assign that 35 limit to the untrusted group, and put trusted users in the trusted group, which gets a higher limit. Use these lines:


@untrusted               soft    nproc   35
@trusted               soft    nproc   50

I’ve tested these nproc limits on 8.10 and 9.10 ubuntu-server installs, but you should really test your own server’s install, if possible, by forkbombing it yourself as a standard user, using the bash forkbomb above, once you’ve applied the fix. The fix is effective as soon as you’ve edited that file, but please note that you have to log out, and log back in again as a standard user, before the new process cap is applied to your user account.

How to remove nano, vim and other editors’ backup files out of a directory tree

gardening for science..

Linux command-line editors such as nano and vim often, by default, create backup files with the suffix “~”. For example, if I created a file called /home/david/myfile, then nano would create a backup in /home/david/myfile~. Sometimes it doesn’t delete them either, so you’re left with a bunch of backup files all over the place, especially if you’re editing a lot on a directory tree full of source code.

Those stray backup files make directory listings confusing, and also add unnecessary weight to the commits on source control systems such as svn, cvs, git.. etc. If you’re working on a programming team with other people, then it causes further problems and confusion, because person A’s editor can accidentally load person B’s backup file.. etc etc. Nightmare.

So instruct your editor, or the programming team you’re working with, not to drop these backup files. You can configure most editors to change the place where the editor drops its backup files, so you could store all your backup files in a subdirectory of your home directory, for example, if needed. However I always set my editors not to leave backup files about.

Once you know that new backup files will not be created, view the current list of backup files, along with the user that created them.. so you know who’s been creating the backup files and when, etc:

find . -name '*~' -type f -exec ls -al {} \;

Then archive the stray backup files with these commands (the directory has to exist before mv can move files into it):

mkdir -p ./archived-backups
find . -name '*~' -type f -exec mv -i {} ./archived-backups \;

That will find all backup files in the current directory and below, and move them all to a subdirectory in the current directory called ‘archived-backups’. This is a fairly safe find/exec command, because with the -i switch, mv will not ‘clobber’. This means that if you have two backup files, one in /opt/code/index~ and one in /opt/code/bla/bla/index~, they will not ‘clobber’, or overwrite each other automatically when moved into the new directory. You will be informed of any conflicts present so you can resolve them yourself.

However in practice I usually omit the ‘-i’ switch and let them clobber each other, because I usually end up deleting the ./archived-backups/ directory very quickly after that anyway.
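
If you run the archive step more than once, find will also walk into ./archived-backups and try to move files that are already there. Here is a minimal sketch of a variant that skips that directory with -prune. It runs in a throwaway directory, and the file names in it are made up for the demo.

```shell
# Work in a scratch directory so the demo touches nothing real.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Hypothetical source tree with two stray backup files and one real file.
mkdir -p src/lib
touch src/index~ src/lib/util~ src/keep.txt

# The archive directory has to exist before anything is moved into it.
mkdir -p archived-backups

# -prune stops find descending into the archive directory itself.
find . -path ./archived-backups -prune -o -name '*~' -type f \
    -exec mv -i {} ./archived-backups/ \;

ls archived-backups
```

The trailing slash on ./archived-backups/ makes mv fail loudly if the directory is missing, instead of quietly renaming a backup file to that name.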

Tip for watching the completion of a large file copy

Forget the wonderful windows progress bar, and imagine I’m in the world of command-line Linux, and I want to copy a 484MB file, called VMware-server-2.0.2-203138.i386.tar.gz, from my home directory to a remote server. But I want to figure out how long it’s going to take.

1. First I can run a “du -m” command to get the total MB size of the original file:

du -m /home/david/VMware-server-2.0.2-203138.i386.tar.gz


david@believe:~$ du -m VMware-server-2.0.2-203138.i386.tar.gz
484 VMware-server-2.0.2-203138.i386.tar.gz

Now I know it is approximately 484MB.

2. Then I run the copy. I’m copying the file from /home/david/ to /opt/remote/myserver, which is a remotely mounted directory on a server somewhere in Canada.

david@believe:~$ cp ./VMware-server-2.0.2-203138.i386.tar.gz /opt/remote/myserver/

At this point cp will just hang until it’s finished. There is normally no progress indicator or anything. But I want to figure out how much of the file has been copied, so I can figure out how much is left to copy, and get a rough idea of the progress.

3. So I SSH into the remote server in Canada, and run this command

david@myserver:~$ watch du -m ./VMware-server-2.0.2-203138.i386.tar.gz

The copy command by default seems to be incremental, ie: piece by piece, not all at once. Therefore with the ‘watch’ command, you can watch the size, in MB, of the new file as it accumulates. The watch command will refresh every 2 seconds, so you’ll be updated as the copy goes on.

You can probably invoke a progress meter with the cp command, or use rsync. Rsync is much better for large file copies, and remote file copies. But the advantage of the method above is that you can watch file copies already executed without any special arguments, which I sometimes find very useful when I remember that that file I already started copying isn’t 200MB.. it’s actually 2.5GB.
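
To turn the two du figures into a rough percentage, a small helper like the one below would do. It's only a sketch: percent_copied is a name I've made up, and it uses whole-number shell arithmetic, so the result is rounded down.

```shell
# percent_copied TOTAL_MB COPIED_MB
# Prints how much of the file has arrived, as a whole-number percentage.
percent_copied() {
    echo $(( $2 * 100 / $1 ))
}

# 121MB copied out of the 484MB total from the example above
percent_copied 484 121   # prints 25
```

On the remote server you could feed it the live figure with something like: percent_copied 484 "$(du -m ./VMware-server-2.0.2-203138.i386.tar.gz | cut -f1)"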

The Linux Root Directory, Explained

It’s helpful to know the basic filesystem on a Linux machine, to better understand where everything is supposed to go, and where you should start looking if you want to find a certain file.

Everything in Linux is stored in the “root directory”. On a windows machine, that would be equivalent to C:\. C:\ is the main folder where everything is stored. On Linux we call this the “root directory”, or simply “/”. To go up to this root directory, type:

cd /

To list all the folders and files in the root directory, type this:

ls /

Alternatively, if you want to see the folders and files exactly the way I see them below for easy comparison, type this:

ls -lhaFtr --color /

Once you’ve typed in one of the ‘ls’ commands above, you’ll see some information similar to that on the screenshot below.. (please scroll down)..

[Screenshot: Ubuntu Linux root directory listing]

Above you can see the files and folders in the root directory of my ubuntu linux server, after I’ve typed ‘ls /’. Ignore everything but the coloured names on the right, those coloured names are the names of the files and folders in this directory. Don’t worry about the shades of different colours either. It’s not really important to explain how they are coloured right now, just to explain the purpose behind each file or folder shown.

So let me explain the purpose behind each of these, in turn. I’ll include the same screenshot multiple times, so you can reference the explanations against it as you scroll down.


– Directory for linux security features, rarely visited by normal users like you or me.


– Traditional directory for the files from removable media, ie USB keys, external hard drives. Not used anymore, it only exists for historical purposes.


– Directory where files and directories end up when they’ve been recovered from a hard disc repair.

 cdrom -> media/cdrom/

– Link to the files currently in your CDROM or DVDROM drive.


– New style directory for the files from removable media such as USB keys, external hard drives, etc. This is the new convention, and so you should always use media/ instead of mnt/, above.

vmlinuz.old -> boot/vmlinuz-2.6.31-17-generic

– A backup of your most recent old Linux operating system kernel, ie: your operating system. Don’t delete this =)

initrd.img.old -> boot/initrd.img-2.6.31-17-generic

– Another part of the backup for your most recent old Linux kernel.


– An empty directory reserved for you to put third-party programs and software in.


– Operating system drivers and kernel modules live here. Also contains all system libraries, so when you compile a new program from the source code, it will use the existing code libraries stored here.


– Basic commands that everyone uses, like “ls” and “cd”, live here.


– This is where all user-supplied software should go; ie: software that you install that doesn’t normally come with the operating system. Put all programs here.


– Basic but essential system administration commands that only the admin user uses, ie: reboot, poweroff, etc.

vmlinuz -> boot/vmlinuz-2.6.31-20-generic

– Your actual operating system kernel, ie: the one that is running right now. Don’t delete this.

initrd.img -> boot/initrd.img-2.6.31-20-generic

– Another part of the kernel that is running right now.


– Reserved for Linux kernel files, and other things that need to be loaded on bootup. Don’t touch these.


– Proc is a handy way of accessing critical operating system information, through a bunch of files. Ie: try typing ‘cat /proc/cpuinfo’. That queries the current kernel for the information on your processors (CPUs), and returns the info for you in a text file.


– Like proc/, this is another bunch of files that aren’t files at all, but ‘fake’ files. When you access them, the operating system goes away and finds out information, and offers that information up as a text file to you.


– Device files. In here live the device files for your hard drives, your CD/DVD drives, your soundcard, your network card.. in fact anything you have installed that Linux uses, it has a counterpart in here that is automatically added and removed by the OS. Don’t ever delete, move or rename any of the files here.


– The directory that you’ll use the most. Every user on your Linux machine, except the system administrator, has a folder here. This is where each user is meant to store all their documents. Think of it as the Linux ‘My Documents’ folder.


– This is a catch-all directory for ‘variables’, ie things that the OS has to write to, and vary, as part of its operation. Examples include: email inboxes for all users, cache files, the lock files that are generated and removed as part of normal program execution, and also the /var/www directory. /var/www is a directory you will probably see and use a lot, as it is where all the websites are stored that your linux machine serves when operating as a web server. /var/log is also a very important directory, and contains ‘log’ files, which are a kind of “diary” that the linux OS uses to explain exactly what it’s done, as it happens, so you can easily find out what’s been going on by viewing the right log file.


– The space for any and all temporary files. Store files here that you want to throw away quite quickly. Depending on your configuration, all files and folders in the /tmp directory may be deleted on system reboot, or more frequently, perhaps every day.


– This is the system administrator’s ‘my documents’ folder. Anything that the sysadmin stores, for example: programs that he downloads, is put here. Not accessible to anyone else but the system administrator.


– Configuration files. Any and all program configuration files or information belong here. Think of it like the windows registry, except every registry entry is a text file that you can open up and edit, and also copy, move around, and save. You will typically have to create configuration files yourself sometimes, and put them in this directory. They are almost always simple text files.

And that’s a basic overview of the files and folders in the root directory of your linux machine.
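
As a quick illustration of the ‘fake’ files idea, here is a small sketch that reads another file under /proc, in the same way as the ‘cat /proc/cpuinfo’ example. It assumes a Linux machine, where /proc/meminfo has a MemTotal line reported in kB.

```shell
# Pull the total memory figure (in kB) out of /proc/meminfo.
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)

# Convert to MB with whole-number arithmetic and print it.
echo "Total memory: $((mem_kb / 1024)) MB"
```

Nothing is stored on disk here; the kernel generates the contents each time the file is read.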

Useful OSX commands for Linux users

I wrote this list to remind me, as a newcomer to OSX, how the command line differed from the Linux commandline. I thought I’d expand on it, and share it:

To mount any iso:

hdiutil mount sample.iso

To download a file as you would using wget:

curl http://ftp.heanet.ie/pub/linuxmint.com/stable/8/LinuxMint-8.iso -o linuxmint.iso -C -

the -o specifies the output file (required)
the -C - tells curl to resume the download automatically if possible.

To burn a bootable iso to CD, DVD or USB key:

use the “diskutil” program as described in: http://forums.macrumors.com/showthread.php?t=598291

Monitor disk io utilisation.. poll once per second

iostat -c 99999

will run until 99999 seconds have passed.

Monitor CPU and memory utilisation.. polling per second


Just like Linux.

Mount Windows Shares

mount -t smbfs //<username>@<server>/<share> <mount point>


mount -t smbfs //davec@SERVER/Dev samba-to-netdev

Then it will appear mounted in /Volumes with the mount point name you supplied, ie: /Volumes/samba-to-netdev/.

Long Bash History Files are Great.

When I’m installing software, or doing some complicated stuff on the linux command line, which nowadays is pretty much all the time, I will sometimes want to remember exactly what I typed.

Now the normal /home/david/.bash_history file is usually fine for that. Run this command, for example, and you will see the commands you typed in before you logged out of the server last time you used it:

cat ~/.bash_history

You can also find out what you typed in this session, ie: since you logged in, by typing this:

history

This is great, and it’s even more useful if you add a grep pipeline, so you can search through the previous commands you typed in for a particular phrase or command, ie:

history | grep apt-get

However what I really want nowadays is an almost infinite bash_history file, so I can find out not just what I did last week, but two weeks ago, or last month or perhaps last year. Now there are obvious security risks involved with this, and to make sure you don’t accidentally store mistyped passwords to other systems, or other things, you should probably make sure you never type them in on the command line. This is good practice anyway, and since I use key-based SSH logins exclusively nowadays, there is not much chance of me tripping up, typing a password into the terminal, and then forgetting about it. In theory however, using long/infinite bash_history files does mean that if anyone compromised your shell account, they’d have any passwords to systems that you mistyped.

So I’m careful with this. You can also clear your history file quite quickly if you do accidentally find you’ve messed up. Log out, log back in again, and just do this:

echo  > ~/.bash_history

Then that will delete all the previous logged commands.

Apart from serving as a major memory aid to complicated install work, and a log for those increasingly complicated chained, piped, one-liners that I’m fond of but only really want to have to type once, there are other benefits to keeping a large bash_history file. The main one is that it makes it easy to convert your previous commands into a handy shell script or two, which you can set to run at a specific time of day via cron.. or even make into a system-wide command for other users to use.

OK so hopefully I’ve convinced you that it can be very useful to have a long, persistent, bash_history file. But how do you configure the shell so that it does this for you? The following are the magic customization lines that I use on my personal desktops, laptops, and any other trusted computers that I think are reasonably free from the risk of people hacking in just to retrieve my .bash_history file:

## bash history db
# remember up to 20,000 commands in the session (HISTSIZE)
# and keep up to 20,000 lines in the history file (HISTFILESIZE)
export HISTSIZE=20000
export HISTFILESIZE=20000
# append all commands to the history file, don't overwrite it at the start of every new session
shopt -s histappend

The above will give you an (almost) infinite bash_history file. It will start deleting old commands at 20,000 lines, ie: 20,000 commands. Make sure you have enough disk space for that. My .bash_history file is currently at around 200KB, not a huge file by any means. I’d say it will grow to 400-600KB max. If you want to calculate approximately how much it will use, then in bytes, it’s the number of characters in your average linux command x 20,000.
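
That size estimate can be worked out from an existing history file. The function below is a rough sketch (estimate_history_kb is a name I've made up): it averages the characters per line, adds one for each newline, and multiplies by the line cap, which defaults to the 20,000 used above.

```shell
# estimate_history_kb FILE [LINE_CAP]
# Prints the estimated size in KB of a history file that has grown to LINE_CAP lines.
estimate_history_kb() {
    awk -v cap="${2:-20000}" '
        { bytes += length($0) + 1 }     # characters in the command, plus the newline
        END {
            if (NR > 0) printf "%.0f\n", bytes / NR * cap / 1024
        }
    ' "$1"
}

# Example call, assuming the history file already exists:
# estimate_history_kb ~/.bash_history
```

With an average command of around 20 to 30 characters, that works out to roughly 400-600KB at 20,000 lines, which matches the figure above.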

My minimal VIM config

This is the absolute minimum I do when I have to log onto a new server or shell account that I haven’t used before, that I will need to edit text files with.

First I figure out whether VIM is really installed. A lot of installs, especially those based on ubuntu, ship with VI aliased to VIM, but the VIM install is usually not really VIM at all, and behaves exactly like VI but with some minor bugs fixed. This is not what I want.

So first I figure out what distribution of linux I’m using through executing the following command:

cat /etc/issue

Then if it’s ubuntu, which doesn’t ship with the full VIM package on a lot of default installs, then I usually do this, presuming I have admin access. In practice I usually have admin access because people are generous with this when they want you to fix their server =) Anyway, if I have admin access, I install ubuntu’s ‘vim full’ package, which is aliased as ‘vim’:

sudo apt-get install vim

Now I can move onto my config. Occasionally there will be a global system config, but I probably want to override that anyway. So I create a vim configuration file specific to me in my home directory (~/.vimrc):

set bg=dark
set backspace=2

The first line sets the background to be dark, so I can see what is going on when I use a dark terminal program, such as putty, mac osx’s terminal.. in fact nearly all terminal programs use a dark background, so this setting is almost compulsory.

The second line configures the behaviour of the backspace key, so when I go to the start of a line, and press backspace, it adopts the conventional wordprocessor behaviour of skipping to the line above. Otherwise it uses the default VI behaviour, which is probably not intuitive at all to anyone who didn’t grow up on UNIX mainframes and such.

The very existence of a user-supplied configuration file will also jolt the VIM editor into ‘non compatible mode’, where it figures out automatically that it should be doing all the advanced VIM things, instead of just acting as a VI replacement. This should mean that if you create a config file, syntax highlighting is already turned on, another must for me. Otherwise you can explicitly set it with the line ‘syntax on’, but I never have to do this anymore.

And that’s it.

Using the Linux command ‘Watch’ to test Cron jobs and more

OK, so you have added a cron job that you want to perform a routine task every day at 6am. How do you test it?

You probably don’t want to spend all night waiting for it to execute, and there’s every chance that when it does execute, you won’t be able to find out whether it is executing properly – the task might take 30 minutes to run, for example. So every time you debug it and want to test it again, you have to wait until 6am the following day.

So instead, configure that cron job to run a bit earlier than that, say in 10 minutes, and monitor the execution with a ‘watch’ command, so you can see if it’s doing what you want it to.
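
Working out the minute and hour fields for ‘10 minutes from now’ by hand gets tedious when you’re re-testing repeatedly. Here’s a small sketch that prints a crontab line for you to paste in. It assumes GNU date (the -d option), and /path/to/myjob.sh is just a placeholder for whatever your cron job runs.

```shell
# Minute and hour, 10 minutes from now, without leading zeros (GNU date syntax).
when=$(date -d '+10 minutes' '+%-M %-H')

# /path/to/myjob.sh is a placeholder for the script the cron job should run.
line="$when * * * /path/to/myjob.sh"
echo "$line"
```

Remember to put the schedule back to 6am once you’re happy with the test run.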

‘watch’ is a great command that will run a command at frequent intervals, by default, every 2 seconds. It’s very useful when chained with the ‘ps’ command, like the following:

watch 'ps aux | grep bash'

What that command will do is continually monitor your server, and maintain an updated list, refreshed every 2 seconds, of every instance of the bash shell. When someone logs in and spawns a new bash shell, you’ll know about it. When a cron’d command runs that invokes a bash shell before executing a shellscript, you’ll know about it. When someone writes a badly written shell script, and runs it invoking about 100 bash shells by accident, flooding your server’s memory, you’ll know about it.

OK so back to the cron example. Suppose I’m testing a cronjob that should invoke a shell script that runs an rsync command. I just set the cron job to run in 5 minutes, then run this command:

watch 'ps aux | grep rsync'

Here is the result.. every single rsync command that is running on my server is displayed, and the list is updated every 2 seconds:

Every 2.0s: ps aux | grep rsync                                              Sat Mar 13 15:59:35 2010

root     16026  0.0  0.0   1752   480 ?        Ss   15:28   0:00 /bin/sh -c /opt/remote/rsync-matt/cr
root     16027  0.0  0.0   1752   488 ?        S    15:28   0:00 /bin/sh /opt/remote/rsync-matt/crond
root     16032  0.0  0.1   3632  1176 ?        S    15:28   0:00 rsync -avvz --remove-source-files -P
root     16033  0.5  0.4   7308  4436 ?        R    15:28   0:09 ssh -l david someotherhost rsync --se
root     16045  0.4  0.1   4152  1244 ?        S    15:28   0:07 rsync -avvz --remove-source-files -P
root     18184  0.0  0.1   3176  1000 pts/2    R+   15:59   0:00 watch ps aux | grep rsync
root     18197  0.0  0.0   3176   296 pts/2    S+   15:59   0:00 watch ps aux | grep rsync
root     18198  0.0  0.0   1752   484 pts/2    S+   15:59   0:00 sh -c ps aux | grep rsync

Now I can see the time ticking away, and when the cron job is run, I can watch in real-time as it invokes rsync, and I can keep monitoring it to make sure all is running smoothly. This proves to be very useful when troubleshooting cron jobs.
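
Notice that the listing above also contains the watch, sh and grep processes themselves, because their own command lines contain the word ‘rsync’. A common trick is to put one letter of the pattern in square brackets. The sketch below shows why it works, using two made-up lines in place of real ps output.

```shell
# Simulated ps output: one real rsync process, plus the grep that is looking for it.
sample='root 16032 rsync -avvz --remove-source-files
root 18198 grep [r]sync'

# The pattern [r]sync still matches the text "rsync", but the grep process's own
# command line now reads "[r]sync", which does not contain "rsync", so it is skipped.
matches=$(printf '%s\n' "$sample" | grep -c '[r]sync')
echo "$matches"   # prints 1
```

In practice that would be: watch "ps aux | grep '[r]sync'"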

You can also run two commands at the same time. You can actually tail a log file and combine it with the process monitoring like so:

watch 'tail /var/log/messages && ps aux | grep rsync'

Try this yourself. It constantly prints out the last ten lines of the standard messages log file every two seconds, while monitoring the number of rsync processes running, and the commands used to invoke them. Tailor it to the cron’d job you wish to test.

Watch can be used to keep an eye on other things also. If you’re running a multi-user server and you want to see who’s logged on at any one time, you can run this command:

watch 'echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never'

This chains 5 commands together. It will keep you updated with the current list of users logged in to your system, and it will also give you a constantly updated list of those users who have ever logged in before, with their last login time.

The following shows the output of that command above on a multi-user server I administrate, and will refresh with current information every 2 seconds until I exit it:

Every 2.0s: echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never                                                             Sat Mar 13 07:48:32 2010

mark     tty1         2010-02-23 11:08
david    pts/2        2010-03-13 07:48 (wherever)
mike     pts/4        2010-02-26 07:53 (wherever)
mike     pts/5        2010-02-26 07:53 (wherever)

Username         Port     From           Latest
mark               pts/6    wherever      Thu Mar 11 23:24:36 -0800 2010
mike               pts/0    wherever      Sat Mar 13 03:54:28 -0800 2010
dan                pts/4    wherever      Fri Jan  1 08:46:29 -0800 2010
sam                pts/1    wherever      Sat Jan 30 08:06:01 -0800 2010
rei                pts/2    wherever      Thu Dec 10 11:45:39 -0800 2009
david              pts/2    wherever      Sat Mar 13 07:48:05 -0800 2010

This shows that mark, david and mike are currently logged on. Mark is logged in on the server’s physical monitor and keyboard (tty1). Everyone else is logged in remotely. Mike currently has two connections, or sessions, on the server. We can also see the list of users that have logged in before – ie: are active users, and when they last logged on. I immediately notice, for example, that rei hasn’t logged in for 3 months and probably isn’t using her account.

(Normally this command will also provide IP addresses and hostnames of where the users have logged on from, but I’ve replaced those with ‘wherever’ for privacy reasons)

So.. you can see that the ‘watch’ command can be a useful window into what is happening, in real-time, on your servers.

Changing the default “From:” email address for emails sent via PHP on Linux

I’ve had to solve this problem a couple of times at least, and it’s quite a common task, so I thought I’d document it here.

When you send emails to users of your site through using the PHP mail() function, they will sometimes turn up in the mailbox of customers of your site with the following from address:

From: Root <root@apache.ecommercecompany.com>

This makes absolutely no sense to your customers, and often they will think it is spam and delete it. Often, the decision will be made for them by their web mail host, such as hotmail.com or googlemail.com, and they will never even see the email. You don’t want this to happen.

Writing email templates that appear “trustworthy” and have a low chance of being mislabelled as spam by the webmail companies is quite a difficult task, and there’s quite a bit to know about it. However it is quite easy to change the default “From:” email address that PHP sends your emails from, and that will definitely help.

Assuming you’re running a linux server using sendmail, all you have to do is this.

First create an email address that you want the customers to see, by editing the /etc/aliases file and running the command newaliases. I created an email address called customer-emails@ecommercecompany.com.
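
As a rough illustration of the aliases step, here is a minimal shell sketch. The helper name add_alias and the target mailbox webmaster are assumptions made up for the example; point the alias at whichever local account should actually receive the mail.

```shell
#!/bin/sh
# add_alias <aliases-file> <alias> <target>
# Appends "alias: target" to the aliases file unless the alias is already defined.
# (add_alias is a hypothetical helper name, not a standard command.)
add_alias() {
    aliases_file=$1
    alias_name=$2
    target=$3
    if grep -q "^${alias_name}:" "$aliases_file"; then
        echo "alias ${alias_name} already exists" >&2
        return 0
    fi
    printf '%s: %s\n' "$alias_name" "$target" >> "$aliases_file"
}
```

Run as root, something like "add_alias /etc/aliases customer-emails webmaster && newaliases" would add the entry and rebuild the alias database.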

Then change the following sendmail_path line in your php.ini file to something like this:

sendmail_path = /usr/sbin/sendmail -t -i -F 'customer-emails' -f 'Customer Emails <customer-emails@ecommercecompany.com>'

Broken down, those extra options are:
-F 'customer-emails' # the full name of the sender
-f 'Customer Emails <customer-emails@ecommercecompany.com>' # the sender address used for the From header; the name should match the email address, and it should be the same email address you created above

Then restart apache, and it should load the php.ini file changes. Test it by sending a couple of emails to your email address, and you should see emails sent out like this:

From: Customer Emails <customer-emails@ecommercecompany.com>
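
If the new From address doesn’t show up, it is worth checking which sendmail_path line is actually active. A minimal sketch, assuming a POSIX shell; show_sendmail_path is a hypothetical helper name, and the location of php.ini varies between distributions.

```shell
#!/bin/sh
# show_sendmail_path <php.ini>
# Prints the uncommented sendmail_path line(s) from the given php.ini.
# Lines commented out with ';' are ignored.
# (show_sendmail_path is a hypothetical helper name.)
show_sendmail_path() {
    grep -E '^[[:space:]]*sendmail_path[[:space:]]*=' "$1"
}
```

For example, "show_sendmail_path /etc/php.ini" (the path is an assumption). Running "php -i | grep sendmail_path" shows the value the command line PHP has loaded, which may come from a different php.ini than the one Apache uses.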

Shell scripts for converting between Unix and Windows text file formats

I’ve been using these shell scripts I wrote to convert between unix and windows text file formats. They seem to work well without any problems. If you put them in the /usr/sbin/ directory, they will be accessible on the path of the linux admin account root.

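# Converts a unix text file to a windows text file.
# usage: unix2win <text file to convert>
# requirements: sed version 4.2 or later, check with sed --version
# Appends a carriage return (\r) to the end of every line, in place.
sed -i -e 's/$/\r/' "$1"

# Converts a windows text file to a unix text file.
# usage: win2unix <text file to convert>
# Strips carriage returns (octal 015). Writes to a temporary file first,
# because reading and writing the same file in one pipeline can truncate it.
tr -d '\015' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"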

I use these scripts in combination with find and xargs to convert lots of log files into windows format with the following command. However, this type of command can be dangerous, so don’t use it if you don’t know what you’re doing:

find sync-logs/ -name '*.log' -type f | xargs -n1 unix2win
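
One way to make a bulk conversion a bit less dangerous is to check first which files still have unix line endings, so nothing gets converted twice. A minimal sketch, assuming a grep that supports -L (GNU and BSD grep do); the function names has_crlf and list_unix_logs are made up for this example.

```shell
#!/bin/sh
CR=$(printf '\r')   # a literal carriage return character

# has_crlf <file>: succeeds if the file contains at least one carriage return
has_crlf() {
    grep -q "$CR" "$1"
}

# list_unix_logs <dir>: prints the *.log files under <dir> that contain
# no carriage returns, i.e. the ones still in unix format
list_unix_logs() {
    find "$1" -name '*.log' -type f -exec grep -L "$CR" {} +
}
```

Something like "list_unix_logs sync-logs/ | xargs -n1 unix2win" would then only touch the files that still need converting. Like the original command, this assumes the file names contain no spaces.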

Site Redesign

I’ve just updated the design of this blog, re-enabled comments and added a contact tab. I’ve installed a strong anti-spam comment filter, but you should now be able to comment on entries. I’ve also changed the layout of things slightly, and made it easier to read.

PHP Sample – HTML Page Fetcher and Parser

Back in 2008, I wrote a PHP class that fetched an arbitrary URL, parsed it, and converted it into a PHP object with different attributes for the different elements of the page. I recently updated it and sent it along to a company that wanted a programming example to show I could code in PHP.

I thought someone may well find a use for it – I’ve used the class in several different web scraping applications, and I found it handy. From the readme:

This is a class I wrote back in 2008 to help me pull down and parse HTML pages. I updated it on
14/01/10 to print the results in a nicer way to the commandline.

- David Craddock (contact@davidcraddock.net)


It uses CURL to pull down a page from a URL, and sorts it into a 'Page' object
which has different attributes for the different HTML properties of the page
structure. By default it will also print the page object's properties neatly
onto the commandline as part of its unit test.


* README.txt - this file
* page.php - The PHP Class
* LIB_http.php - a lightweight external library that I used. It is just a very light wrapper around CURL's HTTP functions.
* expected-result.txt - output of the unit tests on my development machine
* curl-cookie-jar.txt - this file will be created when you run the page.php's unit test


You will need CURL installed, PHP's DOMXPATH functions available, and the PHP 
command line interface. It was tested on PHP5 on OSX.


Use the php commandline executable to run the page.php unit tests. IE:
$ php page.php

You should see a bunch of information being printed out, you can use:
$ php page.php > result.txt

That will output the info to result.txt so you can read it at will.

Here’s an example of one of the unit tests, which fetches this frontpage and parses it:

*** Page Print of http://www.davidcraddock.net ***

** Transfer Status
+ URL Retrieved:
+ CURL Fetch Status:
    [url] => http://www.davidcraddock.net
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 237
    [request_size] => 175
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 1.490972
    [namelookup_time] => 5.3E-5
    [connect_time] => 0.175803
    [pretransfer_time] => 0.175812
    [size_upload] => 0
    [size_download] => 30416
    [speed_download] => 20400
    [speed_upload] => 0
    [download_content_length] => 30416
    [upload_content_length] => 0
    [starttransfer_time] => 0.714943
    [redirect_time] => 0

** Header
+ Title: Random Eye Movement  
+ Meta Desc:
Not Set
+ Meta Keywords:
Not Set
+ Meta Robots:
Not Set
** Flags
+ Has Frames?:
+ Has body content been parsed?:

** Non Html Tags
+ Tags scanned for:
Tag Type: script tags processed: 4
Tag Type: embed tags processed: 1
Tag Type: style tags processed: 0

+ Tag contents:
    [ script ] => Array
            [0] => Array
                    [src] => http://www.davidcraddock.net/wp-content/themes/this-just-in/js/ThemeJS.js
                    [type] => 
                    [isinline] => 
                    [content] => 

            [1] => Array
                    [src] => http://www.davidcraddock.net/wp-content/plugins/lifestream/lifestream.js
                    [type] => text/javascript
                    [isinline] => 
                    [content] => 

            [2] => Array
                    [src] => 
                    [type] => 
                    [isinline] => 1
                    [content] => 
                 var odesk_widgets_width = 340;
                var odesk_widgets_height = 230;

            [3] => Array
                    [src] => http://www.odesk.com/widgets/v1/providers/large/~~8f250a5e32c8d3fa.js
                    [type] => 
                    [isinline] => 
                    [content] => 

            [count] => 4

    [ embed ] => Array
            [0] => Array
                    [src] => http://www.youtube-nocookie.com/v/Fpm0m6bVfrM&hl=en&fs=1&rel=0
                    [type] => application/x-shockwave-flash
                    [isinline] => 
                    [content] => 

            [count] => 1

    [ style ] => Array
            [count] => 0


*** Page Print of http://www.davidcraddock.net Finished ***
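
For a rough idea of what the “tags processed” counts above mean, here is a small shell sketch that counts opening tags in a saved copy of a page. It is not how the PHP class works (that uses CURL and DOMXPath); count_tags is a hypothetical helper, and it assumes a grep that supports -o, such as GNU grep.

```shell
#!/bin/sh
# count_tags <html-file> <tag-name>
# Prints how many opening tags of the given name appear in the file,
# ignoring case. Closing tags such as </script> are not counted.
# (count_tags is a hypothetical helper name.)
count_tags() {
    grep -o -i "<$2[ >]" "$1" | wc -l | tr -d ' '
}
```

For example, "count_tags saved-page.html script" prints the number of script tags in saved-page.html (the file name is just an example).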

If you want to download a copy, the file is below. If you find it useful, a pingback would be appreciated.

Config files for the Windows version of VIM

Today I encountered problems configuring the windows version of the popular text editor VIM, so I thought I’d write up a quick post about configuration files under the Windows version, in case anyone becomes stuck like I did. I use Linux, OSX and Windows on a day-to-day basis, and VIM as a text editor for a lot of quick edits on all three platforms. Here’s a quick comparison:


Linux is easy because that’s what most people who use VIM run, and so it is very well tested.

~/.vimrc – Configuration file for command line vim.
~/.gvimrc – Configuration file for gui vim.


OSX is simple also, as it’s based on unix:

~/.vimrc – Configuration file for command line vim.
~/.gvimrc – Configuration file for gui vim.


Windows is not easy at all.. it doesn’t have a unix file structure, and doesn’t have support for the unix hidden file names, that start with a ‘.’, ie: ‘.vimrc’, ‘.bashrc’, and so on. Most open-source programs like VIM that require these hidden configuration files, and have been ported over to windows, seem to adopt this naming convention: ‘_vimrc’, ‘_bashrc’.. and so forth. So:

_vimrc – Configuration file for command line vim.
_gvimrc – Configuration file for gui vim.

Renaming configuration files from “.” to “_” wouldn’t make much difference on its own. You’d have to rename your files, but.. big deal. It’s not much of a problem.
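
If you keep your configuration on a Linux or OSX machine, the renaming can be scripted before you copy the files across. A minimal sketch, assuming a POSIX shell; copy_vim_config is a hypothetical helper name and the directories are whatever suits your setup.

```shell
#!/bin/sh
# copy_vim_config <source-dir> <dest-dir>
# Copies .vimrc and .gvimrc from <source-dir> into <dest-dir>,
# renaming them to _vimrc and _gvimrc on the way.
# Files that do not exist in <source-dir> are skipped.
# (copy_vim_config is a hypothetical helper name.)
copy_vim_config() {
    for name in vimrc gvimrc; do
        if [ -f "$1/.$name" ]; then
            cp "$1/.$name" "$2/_$name"
        fi
    done
}
```

For example, "copy_vim_config ~ /tmp/vim-for-windows" leaves _vimrc and _gvimrc in /tmp/vim-for-windows (the destination must already exist), ready to be copied to the windows machine.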

Another, trickier problem you may encounter, however, is that there’s no clear home directory on windows systems. Each major incarnation of windows seems to have a slightly different way of dealing with users’ files: there was a change from 2000 to XP, and another from XP to Vista. I haven’t tried VIM on W7 yet, but it seems similar to Vista in structure, so this information may also apply to W7.

The Vista 64 version of VIM I have looks in another place for configuration files. For a global configuration file, it looks in “C:\Program Files”. Yes.. “C:\Program Files”. According to Vista 64’s version of VIM.. that’s the exact directory where I installed VIM. This is clearly not right. What’s happening is that the file system on windows is different to the unix-type file systems, and the VIM port is having problems adapting. The real VIM install directory is C:\Program Files\vim72. Because VIM is looking for a global configuration file in “C:\Program Files\_vimrc”, it’ll never find it.

Now you could override this with a batch file that sets the right environmental variables on startup, or you could change the environmental variables exported in windows, but I prefer to have a user-specified configuration file in my personal files directory, as it’s easier to backup and manage. If you wanted to specify the environmental variables yourself, which I’m guessing many will, the two environmental variables to override are:

$VIM = the VIM install directory, not always set properly, as I mentioned.
$HOME = the logged in user’s documents and settings directory, in windows speak this is also where the ‘user profile’ is stored, which is a collection of settings and configurations for the user. The exact directory will depend on which version of Windows you’re running, and if you override the HOME folder, you may have problems with other programs that rely on it being static.

On my Windows Vista 64 install:

$VIM = “C:\Program Files”
$HOME = “C:\Users\Dave”

You can see what files VIM includes by running the handy command

vim -V

at a command prompt; it will go through the different settings and output something similar to this:

Searching for "C:\Users\Dave/vimfiles\filetype.vim"
Searching for "C:\Program Files/vimfiles\filetype.vim"
Searching for "C:\Program Files\vim72\filetype.vim"
line 49: sourcing "C:\Program Files\vim72\filetype.vim"
finished sourcing C:\Program Files\vim72\filetype.vim
continuing in C:\Users\Dave\_vimrc
Searching for "C:\Program Files/vimfiles/after\filetype.vim"
Searching for "C:\Users\Dave/vimfiles/after\filetype.vim"
Searching for "ftplugin.vim" in "C:\Users\Dave/vimfiles,C:\Program Files/vimfiles,C:\Program Files\vim72,C:\Program Files/vimfiles/after,C:\Users\Dave/vimfiles/after"
Searching for "C:\Users\Dave/vimfiles\ftplugin.vim"
Searching for "C:\Program Files/vimfiles\ftplugin.vim"
Searching for "C:\Program Files\vim72\ftplugin.vim"
line 49: sourcing "C:\Program Files\vim72\ftplugin.vim"
finished sourcing C:\Program Files\vim72\ftplugin.vim
continuing in C:\Users\Dave\_vimrc
Searching for "C:\Program Files/vimfiles/after\ftplugin.vim"
Searching for "C:\Users\Dave/vimfiles/after\ftplugin.vim"
finished sourcing $HOME\_vimrc
Searching for "plugin/**/*.vim" in "C:\Users\Dave/vimfiles,C:\Program Files/vimfiles,C:\Program Files\vim72,C:\Program Files/vimfiles/after,C:\Users\Dave/vimfiles/after"
Searching for "C:\Users\Dave/vimfiles\plugin/**/*.vim"
Searching for "C:\Program Files/vimfiles\plugin/**/*.vim"
Searching for "C:\Program Files\vim72\plugin/**/*.vim"
sourcing "C:\Program Files\vim72\plugin\getscriptPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\getscriptPlugin.vim
sourcing "C:\Program Files\vim72\plugin\gzip.vim"
finished sourcing C:\Program Files\vim72\plugin\gzip.vim
sourcing "C:\Program Files\vim72\plugin\matchparen.vim"
finished sourcing C:\Program Files\vim72\plugin\matchparen.vim
sourcing "C:\Program Files\vim72\plugin\netrwPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\netrwPlugin.vim
sourcing "C:\Program Files\vim72\plugin\rrhelper.vim"
finished sourcing C:\Program Files\vim72\plugin\rrhelper.vim
sourcing "C:\Program Files\vim72\plugin\spellfile.vim"
finished sourcing C:\Program Files\vim72\plugin\spellfile.vim
sourcing "C:\Program Files\vim72\plugin\tarPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\tarPlugin.vim
sourcing "C:\Program Files\vim72\plugin\tohtml.vim"
finished sourcing C:\Program Files\vim72\plugin\tohtml.vim
sourcing "C:\Program Files\vim72\plugin\vimballPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\vimballPlugin.vim
sourcing "C:\Program Files\vim72\plugin\zipPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\zipPlugin.vim
Searching for "C:\Program Files/vimfiles/after\plugin/**/*.vim"
Searching for "C:\Users\Dave/vimfiles/after\plugin/**/*.vim"
Reading viminfo file "C:\Users\Dave\_viminfo" info
Press ENTER or type command to continue

Notice how it does pull in all the syntax highlighting macros and other extension files correctly, which are specified in the .vim files above.. but it doesn’t pull in the global configuration files that I’ve also copied to C:\Program Files\vim72\_gvimrc and C:\Program Files\vim72\_vimrc. However, it does pick up the files I copied to C:\Users\Dave.. both C:\Users\Dave\_vimrc and C:\Users\Dave\_gvimrc are picked up, although VIM will normally only read ‘_gvimrc’ when the gui version of VIM (called gvim) is run.

To see exactly what those environmental variables are being set to, when you’re inside the editor, issue these two commands, and their values will be shown in the editor:

:echo $HOME
:echo $VIM

It seems to make sense for me – and perhaps you, if you’re working with VIM on windows – to place my _vimrc and _gvimrc configuration files in $HOME in Vista. They are then picked up without having to worry about explicitly defining any environmental variables, creating a batch file, or any other hassle.

You can do this easily by the following two commands:

:ed $HOME\_vimrc
:sp $HOME\_gvimrc

That will open the two new configuration files in two split windows, and you can paste in your existing configuration that you’ve used in Linux. VIM on windows will pick them up the next time you start it.

Regex in VIM.. simple

There are more than a gazillion ways to use regexs. I am sure they are each very useful for their own subset of problems. The sheer variety can be highly confusing and scary for a lot of people though, and you only need to use a few approaches to accomplish most text-editing tasks.

Here is a simple method for using regex in the powerful text editor VIM that will work well for common use.


We are going to take the “search and delete a word” problem for an example. We want to delete all instances of the singular noun “needle” in a text file. Let’s assume there are no instances of the pluralisation “needles” in our document.

  1. Debug on.. turn some VIM options on
    :set hlsearch
    :set wrapscan
    – this will make all regex expressions possible to debug by visually showing what they match in your document (first line) and make all searches wrap around the end of the document instead of stopping there (second line). wrapscan is normally already on by default in VIM, so setting it here just makes sure.
  2. Develop and Test.. your regex attempts by using a simple search. Here we see three attempts at solving the problem: :/needl
    – our third try is correct, and highlights all words that spell “needle”. The \< and \> markers allow you to specify the beginning and the end of a word. Play with different regexs using the simple search and watching what is highlighted, until you discover one that works for you.

  3. Run… your regex :%s/\<needle\>//g – once you’ve figured out a regex, run it on your document. This example will execute a search for the word “needle” and delete every one. If you wanted to substitute another word for needle, you would put that word in between the // marks. As we can see, there is nothing between the marks in this example, so it will replace instances of “needle” with nothing. This means it will serve to delete every instance of the word “needle”.
  4. Check things are OK… with your document :/\<needle\>
    – has the regex done what you wanted it to do? Use the search function to find out. The above examples show different searches through the document to see if different variations remain. Any matches of these searches will highlight any problems. You can use the lower-case n (next search result) and upper-case N (previous search result) commands to navigate through any found search results. You must remember to manually look through the document and see what the regex has changed, to make sure there aren’t any unwanted surprises!
  5. Recover… from any mistakes u – just press the U key (with no capslock or shift). This will undo the very last change made to the document.
  6. Redo… any work that you need to <ctrl>-r – use the redo function; press the CONTROL and R keys together (with no capslock or shift). This will redo the last change that was undone.
  7. Finish up and Write… to file :w – write your work on the document to file. Even after you have written out to file, you can probably still use the undo function to get back to where you were, but it’s best practice to not rely on this, and only write once you’re done.
  8. Debug off.. turn some options off
    :set nohlsearch
    :set nowrapscan
    – turn off the regular expression highlighting (line 1). turn off the wraparound searching (line 2). You can leave either or both options on if you want, they’re often useful. Up to you.

Use a combination of these wonderful commands to test and improve your regex development skills in VIM.
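
The same word-boundary idea carries over outside the editor. As a sketch, assuming GNU sed (which also understands the \< and \> word markers), this hypothetical delete_word function removes whole-word matches from whatever is piped into it:

```shell
#!/bin/sh
# delete_word <word>
# Reads text on standard input and deletes every whole-word occurrence
# of <word>, leaving longer words that merely contain it untouched.
# Assumes GNU sed; the word must not contain '/' or regex special characters.
# (delete_word is a hypothetical helper name.)
delete_word() {
    sed "s/\\<$1\\>//g"
}
```

For example, "delete_word needle < notes.txt > notes-clean.txt" writes a copy of notes.txt without the word “needle”, while a word like “needlework” is left alone. The file names are just examples.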


Here I use the shorthand “#…” to denote comments on what I’m doing… if you want to copy and paste the example as written, then you will have to remove those comments.

1. Remove ancient uppercase <BLINK> tags from a document.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BLINK> # try 1.. bingo! first time.. selected all tags I want
:%s/<BLINK>//g # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<BLINK> # check 3.. yep looks ok... the problem tags are gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

2. Oh no! We missed some lower and mixedcase <bLiNK> tags that some sneaky person slipped in. Let’s take them out.

:set wrapscan # debug on
:set hlsearch # debug on
:/<blink> # try 1.. hm.. worked for many, but didnt match BlInK or blINK mixedcase
:/<blink>\c # try 2.. adding \c makes the search ignore case.. much better.. seems to have worked!
:%s/<blink>//gi # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<blink> # check 3.. yep thats fine.
:/<blink>\c # check 4.. looks good... problem solved
# ...manual scroll through the document.. looks much better!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

3. Replacing uppercase or mixedcase <BR> tags with the more modern <br>.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BR> # try 1.. hmm.. just uppercase.. not gonna work..
:/<br> # try 2.. hmm.. just lowercase..
:/<BR>\c # try 3.. adding \c makes the search ignore case.. ahh.. that'll be it then
:%s/<BR>/<br>/gi # lets execute my regex substitution
:/BR # check 1.. testing things are OK in my file by searching through..
:/br # check 2.. yep thats ok..
:/bR # check 3 ..yup..
:/<BR>\c # check 4.. only lowercase tags should match now.. yep looks ok... the problem tags seem to be gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off
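
If you need to apply the same fix to a lot of files, the substitution from example 3 can be run outside VIM too. A minimal sketch, assuming GNU sed, whose I flag makes the match case-insensitive; lowercase_br is a hypothetical function name.

```shell
#!/bin/sh
# lowercase_br
# Reads HTML on standard input and rewrites <BR>, <Br>, <bR> and so on
# as <br>. The g flag replaces every match on a line, and the I flag
# (a GNU sed extension) ignores case when matching.
# (lowercase_br is a hypothetical helper name.)
lowercase_br() {
    sed 's/<BR>/<br>/gI'
}
```

For example, "lowercase_br < page.html > page-fixed.html", where the file names are just examples. As with the VIM version, check the result before overwriting anything.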

For More..

Regexs are the gift that just keeps on giving. Here are some good resources on regexs in general, and regexs in VIM.