Netbeans for simple Java GUI Applications

I’ve been writing some simple Java GUI applications using the Netbeans IDE. It allows you to quickly make event-driven GUI applications, and generates a lot of skeleton code that you’ll need, but don’t necessarily want to type out. It reminds me of the IDE designer of Visual Basic 6, which allowed you to mock up simple GUIs with code in almost no time at all, although the VB language itself often proved difficult. With Netbeans you are using Java, and so you can make some powerful software with little effort.

Converting week numbers to dates

Here is some Python code I adapted from this Stack Overflow post to get the first day of a week specified by a week number. This method accounts for leap years and summer time differences.

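import time

def weeknum(num, year):
    # %W counts weeks from 0 with Monday as the first day, so week N is passed as N-1;
    # the trailing " 1" is %w, i.e. Monday
    instr = str(year) + " " + str(num - 1) + " 1"
    print(time.asctime(time.strptime(instr, '%Y %W %w')))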

Here is me executing the code in Python’s IDLE shell:

See that the first week of 2009 actually started in 2008, but by the end of that week we are in 2009.
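If you are on Python 3.8 or later, a possible alternative is date.fromisocalendar from the standard library, which follows ISO 8601 week numbering (week 1 is the week containing the year's first Thursday). This is only a sketch and not the adapted code above; the function name first_day_of_iso_week is my own. ISO numbering agrees with the 2009 example, but it will not match the %W numbering in every year.

```python
from datetime import date


def first_day_of_iso_week(year, week):
    """Return the Monday that starts the given ISO 8601 week."""
    # The third argument is the ISO weekday: 1 = Monday ... 7 = Sunday.
    return date.fromisocalendar(year, week, 1)


if __name__ == "__main__":
    # ISO week 1 of 2009 begins on 2008-12-29, in the previous calendar year.
    print(first_day_of_iso_week(2009, 1))
```

For 2010 the two schemes disagree: ISO week 1 starts on 4 January 2010, whereas the %W approach above puts the Monday of its first week in late December 2009.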

MediaMonkey allows you to transfer music from any computer onto your guest iPhone

MediaMonkey is a popular free media player for Windows. It has a great feature that allows you to transfer to and from an iPhone that is not registered with your computer. Normally only one iTunes install can be associated with your iPhone, but MediaMonkey allows you another way to transfer music and audio files with a ‘guest’ iPhone. Check it out, it works:


Applications I Recommend

Software I use on my macbook & PC:

DVDRipper Pro for Mac – DVD ripping, can also rip to ISO
Handbrake for Mac – Transcoding from DVD rip to iPhone-playable file
iMovie for Mac – Video editing
BabasChess for Windows – Best chess client for internet play
Hypercam 2 – Best screencapture utility
Skype for both – For reliable messaging as well as voice and video chat
Virtual Clone Drive for Windows – For mounting ISO images
iTunes for Mac – Best music player, and keeps media synced with iPhone
VLC Player for both – For watching movies
DVD Player for Mac – For watching DVDs

iPhone Apps:

Skype – Best messenger
iBooks – Best ebook reader
London Buses – Best London transport router, can route via tube, bus, cycle path and foot
Tube Status – Displays the status of all lines, with any disruptions summarised
NextBuses – Great app that gives you lots of info on the buses and bus stops in your area.
Apple Remote – Allows you to control the music on any wi-fi linked iTunes library
Chess – Great chess game for vs. computer play
TasteKid – Type in a film, author, tv series.. and it will give you similar recommendations
Google Earth – Brilliant for navigational help, although I use iPhone’s inbuilt Maps first, for most things.
SomaFM – Chilled out relaxing electronica

Recording Game Videos on Windows 7

This is just a quick note to remind myself how I did this.

  • Hypercam2 is a good, free, video recorder that can cope with recording game videos. It’s freely available from – just make sure when you install it you don’t tick on the spyware toolbar installation options.
  • My motherboard has a 5.1 digital soundcard built in. However the only way I can record off the soundcard is to plug in a standard audio cable from the speaker out (green) to the microphone in (orange).
  • The soundcard switches off the headphone output when it detects a speaker attached to the speaker out, so you have to go to the recording options in Windows 7 and right click on the microphone in. It will give you an option to ‘Monitor this input using the headphones’ – which will allow you to listen to anything coming into the microphone socket through the headphone socket on the front of my PC.
  • In hypercam, set the sound to record from the default input device, set the frame rate to 10/10
  • Record using the ‘select window to record from’ option, select the game window, and use the F2 button to start and stop the recording.
  • The video will be output in AVI format, but you can transcode or convert it into a quicktime MOV file for editing in iMovie, or you can use windows movie editor, which is free and quite good.

Insights into a modern Indie Music label

I read this remarkable post on a public mailing list I subscribe to. I thought it was such a great insight into running a music label, that I just had to post it here. It discusses issues facing modern music, such as DRM, DMCA, and other ways of making (or losing) money. Fascinating.

Here it is:

I work for a (fairly small) indie label – from witnessing this model in action I feel I have to stick up for the label given that I see the model working (or sometimes not so well) on a daily basis! Where we’ve done deals with artists in the past, they’ve almost always been a 50/50 arrangement – the artist receives 50% of net royalties. Where a label fronts recording costs, these can easily become £6-10,000 for an album session. Even an EP session can be upwards of £1,500 although these figures are a little pessimistic (though not unrealistic). (We actually designed, built and owned studios for ten years until 2001 but the project haemorrhaged money.)

With regards to CD pressing, a 1,000 run will cost around £800 including full colour print in a basic jewel case. The AP1/AP2a MCPS licence costs another amount on top. When getting your CDs pressed, add in other things (Super Jewel cases, slip / O-cards, digipaks or gatefolds with high quality card / fancy posters) and you can easily top the 1k mark, not even counting the artwork design costs. Of course, discount comes with bulk, but almost nobody except the Big Four do >1k discs in a pressing. (To put things in perspective: when SyCo have done the X Factor Finalists CDs, they press up >10,000 of EACH finalist’s recording of the song – and shred the losers’ copies when the winner is announced!)

To put stuff into distro with someone like Universal, you have your line costs simply to have the title listed on their system – monthly recurring, per title – then handling costs, despatch costs, “salesforce” costs (even though really the only people they sell into are HMV now, and from last year they’ve stopped guaranteeing racking in all but the top 6 or so stores in the UK, it’s a joke). You can’t sell your discs through at full retail, you have your wholesale (Dealer) price. We’ve sold albums through at £6.65 and I’ve later seen them in a London HMV for £12.99. Oh, and did I mention that supermarkets and stores like HMV *DEMAND* what they call a “file discount” of up to 40% just to take stock? (which is on a non-negotiable sale or return basis with up to a six month returns period.)

If you end up in a position where you don’t sell stock through into shops, it usually costs less for your distro to SHRED your discs than it does to send it back to you! Ridiculous. The costs are stacked against the labels at all points – incredibly frustrating. And that’s even before you begin to contemplate any plugging, promo, advertising, miscellaneous online, merch, booking agent / gig costs… Or even an advance for the artist! But it gets better…

So, while this figure of 63% which the old techdirt article quotes might be valid for major labels (who might also own distribution, management, publishing and studios under the same roof), the model quickly falls apart as soon as you focus on a smaller label. I used to think the whole model was bullshit and the artists got shafted, but if anything it’s level pegging – smaller labels have just as tough a time as artists as the risk to them to fund any new release is proportionally WAY larger. Also, the techdirt article works on the basis of the artist receiving a 20% royalty – this is dismal, and the artist should be smacked for agreeing to such a pitiful rate like the chumps they probably (hypothetically) are.

Take one of our real world iTunes scenarios – from a 79p purchase, iTunes immediately keeps about 32p. For UK and most worldwide sales, this also includes the royalties which the label’s obliged to pay (in the UK, to the MCPS-PRS Alliance). However, the USA requires the selling party to pay the mechanical on each sale (an arse-about-tit form which has arisen from the disconnected Collection Agencies – Harry Fox Agency being the incumbent on Mechanicals and ASCAP, BMI and SESAC on the Performance royalties), which adds yet another level of complication.

From what’s left (47p), you halve the resulting amount on a 50/50 deal. Neither the label nor the artist gets much for their work. On some artists whom we’ve purely done digital distribution for (on a rolling licence agreement), we give the artist 80% of net. As you can imagine, we get virtually nothing – and our income’s directly tied to their success, so we have an interest in seeing them do well. It’s a tough environment to be in.

For receiving US/Canadian/Mexico/European/Australasian payments, we first have to receive the currency and have the bank convert it to GBP. Of course, we can’t get the Interbank rates, nobody but the banks get those – so more money’s immediately lost in conversion. The larger labels will have sweetheart deals with their banks (or almost certainly have accounts in each relevant territory) so this isn’t so much of a big deal, but the amount of administration just scales inordinately. If you deal with managing your artists’ Publishing rights, you can quickly become LITERALLY swamped in paperwork. The amount of time sucked up by adminning the release of music is extraordinary.

So please nobody think all music labels have it easy… I have no doubt that the Big Four have royally shafted artists in the past but they can largely lumber along based on a few artists doing exceptionally well for the rest of their current roster (with their back catalogue from very famous artists helping too). The problem they’re going to have is that almost none of the artists whose catalogue’s been released in the past two decades *really* has the staying power of the classic artists – Dire Straits, Genesis, Pink Floyd, The Who or Fleetwood Mac, just to name five off the top of my head. Don’t even get me started on the epic fail that is streaming revenues from Spotify, mFlow, We7 etc.

Now even with all of this, I still regard sites like YouTube as a promotional tool. Some of our most famous catalogue I’ve held off on issuing DMCA takedowns for, because it’s a genuinely beneficial promotional tool – it’s the pragmatic response. Where do people go first if they want to quickly listen to a track? YouTube! What happens if they only ever wanted to hear it once and never again? You’ve not lost that sale because it almost certainly would never have happened. What happens if they still want to have a copy of that track? They’ll go buy it from one of the easily accessible venues, it’s not expensive to do. The label’s job is to make the catalogue ubiquitous on all of the major (and some of the trendier niche stores) where at all possible. The digital distribution costs are another thing the label has to absorb – monthly, per track, per store usually, if not on an aggregation deal where it’s a percentage on each sale but the label usually ends up worse off. It’s a tough position because the label almost always feels the need to protect their ‘content’ (shudder – hate that word) but issuing takedowns for every instance of a track is more often than not a kneejerk reaction which harms longterm sales. I’m personally torn between leaving them, taking them down or even putting up better mashup/promo mix versions on the label’s official account!

Treat your customers like adults and I think you earn their respect a bit more. This applies to all forms of digital media, including tellybox shows. (thesis: DRM = genuinely unhelpful towards nurturing that unique supportive viewer-provider relationship. Trust your customers, they’ll not disrespect you.) In music, nobody wants to buy a track if they can never audition it, and 30sec samples aren’t really good enough.

Restoring Ubuntu 10.04’s Bootloader, after a Windows 7 Install

I installed Windows 7 after I had installed Ubuntu 10.04. Windows 7 overwrote the Linux bootloader “grub” on my master boot record. Therefore I had to restore it.

I used the Ubuntu 10.04 LiveCD to start up a live version of Ubuntu. While under the LiveCD, I then restored the Grub bootloader by chrooting into my old install, using the linux command line. This is a fairly complex thing to do, and so I recommend you use this approach only if you’re confident with the linux command line:

# (as root under Ubuntu's LiveCD)

# prepare chroot directory

mkdir /chroot
d=/chroot   # $d is the chroot directory used by every command below

# mount my linux partition

mount /dev/sda1 $d   # my linux partition was installed on my first SATA hard disk, on the first partition (hence sdA1).

# mount system directories inside the new chroot directory

mount -o bind /dev $d/dev
mount -o bind /sys $d/sys
mount -o bind /dev/shm $d/dev/shm
mount -o bind /proc $d/proc

# accomplish the chroot

chroot $d

# proceed to update the grub config file to include the option to boot into my new windows 7 install


# install grub with the new configuration options from the config file, to the master boot record on my first hard disk

grub-install /dev/sda

# close down the liveCD instance of linux, and boot from the newly restored grub bootloader


Windows 7 Gaming on my Macbook

I have a 2006/2007 Core 2 Duo 2.6ghz white macbook, that I use regularly for internet, music, watching films, itunes and integration with my iPhone.

I wanted to turn my desktop PC into a ‘work only’ Ubuntu Linux machine, so that I don’t get distracted when I’m supposed to be doing something else.

But I still have a lot of PC games that I wanted to play on the Macbook, so I decided to try and setup a windows environment to play games on using Bootcamp 2.0 to create a dual-boot OSX/Windows 7 configuration.

It turns out it works really well. The Macbook runs Windows 7 64-bit edition fine, and although the integrated graphics card isn’t designed to run modern games very well, you can get a good gaming experience from small indie games and the older type of PC RPGs that I tend to play. My macbook got a 3.5 rating on the windows experience index for graphics, which is sufficient for many PC games.

First you need to partition your macbook’s HD using the Bootcamp assistant, in the OSX utilities section. Make sure you have your first OSX installation DVD to hand, the one that came with your Macbook. I chose to split the hard drive into two equally sized partitions. Then just place your W7 DVD in the drive, and Bootcamp takes care of the rest.

Once W7 is installed, you can access the Bootcamp menu on startup by holding down the option key. This brings up a menu where you can select to boot into OSX or Windows.

When you start W7 for the first time, you can install the windows driver set for your Macbook that Bootcamp provides you. Insert your OSX installation DVD 1, and run the setup.exe that is located in the Bootcamp folder. This will install native windows drivers for your Macbook hardware.

The only change I needed to make for my macbook, was to install the latest 64bit Realtek drivers for Vista/Windows 7, which are located on the Realtek website. This will fix any sound problems you might have while playing games.

Now don’t expect to run the latest 3D games, but if you’re happy enough with slightly older, classic, indie or retro games, you can get a good gaming experience on Windows 7 from your macbook. It does well with plenty of the indie games available on Valve’s Steam distribution network.

Ripping Movies onto the iPhone

I’m currently watching Persepolis, the 2008 animated film about a tomboy anarchist growing up in Iran. I’m watching this on my new iPhone 3GS, and the picture and audio quality is very good.

Here’s what I used to convert my newly bought Persepolis DVD, for watching on the iPhone.

1x Macbook (but you can use any intel mac)
1x iTunes
1x RipIt – Commercial Mac DVD Ripper (rips up to 10 DVDs on the free trial, $20 after)
1x Handbrake 32 – Freely available transcoder
1x VLC 32 – Freely available media player
1x DVD

* Ripit – rips the video and audio from the DVD, onto your computer
* Handbrake 32 – ‘transcodes’ the ripped video and audio, meaning – it converts it into an iPhone compatible video file.
* VLC 32 – is used by Handbrake 32 to get past any problems with converting the media.

Go to the following sites to fetch the software:

1. Ripit –
2. Handbrake 32 – (get the 32 bit version)
3. VLC 32 – (be sure to get the 32 bit version)

There’s currently a difficulty in getting the VLC 64 bit software for the Mac, and so although the 64 bit version is faster to use, you’re probably better off with 32 bit versions of both for now.

The Process

1) Rip the DVD.

Start RipIt. It will ask for a DVD, insert the DVD.. and point the resultant save location to the desktop. The ripping process takes about 40 minutes on my Macbook, you can check the progress by looking at the icon in the dock – it will be updated with the percentage of progress until completion. You can do other things on your mac while it’s ripping, even though the DVD drive will be occupied. Wait until it’s completed before continuing.

2) Transcode (convert) the ripped video file for use on the iPhone.

Start Handbrake. There are a bunch of transcoding settings called presets – those tell Handbrake what type of media player you want the converted video to work on. In handbrake on the right section of the window, select the iPhone preset. Then go to the file menu, select ‘Open’, and then select the video file that RipIt saved onto your desktop. Then select the destination for the converted video file. Then select the Start (green) button on Handbrake window, and it will start. You can now minimise handbrake and do other things. The transcoding process depends on the film, but takes about an hour on my Macbook. You can check on progress by maximizing the Handbrake window, and checking on the progress bar.

3) Move the converted video file onto your iPhone.

Once that’s done, you will have another media file on your desktop – this is the end result, a video file that will play on your iPhone. Simply connect your iPhone to your Mac, start up iTunes, and drag that file from your desktop onto the iPhone icon in your iTunes window. It will take a couple of minutes to transfer, then eject the iPhone as normal.

Now you can watch this new movie on your iPhone by going to the ‘Videos’ tab of your iPod app.

WordPress HTML edit mode inserts BR tags sometimes when you add a carriage return..

This is something that was quite annoying today, as I was struggling to use WordPress 2.9.2 to align some pictures in the HTML mode of editing a page, on a client’s website.

It turns out that WordPress was adding BR tags sometimes when I hit return.. and sometimes not. The annoying thing was, although the BRs were outputted in the resultant WordPress site, the BRs were not visible in the WordPress HTML edit mode itself.. meaning they were invisible and undetectable until I viewed the resultant website source and finally figured it out.

WordPress does insert some formatting tags now and then, it seems, but I would have thought it would tell you about the tags that would change the page layout! Apparently not. Anyway, something to be aware of for WordPress gurus..


I don’t have time to report this as a bug, but this is the stack I’m using for anyone interested:

Browser: Google Chrome for Mac (5.0.342.9 beta)
TinyMCE Advanced Editor Plugin for WP (3.2.7)
Wordpress 2.9.2

The beta of Google Chrome is a bit unstable, although it may not be the source of the problem.

Forkbombs and How to Prevent Them

A forkbomb is a program or script that continually creates new copies of itself, that create new copies of themselves. It’s usually a function that calls itself, and each time that function is called, it creates a new process to run the same function.

You end up with thousands of processes, all creating processes themselves, with an exponential growth. Soon it takes up all the resources of your server, and prevents anything else running on it.

Forkbombs are an example of a denial of service attack, because they completely lock up the server they’re run on.

More worryingly, on a lot of Linux distributions, you can run a forkbomb as any user that has an account on that server. So for example, if you give your friend an account on your server, he can crash it/lock it up whenever he wants to, with the following shell script forkbomb:

:(){ :|:& };:

Bad, huh?

Ubuntu server 9.10 is vulnerable to this shell script forkbomb. Run it on your linux server as any user, and it will lock it up.

This is something I wanted to fix right away on all my linux servers. Linux is meant to be multiuser, and it has a secure and structured permissions system allowing dozens of users to log in and do their work, at the same time. However when any one user can lock up the entire server, this is not good for a multiuser environment.

Fortunately, fixing this on ubuntu server 9.10 is quite simple. You limit the maximum number of running processes that any user can create. So the fork bomb runs, but hits this ceiling, and eventually stops without the administrator having to do anything.

As root, edit this file, and add the following line:


*               soft    nproc   35

This sets the maximum process cap for all users to 35. The root user isn’t affected by this limit. A limit of 35 should be fine for remote servers that are not offering users gnome, kde, or any other graphical X interface. If you are expecting your users to be able to run applications like that, you may want to increase the limit to 50. This will increase the time forkbombs take to exit, but they should still exit without locking up your server. Bear in mind that a user can raise their own soft limit up to the hard limit, so use hard in place of soft if you need the cap to hold against someone deliberately trying to get around it.
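To get a feel for why a low cap stops the bomb so quickly, here is a rough back-of-the-envelope sketch. It assumes an idealised model in which every copy of the function starts exactly two more (as the :|: pipe does) and nothing else on the account counts towards the cap; the function name generations_until_cap is my own.

```python
def generations_until_cap(cap, children_per_process=2):
    """Count the rounds of forking needed before the process count reaches cap."""
    processes, generations = 1, 0
    while processes < cap:
        processes *= children_per_process   # every process forks its children
        generations += 1
    return generations


if __name__ == "__main__":
    for cap in (35, 50):
        print("cap", cap, "reached after", generations_until_cap(cap), "generations")
```

Under that model both caps are hit after six rounds of doubling (32 processes is still under the cap, 64 is over), which is why raising the limit from 35 to 50 makes little practical difference to the bomb.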

Alternatively, you can set up ‘untrusted’ and ‘trusted’ user groups, and assign that 35 limit to the untrusted group, adding trusted users to the trusted group, which gets the higher limit. Use these lines:


@untrusted               soft    nproc   35
@trusted               soft    nproc   50

I’ve tested these nproc limits on 8.10 and 9.10 ubuntu-server installs, but you should really test your own server’s install, if possible, by forkbombing it yourself as a standard user, using the bash forkbomb above, once you’ve applied the fix. The fix is effective as soon as you’ve edited that file, but please note that you have to log out, and log back in again as a standard user before the new process cap is applied to your user account.
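Before running the forkbomb as a test, it may be worth confirming that the cap has actually been applied to your new login session. This is a small sketch using Python's standard resource module (Unix only); the helper names nproc_limits and show are my own.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) process-count limits of the current session."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def show(limit):
    # RLIM_INFINITY means no cap has been applied.
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("soft nproc:", show(soft))
    print("hard nproc:", show(hard))
```

Run it as the standard user after logging back in. If the soft value is not the number you configured, the new limit has not been picked up yet.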

How to remove nano, vim and other editors’ backup files out of a directory tree

gardening for science..

Linux command-line editors such as nano and vim often, by default, create backup files with the suffix “~”. I.e., if I created a file called /home/david/myfile, then nano would create a backup in /home/david/myfile~. Sometimes it doesn’t delete them either, so you’re left with a bunch of backup files all over the place, especially if you’re editing a lot on a directory tree full of source code.

Those stray backup files make directory listings confusing, and also add unnecessary weight to the commits on source control systems such as svn, cvs, git.. etc. If you’re working on a programming team with other people, then it causes further problems and confusion, because person A’s editor can accidentally load person B’s backup file.. etc etc. Nightmare.

So instruct your editor, or the programming team you’re working with, not to drop these backup files. You can configure most editors to change the place where the editor drops its backup files, so you could store all your backup files in a subdirectory of your home directory, for example, if needed. However I always set my editors not to leave backup files about.

Once you know that new backup files will not be created, view the current list of backup files, along with the user that created them.. so you know who’s been creating the backup files and when, etc:

find . -name '*~' -type f -exec ls -al {} \;

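Then archive the stray backup files, with these commands (the archive directory has to exist before mv can move files into it):

mkdir -p ./archived-backups
find . -name '*~' -type f -exec mv -i {} ./archived-backups \;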

That will find all backup files in the current directory and below, and move them all to a subdirectory in the current directory called ‘archived-backups’. This is a fairly safe find/exec command, because with the -i switch, mv will not ‘clobber’. This means that if you have two backup files, one in /opt/code/index~ and one in /opt/code/bla/bla/index~, they will not ‘clobber’, or overwrite each other automatically when moved into the new directory. You will be informed of any conflicts present so you can resolve them yourself.

However in practice I usually omit the ‘-i’ switch and let them clobber each other, because I usually end up deleting the ./archived-backups/ directory very quickly after that anyway.
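If you would rather script the archiving step, something like the following Python sketch does roughly the same job as the find/mv approach. It is not a drop-in replacement: the function name archive_backups and its return values are my own, and instead of prompting like mv -i it simply leaves clashing files where they are and reports them.

```python
import shutil
import tempfile
from pathlib import Path


def archive_backups(root, archive_name="archived-backups"):
    """Move every '*~' file under root into root/archive_name without overwriting."""
    root = Path(root)
    archive = root / archive_name
    archive.mkdir(exist_ok=True)
    moved, skipped = [], []
    # Collect first so that moving files does not disturb the directory walk,
    # and ignore anything already sitting in the archive directory.
    candidates = [p for p in root.rglob("*~")
                  if p.is_file() and archive not in p.parents]
    for path in sorted(candidates):
        target = archive / path.name
        if target.exists():
            skipped.append(path)   # name clash: leave it for a human to resolve
        else:
            shutil.move(str(path), str(target))
            moved.append(path)
    return moved, skipped


if __name__ == "__main__":
    # Demo in a throwaway directory, using two clashing names as in the text.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "bla" / "bla").mkdir(parents=True)
        (base / "index~").write_text("top-level backup")
        (base / "bla" / "bla" / "index~").write_text("nested backup")
        moved, skipped = archive_backups(base)
        print("moved:", [str(p.relative_to(base)) for p in moved])
        print("skipped:", [str(p.relative_to(base)) for p in skipped])
```

Only one of the two index~ files ends up in the archive; the other is returned in the skipped list so you can decide what to do with it.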

Tip for watching the completion of a large file copy

Forget the wonderful windows progress bar, and imagine I’m in the world of command-line Linux, and I want to copy a 484MB file, called VMware-server-2.0.2-203138.i386.tar.gz, from my home directory to a remote server. But I want to figure out how long it’s going to take.

1. First I can run a “du -m” command to get the total MB size of the original file:

du -m /home/david/VMware-server-2.0.2-203138.i386.tar.gz


david@believe:~$ du -m VMware-server-2.0.2-203138.i386.tar.gz
484 VMware-server-2.0.2-203138.i386.tar.gz

Now I know it is approximately 484MB.

2. Then I run the copy. I’m copying the file from /home/david/ to /opt/remote/myserver, which is a remotely mounted directory on a server somewhere in Canada.

david@believe:~$ cp ./VMware-server-2.0.2-203138.i386.tar.gz /opt/remote/myserver/

At this point cp will just hang until it’s finished. There is normally no progress indicator or anything. But I want to figure out how much of the file has been copied, so I can figure out how much is left to copy, and get a rough idea of the progress.

3. So I SSH into the remote server in Canada, and run this command

david@myserver:~$ watch du -m ./VMware-server-2.0.2-203138.i386.tar.gz

The copy command by default seems to be incremental, i.e. piece by piece, not all at once. Therefore with the “watch” command, you can watch the size, in MB, of the new file as it grows. The watch command will refresh every 2 seconds, so you’ll be updated as the copy goes on.

You can probably invoke a progress meter with the cp command, or use rsync. Rsync is much better for large file copies, and remote file copies. But the advantage of the method above is that you can watch file copies already executed without any special arguments, which I sometimes find very useful when I remember that that file I already started copying isn’t 200MB.. it’s actually 2.5GB.
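The same idea can be sketched in Python if you want a percentage rather than a raw size. This is only an illustration: the function names are my own, you have to supply the total size yourself (484MB in the example above), and it reads the file's apparent size rather than the disk usage that du reports, so the numbers can differ slightly.

```python
import os
import time


def percent_copied(dest_path, total_bytes):
    """Return how much of total_bytes has arrived at dest_path, as a percentage."""
    try:
        current = os.path.getsize(dest_path)
    except FileNotFoundError:
        current = 0   # the copy has not created the file yet
    return min(100.0, 100.0 * current / total_bytes)


def watch_copy(dest_path, total_bytes, interval=2.0):
    """Print progress every `interval` seconds until the file reaches full size."""
    while True:
        done = percent_copied(dest_path, total_bytes)
        print(f"{done:5.1f}% copied")
        if done >= 100.0:
            break
        time.sleep(interval)
```

On the receiving machine you would call something like watch_copy("VMware-server-2.0.2-203138.i386.tar.gz", 484 * 1024 * 1024) while the copy is running. The 2 second default simply mirrors what watch does.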

The Linux Root Directory, Explained

It’s helpful to know the basic filesystem on a Linux machine, to better understand where everything is supposed to go, and where you should start looking if you want to find a certain file.

Everything in Linux is stored in the “root directory”. On a windows machine, that would be equivalent to C:\. C:\ is the main folder where everything is stored. On Linux we call this the “root directory”, or simply “/”. To go to this root directory, type:

cd /

To list all the folders and files in the root directory, type this:

ls /

Alternatively, if you want to see the folders and files exactly the way I see them below for easy comparison, type this:

ls -lhaFtr --color /

Once you’ve typed in one of the ‘ls’ commands above, you’ll see some information similar to that on the screenshot below.. (please scroll down)..

Ubuntu Linux

Above you can see the files and folders in the root directory of my ubuntu linux server, after I’ve typed ‘ls /’. Ignore everything but the coloured names on the right, those coloured names are the names of the files and folders in this directory. Don’t worry about the shades of different colours either. It’s not really important to explain how they are coloured right now, just to explain the purpose behind each file or folder shown.

So let me explain the purpose behind each of these, in turn. I’ll include the same screenshot multiple times, so you can reference the explanations against it as you scroll down.


– Directory for linux security features, rarely visited by normal users like you or me.


– Traditional directory for the files from removable media, ie USB keys, external hard drives. Not used anymore, it only exists for historical purposes.


– Directory where files and directories end up when they’ve been recovered from a hard disc repair.

 cdrom -> media/cdrom/

– Link to the files currently in your CDROM or DVDROM drive.


– New style directory for the files from removable media such as USB keys, external hard drives, etc. This is the new convention, and so you should always use media/ instead of mnt/, above.

vmlinuz.old -> boot/vmlinuz-2.6.31-17-generic

– A backup of your most recent old Linux operating system kernel, ie: your operating system. Don’t delete this =)

initrd.img.old -> boot/initrd.img-2.6.31-17-generic

– Another part of the backup for your most recent old Linux kernel.


– An empty directory reserved for you to put third-party programs and software in.


– Operating system drivers and kernel modules live here. Also contains all system libraries, so when you compile a new program from the source code, it will use the existing code libraries stored here.


– Basic commands that everyone uses, like “ls” and “cd”, live here.


– This is where all user-supplied software should go; ie: software that you install that doesn’t normally come with the operating system. Put all programs here.


– Basic but essential system administration commands that the admin user only uses, ie: reboot, poweroff, etc.

vmlinuz -> boot/vmlinuz-2.6.31-20-generic

– Your actual operating system kernel, ie: the one that is running right now. Don’t delete this.

initrd.img -> boot/initrd.img-2.6.31-20-generic

– Another part of the kernel that is running right now.


– Reserved for Linux kernel files, and other things that need to be loaded on bootup. Don’t touch these.


– Proc is a handy way of accessing critical operating system information, through a bunch of files. Ie: try typing ‘cat /proc/cpuinfo’. That queries the current kernel for the information on your processors (CPUs), and returns the info for you in a text file.


– Like proc/, this is another bunch of files that aren’t files at all, but ‘fake’ files. When you access them, the operating system goes away and finds out information, and offers that information up as a text file to you.


– Device files. In here live the device files for your hard drives, your CD/DVD drives, your soundcard, your network card.. in fact anything you have installed that Linux uses, it has a counterpart in here that is automatically added and removed by the OS. Don’t ever delete, move or rename any of the files here.


– The directory that you’ll use the most. Every user on your Linux machine, except the system administrator, has a folder here. This is where each user is meant to store all their documents. Think of it as the Linux ‘My Documents’ folder.


– This is a catch-all directory for ‘variables’, ie things that the OS has to write to, and vary, as part of its operation. Examples include: email inboxes for all users, cache files, the lock files that are generated and removed as part of normal program execution, and also the /var/www directory. /var/www is a directory you will probably see and use a lot, as it is where all the websites are stored that your linux machine serves when operating as a web server. /var/log is also a very important directory, and contains ‘log’ files which is a kind of “diary” that the linux OS uses to explain exactly what it’s done, as it happens, so you can easily find out what’s been going on by viewing the right log file.


– The space for any and all temporary files. Store files here that you want to throw away quite quickly. Depending on your configuration, all files and folders in the /tmp directory may be deleted on system reboot, or more frequently, perhaps every day.


– This is the system administrator’s ‘my documents’ folder. Anything that the sysadmin stores, for example programs that they download, is put here. It is not accessible to anyone but the system administrator.


– Configuration files. Any and all program configuration files or information belong here. Think of it like the windows registry, except every registry entry is a text file that you can open up and edit, and also copy, move around, and save. You will sometimes have to create configuration files yourself and put them in this directory. They are almost always simple text files.

And that’s a basic overview of the files and folders in the root directory of your linux machine.
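As a small illustration of the ‘fake’ files described above, here is a minimal Python sketch that reads /proc/cpuinfo the same way ‘cat /proc/cpuinfo’ does and splits it into one dictionary per processor. The function name parse_cpuinfo is my own, and the exact field names (such as ‘model name’) are an assumption that varies between CPU architectures.

```python
import os


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor block."""
    processors = []
    # Each processor is described by a block of "key : value" lines,
    # and blocks are separated by a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if fields:
            processors.append(fields)
    return processors


# Only try the real file where it exists (ie: on Linux).
if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as handle:
        for cpu in parse_cpuinfo(handle.read()):
            print(cpu.get("processor"), cpu.get("model name"))
```

The file is opened like any ordinary text file, which is the whole point: the kernel generates the contents at the moment you read it.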

Useful OSX commands for Linux users

I wrote this list to remind me, as a newcomer to OSX, how the command line differed from the Linux commandline. I thought I’d expand on it, and share it:

To mount any iso:

hdiutil mount sample.iso

To download a file as you would using wget:

curl -o linuxmint.iso -C -

the -o specifies the output file (required)
the -C - tells curl to resume an interrupted download automatically, if possible.

To burn a bootable iso to CD, DVD or USB key:

use the “diskutil” program as described in:

Monitor disk io utilisation.. poll once per second

iostat -c 99999

will run until 99999 seconds have passed.

Monitor CPU and memory utilisation.. polling per second


Just like Linux.

Mount Windows Shares

mount -t smbfs //@/ 


mount -t smbfs //davec@SERVER/Dev samba-to-netdev

then it will appear mounted in /Volumes with the mount point name you supplied, ie: /Volumes/samba-to-netdev/.

Long Bash History Files are Great.

When I’m installing software, or doing some complicated stuff on the linux command line, which nowadays is pretty much all the time, I will sometimes want to remember exactly what I typed.

Now the normal /home/david/.bash_history file is usually fine for that. Run this command, for example, and you will see the commands you typed in before you logged out of the server last time you used it:

cat ~/.bash_history

You can also find out what you typed in this session, ie: since you logged in, by typing this:

history

This is great, and it’s even more useful if you add a grep pipeline, so you can search through the previous commands you typed in for a particular phrase or command, ie:

history | grep apt-get

However what I really want nowadays is an almost infinite bash_history file, so I can find out not just what I did last week, but two weeks ago, last month, or perhaps last year. There are obvious security risks involved with this. To make sure you don’t accidentally store mistyped passwords to other systems, or other sensitive things, you should never type them in on the command line. This is good practice anyway, and since I use key-based SSH logins exclusively nowadays, there is not much chance of me tripping up, typing a password into the terminal, and then forgetting about it. In theory, however, using long/infinite bash_history files does mean that if anyone compromised your shell account, they’d have any passwords that you mistyped at the prompt.

So I’m careful with this. You can also clear your history file quite quickly if you do accidentally find you’ve messed up. Log out, log back in again, and just do this:

echo  > ~/.bash_history

That will delete all the previously logged commands.

Apart from serving as a major memory aid to complicated install work, and a log for those increasingly complicated chained, piped, one-liners that I’m fond of but only really want to have to type once, there are other benefits to keeping a large bash_history file. The main one is that it makes it easy to convert your previous commands into a handy shell script or two, which you can set to run at a specific time of day via cron.. or even make into a system-wide command for other users to use.

OK so hopefully I’ve convinced you that it can be very useful to have a long, persistent, bash_history file. But how do you configure the shell so that it does this for you? The following are the magic customization lines that I use on my personal desktops, laptops, and any other trusted computers that I think are reasonably free from the risk of people hacking in just to retrieve my .bash_history file:

## bash history db
# increase the history file size to 20,000 lines
export HISTSIZE=20000
# append all commands to the history file, don't overwrite it at the start of every new session
shopt -s histappend

The above will give you an (almost) infinite bash_history file. It will start deleting old commands at 20,000 lines, ie: 20,000 commands. Make sure you have enough disk space for that. My .bash_history file is currently at around 200KB, not a huge file by any means. I’d say it will grow to 400-600KB max. If you want to calculate approximately how much space it will use, in bytes, it’s the number of characters in your average linux command x 20,000.
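The size estimate above is simple enough to sketch in a few lines of Python. The function names here are hypothetical, and I have added one byte per line for the newline character, which is my own assumption rather than part of the rule of thumb above.

```python
def average_command_length(history_text):
    """Mean number of characters per non-empty line of a history file."""
    lines = [line for line in history_text.splitlines() if line]
    return sum(len(line) for line in lines) / len(lines) if lines else 0.0


def estimate_history_bytes(average_command_chars, max_lines=20000):
    """Rough upper bound: each stored line is the command plus a newline."""
    return (average_command_chars + 1) * max_lines
```

To try it on your own history, read ~/.bash_history into a string, pass it to average_command_length, and feed the result into estimate_history_bytes. An average of 24 characters per command gives 500,000 bytes, which is in line with the 400-600KB guess above.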

My minimal VIM config

This is the absolute minimum I do when I have to log onto a new server or shell account that I haven’t used before, that I will need to edit text files with.

First I figure out whether VIM is really installed. A lot of installs, especially those based on ubuntu, ship with VI aliased to VIM, but the VIM install is usually not really VIM at all, and behaves exactly like VI but with some minor bugs fixed. This is not what I want.

So first I figure out what distribution of linux I’m using through executing the following command:

cat /etc/issue

Then if it’s ubuntu, which doesn’t ship with the full VIM package on a lot of default installs, then I usually do this, presuming I have admin access. In practice I usually have admin access because people are generous with this when they want you to fix their server =) Anyway, if I have admin access, I install ubuntu’s ‘vim full’ package, which is aliased as ‘vim’:

sudo apt-get install vim

Now I can move onto my config. Occasionally there will be a global system config, but I probably want to override that anyway. So I create a vim configuration file specific to me in my home directory:

set bg=dark
set backspace=2

The first line sets the background to be dark, so I can see what is going on when I use a dark terminal program, such as putty, mac osx’s terminal.. in fact nearly all terminal programs use a dark background, so this setting is almost compulsory.

The second line configures the behaviour of the backspace key, so when I go to the start of a line and press backspace, it adopts the conventional word processor behaviour of moving up to the end of the line above. Otherwise it uses the default VI behaviour, which is probably not intuitive at all to anyone who didn’t grow up on UNIX mainframes and such.

The very existence of a user-supplied configuration file will also jolt the VIM editor into ‘non compatible mode’, where it figures out automatically that it should be doing all the advanced VIM things, instead of just acting as a VI replacement. This should mean that if you create a config file, syntax highlighting is already turned on, another must for me. Otherwise you can explicitly set it with the line ‘syntax on’, but I never have to do this anymore.

And that’s it.

Using the Linux command ‘Watch’ to test Cron jobs and more

OK, so you have added a cron job that you want to perform a routine task every day at 6am. How do you test it?

You probably don’t want to spend all night waiting for it to execute, and there’s every chance that when it does execute, you won’t be able to find out whether it is executing properly – the task might take 30 minutes to run, for example. So every time you debug it and want to test it again, you have to wait until 6am the following day.

So instead, configure that cron job to run a bit earlier than that, say in 10 minutes, and monitor the execution with a ‘watch’ command, so you can see if it’s doing what you want it to.

‘watch’ is a great command that will run a command at frequent intervals, by default, every 2 seconds. It’s very useful when chained with the ‘ps’ command, like the following:

watch 'ps aux | grep bash'

What that command will do is continually monitor your server, and maintain a list, updated every 2 seconds, of every instance of the bash shell. When someone logs in and spawns a new bash shell, you’ll know about it. When a cron’d command runs that invokes a bash shell before executing a shellscript, you’ll know about it. When someone writes a bad shell script, and runs it invoking about 100 bash shells by accident, flooding your server’s memory, you’ll know about it.

OK so back to the cron example. Suppose I’m testing a cronjob that should invoke a shell script that runs an rsync command. I just set the cron job to run in 5 minutes, then run this command:

watch 'ps aux | grep rsync'

Here is the result.. every single rsync command that is running on my server is displayed, and the list is updated every 2 seconds:

Every 2.0s: ps aux | grep rsync                                              Sat Mar 13 15:59:35 2010

root     16026  0.0  0.0   1752   480 ?        Ss   15:28   0:00 /bin/sh -c /opt/remote/rsync-matt/cr
root     16027  0.0  0.0   1752   488 ?        S    15:28   0:00 /bin/sh /opt/remote/rsync-matt/crond
root     16032  0.0  0.1   3632  1176 ?        S    15:28   0:00 rsync -avvz --remove-source-files -P
root     16033  0.5  0.4   7308  4436 ?        R    15:28   0:09 ssh -l david someotherhost rsync --se
root     16045  0.4  0.1   4152  1244 ?        S    15:28   0:07 rsync -avvz --remove-source-files -P
root     18184  0.0  0.1   3176  1000 pts/2    R+   15:59   0:00 watch ps aux | grep rsync
root     18197  0.0  0.0   3176   296 pts/2    S+   15:59   0:00 watch ps aux | grep rsync
root     18198  0.0  0.0   1752   484 pts/2    S+   15:59   0:00 sh -c ps aux | grep rsync

Now I can see the time ticking away, and when the cron job is run, I can watch in real-time as it invokes rsync, and I can keep monitoring it to make sure all is running smoothly. This proves to be very useful when troubleshooting cron jobs.

You can also run two commands at the same time. You can actually tail a log file and combine it with the process monitoring like so:

watch 'tail /var/log/messages && ps aux | grep rsync'

Try this yourself. It constantly prints out the last ten lines of the standard messages log file every two seconds, while monitoring the number of rsync processes running, and the commands used to invoke them. Tailor it to the cron’d job you wish to test.

Watch can be used to keep an eye on other things also. If you’re running a multi-user server and you want to see who’s logged on at any one time, you can run this command:

watch 'echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never'

This chains 5 commands together. It will keep you updated with the current list of users logged in to your system, and it will also give you a constantly updated list of those users who have ever logged in before, with their last login time.

The following shows the output of that command above on a multi-user server I administrate, and will refresh with current information every 2 seconds until I exit it:

Every 2.0s: echo CURRENT: && who && echo LASTLOGIN: && lastlog | grep -v Never                                                             Sat Mar 13 07:48:32 2010

mark     tty1         2010-02-23 11:08
david    pts/2        2010-03-13 07:48 (wherever)
mike     pts/4        2010-02-26 07:53 (wherever)
mike     pts/5        2010-02-26 07:53 (wherever)

Username         Port     From           Latest
mark               pts/6    wherever      Thu Mar 11 23:24:36 -0800 2010
mike               pts/0    wherever      Sat Mar 13 03:54:28 -0800 2010
dan                pts/4    wherever      Fri Jan  1 08:46:29 -0800 2010
sam                pts/1    wherever      Sat Jan 30 08:06:01 -0800 2010
rei                pts/2    wherever      Thu Dec 10 11:45:39 -0800 2009
david              pts/2    wherever      Sat Mar 13 07:48:05 -0800 2010

This shows that mark, david and mike are currently logged on. Mark is logged in on the server’s physical monitor and keyboard (tty1). Everyone else is logged in remotely. Mike currently has two connections, or sessions, on the server. We can also see the list of users that have logged in before – ie: are active users, and when they last logged on. I immediately notice, for example, that rei hasn’t logged in for about 3 months and probably isn’t using her account.

(Normally this command will also provide IP addresses and hostnames of where the users have logged on from, but I’ve replaced those with ‘wherever’ for privacy reasons)

So.. you can see that the ‘watch’ command can be a useful window into what is happening, in real-time, on your servers.
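If you ever find yourself on a machine without ‘watch’, the idea is easy to approximate. The following is only a rough Python sketch of the polling loop, not how ‘watch’ itself is implemented: the function name and its parameters are my own, and unlike ‘watch’ it stops after a fixed number of runs instead of clearing the screen and looping until you exit.

```python
import subprocess
import time


def watch(command, interval=2.0, iterations=3):
    """Run a shell command repeatedly and yield each captured output."""
    for count in range(iterations):
        # shell=True so that pipelines such as "ps aux | grep rsync" work.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        yield result.stdout
        # Sleep between runs, but not after the final one.
        if count < iterations - 1:
            time.sleep(interval)
```

For example, looping over watch("ps aux | grep rsync") and printing each result gives three snapshots of the rsync processes, two seconds apart.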

Changing the default “From:” email address for emails sent via PHP on Linux

I’ve had to solve this problem a couple of times at least, and it’s quite a common task, so I thought I’d document it here.

When you send emails to users of your site through using the PHP mail() function, they will sometimes turn up in the mailbox of customers of your site with the following from address:

From: Root <>

This makes absolutely no sense to your customers, and often they will think it is spam and delete it. Often, the decision will be made for them by their webmail host, and they will never even see the email. You don’t want this to happen.

Writing email templates that appear “trustworthy” and have a low chance of being mislabelled as spam by the webmail companies is quite a difficult task, and there’s quite a bit to know about it. However it is quite easy to change the default “From:” email address that PHP sends your emails as, and that will definitely help.

Assuming you’re running a linux server using sendmail, all you have to do is this.

First create an email address that you want the customers to see, by editing the /etc/aliases file and running the command newaliases. I created an email address called

Then change the following sendmail_path line in your php.ini file to something like this:

sendmail_path = /usr/sbin/sendmail -t -i -F 'customer-emails' -f 'Customer Emails <>'

Broken down, those extra options are:
-F 'customer-emails' # the name of the sender
-f 'Customer Emails <>' # the email From header, which should have the name matching the email address, and it should be the same email address as above

Then restart apache, and it should load the php.ini file changes. Test it by sending a couple of emails to your email address, and you should see emails sent out like this:

From: Customer Emails <>

Shell scripts for converting between Unix and Windows text file formats

I’ve been using these shell scripts I wrote to convert between unix and windows text file formats. They seem to work well without any problems. If you put them in the /usr/sbin/ directory, they will be accessible on the path of the linux admin account root.

# Converts a unix text file to a windows text file.
# usage: unix2win <text file to convert>
# requirements: sed version 4.2 or later, check with sed --version
sed -i -e 's/$/\r/' "$1"

# Converts a windows text file to a unix text file.
# usage: win2unix <text file to convert>
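# write to a temporary file first: reading and writing the same file in one pipeline can truncate it
tr -d '\015' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"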

I use these scripts with the combination of find and xargs to convert lots of log files into windows format with the following command. However this type of command can be dangerous, so don’t use it if you don’t know what you’re doing:

find sync-logs/ -name '*.log' -type f | xargs -n1 unix2win
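For comparison, here is a minimal Python sketch of the same two conversions. The function names are hypothetical, and the behaviour differs slightly from the shell scripts above: it only touches carriage returns that are part of a line ending, and converting a file that is already in windows format leaves it unchanged instead of doubling up the carriage returns.

```python
def unix_to_windows(data: bytes) -> bytes:
    """Convert LF line endings to CRLF, leaving existing CRLF pairs alone."""
    normalised = data.replace(b"\r\n", b"\n")
    return normalised.replace(b"\n", b"\r\n")


def windows_to_unix(data: bytes) -> bytes:
    """Convert CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path, converter):
    """Read the whole file, then write the converted bytes back in place."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(converter(data))
```

Calling convert_file on a log file with unix_to_windows as the converter does the same job as unix2win. The file is opened in binary mode so that Python does not translate the line endings itself.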

Site Redesign

I’ve just updated the design of this blog, re-enabled comments and added a contact tab. I’ve installed a strong anti-spam comment filter, but you should now be able to comment on entries. I’ve also changed the layout of things slightly, and made it easier to read.

PHP Sample – HTML Page Fetcher and Parser

Back in 2008, I wrote a PHP class that fetched an arbitrary URL, parsed it, and converted it into a PHP object with different attributes for the different elements of the page. I recently updated it and sent it along to a company that wanted a programming example to show I could code in PHP.

I thought someone may well find a use for it – I’ve used the class in several different web scraping applications, and I found it handy. From the readme:

This is a class I wrote back in 2008 to help me pull down and parse HTML pages. I updated it on
14/01/10 to print the results in a nicer way to the commandline.

- David Craddock (


It uses CURL to pull down a page from a URL, and sorts it into a 'Page' object
which has different attributes for the different HTML properties of the page
structure. By default it will also print the page object's properties neatly
onto the commandline as part of its unit test.


* README.txt - this file
* page.php - The PHP Class
* LIB_http.php - a lightweight external library that I used. It is just a very light wrapper around CURL's HTTP functions.
* expected-result.txt - output of the unit tests on my development machine
* curl-cookie-jar.txt - this file will be created when you run the page.php's unit test


You will need CURL installed, PHP's DOMXPATH functions available, and the PHP 
command line interface. It was tested on PHP5 on OSX.


Use the php commandline executable to run the page.php unit tests. IE:
$ php page.php

You should see a bunch of information being printed out, you can use:
$ php page.php > result.txt

That will output the info to result.txt so you can read it at will.

Here’s an example of one of the unit tests, which fetches this frontpage and parses it:

*** Page Print of ***

** Transfer Status
+ URL Retrieved:
+ CURL Fetch Status:
    [url] =>
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 237
    [request_size] => 175
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 1.490972
    [namelookup_time] => 5.3E-5
    [connect_time] => 0.175803
    [pretransfer_time] => 0.175812
    [size_upload] => 0
    [size_download] => 30416
    [speed_download] => 20400
    [speed_upload] => 0
    [download_content_length] => 30416
    [upload_content_length] => 0
    [starttransfer_time] => 0.714943
    [redirect_time] => 0

** Header
+ Title: Random Eye Movement  
+ Meta Desc:
Not Set
+ Meta Keywords:
Not Set
+ Meta Robots:
Not Set
** Flags
+ Has Frames?:
+ Has body content been parsed?:

** Non Html Tags
+ Tags scanned for:
Tag Type: script tags processed: 4
Tag Type: embed tags processed: 1
Tag Type: style tags processed: 0

+ Tag contents:
    [ script ] => Array
            [0] => Array
                    [src] =>
                    [type] => 
                    [isinline] => 
                    [content] => 

            [1] => Array
                    [src] =>
                    [type] => text/javascript
                    [isinline] => 
                    [content] => 

            [2] => Array
                    [src] => 
                    [type] => 
                    [isinline] => 1
                    [content] => 
                 var odesk_widgets_width = 340;
                var odesk_widgets_height = 230;

            [3] => Array
                    [src] =>
                    [type] => 
                    [isinline] => 
                    [content] => 

            [count] => 4

    [ embed ] => Array
            [0] => Array
                    [src] =>
                    [type] => application/x-shockwave-flash
                    [isinline] => 
                    [content] => 

            [count] => 1

    [ style ] => Array
            [count] => 0


*** Page Print of Finished ***

If you want to download a copy, the file is below. If you find it useful for you, a pingback would be appreciated.

Config files for the Windows version of VIM

Today I encountered problems configuring the windows version of the popular text editor VIM, so I thought I’d write up a quick post about configuration files under the Windows version, in case anyone becomes stuck like I did. I use Linux, OSX and Windows on a day-to-day basis, and VIM as a text editor for a lot of quick edits on all three platforms. Here’s a quick comparison:


Linux is easy because that’s what most people who use VIM run, and so it is very well tested.

~/.vimrc – Configuration file for command line vim.
~/.gvimrc – Configuration file for gui vim.


OSX is simple also, as it’s based on unix:

~/.vimrc – Configuration file for command line vim.
~/.gvimrc – Configuration file for gui vim.


Windows is not easy at all.. it doesn’t have a unix file structure, and doesn’t have support for the unix hidden file names, that start with a ‘.’, ie: ‘.vimrc’, ‘.bashrc’, and so on. Most open-source programs like VIM that require these hidden configuration files, and have been ported over to windows, seem to adopt this naming convention: ‘_vimrc’, ‘_bashrc’.. and so forth. So:

_vimrc – Configuration file for command line vim.
_gvimrc – Configuration file for gui vim.

Renaming configuration files from “.” to “_” wouldn’t make much difference on its own. You’d have to rename your files, but.. big deal. It’s not much of a problem.

Another, more tricky, problem you may encounter however, is that there’s no clear home directory on windows systems. Each major incarnation of windows seems to have a slightly different way of dealing with user’s files.. from 2000 to XP, a change, from XP to Vista, there is a change. I haven’t tried VIM on W7 yet, but it seems similar to Vista in structure, so this information may actually be consistent to W7.

The Vista 64 version of VIM I have looks in another place for configuration files. For a global configuration file, it looks in “C:\Program Files”. Yes.. “C:\Program Files”. According to Vista 64’s version of VIM.. that’s the exact directory where I installed VIM. This is clearly not right. What’s happening is that the file system on windows is different to the unix-type file systems, and the VIM port is having problems adapting. The real VIM install directory is C:\Program Files\vim72. Because VIM is looking for a global configuration file in “C:\Program Files\_vimrc”, it’ll never find it.

Now you could override this with a batch file that sets the right environmental variables on startup, or you could change the environmental variables exported in windows, but I prefer to have a user-specified configuration file in my personal files directory, as it’s easier to backup and manage. If you wanted to specify the environmental variables yourself, which I’m guessing many will, the two environmental variables to override are:

$VIM = the VIM install directory, not always set properly, as I mentioned.
$HOME = the logged in user’s documents and settings directory, in windows speak this is also where the ‘user profile’ is stored, which is a collection of settings and configurations for the user. The exact directory will depend on which version of Windows you’re running, and if you override the HOME folder, you may have problems with other programs that rely on it being static.

On my Windows Vista 64 install:

$VIM = “C:\Program Files”
$HOME = “C:\Users\Dave”

You can see what files VIM includes by running the handy command

vim -V

at a command prompt; it will go through the different settings and output something similar to this:

Searching for "C:\Users\Dave/vimfiles\filetype.vim"
Searching for "C:\Program Files/vimfiles\filetype.vim"
Searching for "C:\Program Files\vim72\filetype.vim"
line 49: sourcing "C:\Program Files\vim72\filetype.vim"
finished sourcing C:\Program Files\vim72\filetype.vim
continuing in C:\Users\Dave\_vimrc
Searching for "C:\Program Files/vimfiles/after\filetype.vim"
Searching for "C:\Users\Dave/vimfiles/after\filetype.vim"
Searching for "ftplugin.vim" in "C:\Users\Dave/vimfiles,C:\Program Files/vimfiles,C:\Program Files\vim72,C:\Program Files/vimfiles/after,C:\Users\Dave/vimfiles/after"
Searching for "C:\Users\Dave/vimfiles\ftplugin.vim"
Searching for "C:\Program Files/vimfiles\ftplugin.vim"
Searching for "C:\Program Files\vim72\ftplugin.vim"
line 49: sourcing "C:\Program Files\vim72\ftplugin.vim"
finished sourcing C:\Program Files\vim72\ftplugin.vim
continuing in C:\Users\Dave\_vimrc
Searching for "C:\Program Files/vimfiles/after\ftplugin.vim"
Searching for "C:\Users\Dave/vimfiles/after\ftplugin.vim"
finished sourcing $HOME\_vimrc
Searching for "plugin/**/*.vim" in "C:\Users\Dave/vimfiles,C:\Program Files/vimfiles,C:\Program Files\vim72,C:\Program Files/vimfiles/after,C:\Users\Dave/vimfiles/after"
Searching for "C:\Users\Dave/vimfiles\plugin/**/*.vim"
Searching for "C:\Program Files/vimfiles\plugin/**/*.vim"
Searching for "C:\Program Files\vim72\plugin/**/*.vim"
sourcing "C:\Program Files\vim72\plugin\getscriptPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\getscriptPlugin.vim
sourcing "C:\Program Files\vim72\plugin\gzip.vim"
finished sourcing C:\Program Files\vim72\plugin\gzip.vim
sourcing "C:\Program Files\vim72\plugin\matchparen.vim"
finished sourcing C:\Program Files\vim72\plugin\matchparen.vim
sourcing "C:\Program Files\vim72\plugin\netrwPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\netrwPlugin.vim
sourcing "C:\Program Files\vim72\plugin\rrhelper.vim"
finished sourcing C:\Program Files\vim72\plugin\rrhelper.vim
sourcing "C:\Program Files\vim72\plugin\spellfile.vim"
finished sourcing C:\Program Files\vim72\plugin\spellfile.vim
sourcing "C:\Program Files\vim72\plugin\tarPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\tarPlugin.vim
sourcing "C:\Program Files\vim72\plugin\tohtml.vim"
finished sourcing C:\Program Files\vim72\plugin\tohtml.vim
sourcing "C:\Program Files\vim72\plugin\vimballPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\vimballPlugin.vim
sourcing "C:\Program Files\vim72\plugin\zipPlugin.vim"
finished sourcing C:\Program Files\vim72\plugin\zipPlugin.vim
Searching for "C:\Program Files/vimfiles/after\plugin/**/*.vim"
Searching for "C:\Users\Dave/vimfiles/after\plugin/**/*.vim"
Reading viminfo file "C:\Users\Dave\_viminfo" info
Press ENTER or type command to continue
Press ENTER or type command to continue

Notice how it does pull in all the syntax highlighting macros and other extension files correctly, which are specified in the .vim files above.. but it doesn’t pull in the global configuration files that I’ve also copied to C:\Program Files\vim72\_gvimrc and C:\Program Files\vim72\_vimrc. However, it does pick up the files I copied to C:\Users\Dave.. both C:\Users\Dave\_vimrc and C:\Users\Dave\_gvimrc are picked up, although VIM will normally only read ‘_gvimrc’ when the gui version of VIM is run (called gvim).

To see exactly what those environmental variables are being set to, issue these two commands when you’re inside the editor, and their values will be shown in the editor:

:echo $HOME
:echo $VIM

It seems to make sense for me – and perhaps you, if you’re working with VIM on windows – to place my _vimrc and _gvimrc configuration files in $HOME in Vista. They are then picked up without having to worry about explicitly defining any environmental variables, creating a batch file, or any other hassle.

You can do this easily by the following two commands:

:ed $HOME\_vimrc
:sp $HOME\_gvimrc

That will open the two new configuration files in a split window, and you can paste in your existing configuration that you’ve used in Linux. Save them both, and VIM will pick them up the next time you start it.

Regex in VIM.. simple

There are more than a gazillion ways to use regexs. I am sure they are each very useful for their own subset of problems. The sheer variety can be highly confusing and scary for a lot of people though, and you only need to use a few approaches to accomplish most text-editing tasks.

Here is a simple method for using regex in the powerful text editor VIM that will work well for common use.


We are going to take the “search and delete a word” problem for an example. We want to delete all instances of the singular noun “needle” in a text file. Let’s assume there are no instances of the pluralisation “needles” in our document.

  1. Debug on.. turn some VIM options on
    :set hlsearch
    :set wrapscan
    – this will make all regex expressions possible to debug by visually showing what they match in your document (first line) and make all searches wrap around instead of just search forward from your current position, which is the default. (second line)
  2. Develop and Test.. your regex attempts by using a simple search. Here we see three attempts at solving the problem: :/needl
    – our third try is correct, and highlights all words that spell “needle”. The \< and \> markers allow you to specify the beginning and the end of a word. Play with different regexs using the simple search and watching what is highlighted, until you discover one that works for you.

  3. Run… your regex :%s/\<needle\>//g – once you’ve figured out a regex, run it on your document. This example will execute a search for the word “needle” and delete every one. If you wanted to replace needle with another word, you would put the new word in between the // marks. As we can see, there is nothing between the marks in this example, so it will replace instances of “needle” with nothing. This means it will serve to delete every instance of the word “needle”.
  4. Check things are OK… with your document :/\<needle\>
    – has the regex done what you want? Use the search function to check. The above examples show different searches through the document to see if different variations remain. Any matches of these searches will highlight any problems. You can use lower-case n (next search result) and upper-case N (previous search result) to navigate through any found search results. You must remember to manually look through the document and see what the regex has changed, make sure there aren’t any unwanted surprises!
  5. Recover… from any mistakes u – just press the U key (with no capslock or shift). This will undo the very last change made to the document.
  6. Redo… any work that you need to <ctrl>-r – use the redo function; press the CONTROL and R keys together (with no capslock or shift). This will redo the last change made to the document.
  7. Finish up and Write… to file :w – write your work on the document to file. Even after you have written out to file, you can probably still use the undo function to get back to where you were, but it’s best practice to not rely on this, and only write once you’re done.
  8. Debug off.. turn some options off
    :set nohlsearch
    :set nowrapscan
    – turn off the regular expression highlighting (line 1). turn off the wraparound searching (line 2). You can leave either or both options on if you want, they’re often useful. Up to you.

Use a combination of these wonderful commands to test and improve your regex development skills in VIM.
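The same “test first, then substitute” workflow carries over to other regex tools. Here is a small Python sketch of the needle example, where Python’s \b word boundary stands in for VIM’s \< and \> markers. The function names are hypothetical, just for illustration.

```python
import re

# \b marks a word boundary, playing the role of VIM's \< and \> markers.
WORD = re.compile(r"\bneedle\b")


def preview(text):
    """Return the (start, end) span of every whole-word match, like a test search."""
    return [match.span() for match in WORD.finditer(text)]


def delete_word(text):
    """Replace every whole-word match with nothing."""
    return WORD.sub("", text)
```

Running preview before delete_word is the equivalent of the highlighted search in step 2: you can see exactly what will be matched, and that “needles” is left alone, before you change anything.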


Here I use the shorthand “#…” to denote comments on what I’m doing… if you want to copy and paste the example as written, then you will have to remove those comments.

1. Remove ancient uppercase <BLINK> tags from a document.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BLINK> # try 1.. bingo! first time.. selected all tags I want
:%s/<BLINK>//g # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<BLINK> # check 3.. yep looks ok... the problem tags are gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

2. Oh no! We missed some lower and mixedcase <bLiNK> tags that some sneaky person slipped in. Let’s take them out.

:set wrapscan # debug on
:set hlsearch # debug on
:/<blink> # try 1.. hm.. worked for many, but didnt match BlInK or blINK mixedcase
:/\c<blink> # try 2.. much better.. \c makes the search ignore case.. seems to have worked!
:%s/<blink>//gi # lets execute my regex remove
:/BLINK # check 1.. testing things are OK in my file by searching through..
:/blinked # check 2.. yep thats ok..
:/<blink> # check 3.. yep thats fine.
:/\c<blink> # check 4.. looks good... problem solved
# ...manual scroll through the document.. looks much better!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off

3. Replacing uppercase or mixedcase <BR> tags with the more modern <br>.

:set wrapscan # debug on
:set hlsearch # debug on
:/<BR> # try 1.. hmm.. just uppercase.. not gonna work..
:/<br> # try 2.. hmm.. just lowercase..
:/\c<BR> # try 3.. ahh.. that'll be it then
:%s/<BR>/<br>/gi # lets execute my regex substitution
:/BR # check 1.. testing things are OK in my file by searching through..
:/br # check 2.. yep thats ok..
:/bR # check 3 ..yup..
:/\c<BR> # check 4.. yep looks ok... the problem tags seem to be gone
# ...manual scroll through the document.. looks good!
:w # write out to file
:set nohlsearch # debug off
:set nowrapscan # debug off
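
If you want to double-check a substitution like example 3 outside of VIM, here is a minimal Python 3 sketch of the same idea using the standard re module. The function name and the sample text are made up for illustration; it is a rough equivalent of :%s/<BR>/<br>/gi, not a replacement for checking the file in VIM.

```python
import re

def normalise_br(text):
    """Replace <BR>, <Br>, <bR> and <br> with lowercase <br>."""
    # re.IGNORECASE plays the role of the 'i' flag on the VIM substitute,
    # and re.sub replaces every match, like the 'g' flag.
    return re.sub(r"<br>", "<br>", text, flags=re.IGNORECASE)

sample = "one<BR>two<Br>three<br>"
print(normalise_br(sample))  # prints one<br>two<br>three<br>
```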

For More..

Regexes are the gift that just keeps on giving. Here are some good resources on regexes in general, and regexes in VIM.

VirtualHosts on CentOS

A common task when setting up an Apache webserver under Linux is writing an httpd.conf file. The httpd.conf file is the main configuration file for Apache. One of the main reasons to edit the httpd.conf file is to set up virtual hosts in Apache. A virtual host configuration allows several different domains to be run off a single instance of Apache, on a single IP. Each host is a ‘virtual host’, and typically has a different web root, log file, and any number of subdomains aliased to it. The virtual hosts are configured in parts of the httpd.conf file that look like this:

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common

Now on Ubuntu, virtual hosts are made easy. The httpd.conf is split into several files. Each virtual host has a different file in /etc/apache2/sites-available. When you want to activate a particular virtual host, you create a symbolic link from /etc/apache2/sites-enabled/mysite to /etc/apache2/sites-available/mysite (if you wanted to call your site configuration file ‘mysite’). When Apache boots up, it loads all the files it can find in /etc/apache2/sites-enabled/* and that determines which virtual hosts it loads. If there is not a link from /etc/apache2/sites-enabled/ to your virtual host file, it won’t load it. So you can easily remove the links in /etc/apache2/sites-enabled without deleting the actual virtual host file. Therefore you can easily toggle which virtual hosts get loaded.
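
To make the toggling idea concrete, here is a small Python 3 sketch of what enabling and disabling a site boils down to. The function names are my own, and the directories are passed in as arguments so you can try it on a scratch directory first. On a real Ubuntu box the a2ensite and a2dissite commands do this for you.

```python
import os

def enable_site(name, available_dir, enabled_dir):
    """Create enabled_dir/name as a symlink to available_dir/name."""
    target = os.path.join(available_dir, name)
    link = os.path.join(enabled_dir, name)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if not os.path.islink(link):
        os.symlink(target, link)
    return link

def disable_site(name, enabled_dir):
    """Remove only the symlink; the file in sites-available is untouched."""
    link = os.path.join(enabled_dir, name)
    if os.path.islink(link):
        os.remove(link)
```

Apache still has to be reloaded afterwards before it notices the change.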

CentOS uses a different structure. Everything is lumped into /etc/httpd/conf/httpd.conf. So there is no way to easily toggle virtual hosts on/off, and everything is a bit more chaotic. I’ve just had to set up a new CentOS webserver, and I struggled for a bit after being used to ubuntu-server. Here’s a format you can use if you’re in the same boat, and you have to set up httpd.conf files for CentOS:

# this is essential for name-based switching
NameVirtualHost *:80

# an example of a simple VirtualHost that serves data from 
# /var/www/html/ to anyone who types in
# to the browser

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common 

# an example of a VirtualHost with apache overrides allowed, this means you can use
# .htaccess files in the server's web root to change your config dynamically

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common
      AllowOverride All
      AllowOverride AuthConfig
      Order Allow,Deny
      Allow from All

# an example of a VirtualHost with apache overrides allowed, and two subdomains
# (mail and www) that both point to the same web root

    DocumentRoot /var/www/html/
    ErrorLog logs/
    CustomLog logs/ common
      AllowOverride All
      AllowOverride AuthConfig
      Order Allow,Deny
      Allow from All

# .. etc

With the above structure, you can add as many VirtualHosts to your configuration as you have memory to support (typically dozens). Apache will decide which one to use based on the ‘ServerName’ specified in each VirtualHost section. Just remember to add that all-important NameVirtualHost *:80 at the beginning.
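
If you end up with a lot of these sections, it can help to generate them. Here is a rough Python 3 sketch that prints one name-based VirtualHost section. The domain example.com, the log file names and the function name are all placeholders I've made up; the directives themselves (ServerName, ServerAlias, DocumentRoot, ErrorLog, CustomLog) are standard Apache ones.

```python
def render_vhost(server_name, document_root, aliases=(), port=80):
    """Return the text of one name-based <VirtualHost> section."""
    lines = [
        "<VirtualHost *:%d>" % port,
        "    ServerName %s" % server_name,
    ]
    if aliases:
        # Subdomains that should be served from the same web root.
        lines.append("    ServerAlias %s" % " ".join(aliases))
    lines += [
        "    DocumentRoot %s" % document_root,
        # Hypothetical log file names, one pair per host.
        "    ErrorLog logs/%s-error_log" % server_name,
        "    CustomLog logs/%s-access_log common" % server_name,
        "</VirtualHost>",
    ]
    return "\n".join(lines)

print(render_vhost("example.com", "/var/www/html/example.com/",
                   aliases=["www.example.com", "mail.example.com"]))
```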

Once you’ve got your httpd.conf file the way you like it, be sure to test it before you restart apache. If you restart apache and your httpd.conf file has errors in it, Apache will abort the load process. This means that all the websites on your webserver will fail to load. I always use apachectl -t or apache2ctl -t before I restart. That will parse the httpd.conf file and check the syntax. Once that’s OK, then you can issue a /etc/init.d/httpd restart to restart Apache.
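
The test-then-restart habit is easy to script. This is a minimal Python 3 sketch, assuming apachectl is on the PATH and that the init script lives at /etc/init.d/httpd as above. The function name and the injectable run argument are my own additions, so that the logic can be tried without touching a real server.

```python
import subprocess

def restart_if_config_ok(run=subprocess.call,
                         test_cmd=("apachectl", "-t"),
                         restart_cmd=("/etc/init.d/httpd", "restart")):
    """Run the syntax check first; restart only when it exits with status 0."""
    if run(list(test_cmd)) != 0:
        # Syntax check failed: leave the running Apache alone.
        return False
    return run(list(restart_cmd)) == 0
```

Calling restart_if_config_ok() with no arguments runs the real commands, so it needs the same privileges you would need to restart Apache by hand.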

MicroKORG + Python = MIDI fun!

So, about a month ago I got a second-hand microKORG from Ebay. Fiddling around with the preset patches, and creating new patches is great fun, even though I only know a few chords. Recently I plugged it in to my PC via my M-Audio Uno USB->MIDI interface, and soon was using Ableton Live to program drums in time with the microKORG’s arp.

I thought I’d experiment with the music libraries available in python, and see if I could send notes to the synth via MIDI. Turns out that the M-Audio Uno is supported under Ubuntu; all you have to do is install the midisport-firmware package. With the help of pyrtmidi, a set of python wrappers around the C++ audio library rtmidi, I was able to receive MIDI signals in realtime from the microKORG, and send them in realtime also. With the help of this old midi file reader/writer library that I found posted to a python mailing list, I’ve made some progress in writing a simple MIDI file player that sends notes to the ‘KORG.
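
For anyone curious what actually goes down the wire, here is a small Python 3 sketch of the raw bytes behind a note. It deliberately does not use pyrtmidi (I don't want to misquote its API); it only builds the standard three-byte Note On and Note Off messages, which you would then hand to whatever MIDI output your library gives you. The function names are my own.

```python
NOTE_ON = 0x90   # status byte for Note On, channel in the low 4 bits
NOTE_OFF = 0x80  # status byte for Note Off, channel in the low 4 bits

def note_on(note, velocity=100, channel=0):
    """Return the three bytes of a Note On message."""
    return [NOTE_ON | (channel & 0x0F), note & 0x7F, velocity & 0x7F]

def note_off(note, channel=0):
    """Return the three bytes of a Note Off message."""
    return [NOTE_OFF | (channel & 0x0F), note & 0x7F, 0]

# Note number 60 (middle C) on the first MIDI channel.
print(note_on(60))   # prints [144, 60, 100]
print(note_off(60))  # prints [128, 60, 0]
```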

Eclipse 3.4.2 + Pydev + Eclim = win

So, after saying all that stuff about how vimplugin and EasyEclipse were great, I actually started to use the setup heavily, and it started to annoy me.

For one, EE is not a recent build of eclipse, nor does it come with a full set of recent plugins. This makes it annoyingly difficult to use when you want to use more than the set of plugins it packages for you. As far as vimplugin goes, it does not provide the vim integration I thought it might from embedded vim. Not really even close.

What I use now, after lots of trial and error and at least 4 reinstalls of Eclipse, is a combination of Eclipse 3.4.2, Eclim (which is the most mature of the free vi-binding plugins around, and actually includes an improved version of the vimplugin previously mentioned), and the latest pydev, Mylyn and Subclipse.

I’m using it now to refactor a largeish python project, and I’m really appreciating the help it gives me. Definitely worth trying an Eclipse setup similar to this if you’re writing any python apps that are more than small-scale.

EasyEclipse + Vimplugin for Python Development

Up until now, I’ve always used the terminal for programming development on my projects. Because I’m so familiar with the advanced text editor vim, I can get a lot done on the command line, and it doesn’t hide what is actually going on behind the scenes, as a lot of IDEs seem to do.

However, in reading the book Foundations of Agile Python Development (which I recommend highly), and through working in software houses using IDEs only, I’ve come to realise that I need to gain at least some familiarity with an IDE.

So I’ve decided to try out Eclipse. I fiddled around with the Eclipse version in the Ubuntu 8.10 repositories for a while, with little success. I wanted to install pydev and vimplugin. Pydev is an eclipse python development environment. Vimplugin allows vim keybindings, and can actually embed the gvim editor within Eclipse. I tried for a few hours, but couldn’t get it all working with the stock Eclipse version in the Ubuntu repositories.

So I thought I’d try out EasyEclipse. EasyEclipse bundles a stable version of Eclipse with pydev in its “Easy Eclipse for Python Development” distribution, and that worked a charm. I then installed vimplugin which worked immediately when enabled, and supported embedded VIM mode within Eclipse. In the screenshot below, you should be able to (just about!) see what I mean, gvim is embedded into Eclipse:

Google Sync for Mobile

If you use Google Calendar, and you’ve got an iPhone or a Windows Mobile phone like me, then you’ll be pleased to hear about the new Google Sync for Mobile tool just brought out into beta by Google. There were various ways to sync Google Calendar events to Windows Mobile devices before, but nothing officially supported. Google uses an ActiveSync server to push the events to your phone, making things a lot easier. To quote:

Google Sync for Mobile is available for most mobile phones. On iPhone, Blackberry and Windows Mobile devices Google Sync enables over-the-air synchronization of Google Calendar and Google Contacts to the built-in Calendar and Address Book applications on your phone. On most other mobile phones, Google Sync enables wireless synchronization of Google Contacts to the built-in Address Book application.

One Laptop Per Child – My XO Laptop

I did something out of the ordinary this Christmas. I bought an “XO” laptop for a child in a third world country. I also bought an XO laptop for myself, so I can develop software designed to be distributed to the 1 million+ XO laptops out there in the third world.

The laptop runs a Linux operating system, with a special interface programmed almost exclusively in Python. Most apps, called ‘activities’, run as python programs. This is ideal for me, as I enjoy hacking around in python, and Linux is – of course – very familiar to me.

I hope to use the laptop to contribute to health informatics applications that allow the laptop to be used in hospitals where there are no existing comparable health systems. If you wish to get involved in this project and give something back, here in the UK you can now order your own laptop from

The Blog Factory

I’ve started my own blog consulting business, helping people set up their own blogs, either for their company or for personal use. It’s called The Blog Factory, see the site for more information on what we do. In a nutshell, we can:

1) Set up and customise WordPress blogs.
2) Design custom WordPress themes.
3) Develop tailored WordPress plugins.
4) Host the blogs on our servers.
5) Use our SEO expertise to improve traffic to the blog.

FREE Cloud Computing testbed for Python Apps

This is so cool.. Google are beta-testing a totally free hosting and cloud-computing resource called Google App Engine. The caveat is that your hosted app must be written in Python. Python is amazing anyway, and if you don’t know it, now is the perfect time to learn. Check this out for more information about Google App Engine:

They’re giving away a very generous 500MB disk space and enough processing power to serve 5 million pages a month. Awesome!

Bacula Scheduling

Bacula is a great open-source distributed backup program for Linux/UNIX systems. It is separated into three main components:

  • One ‘Director’ – which sends messages to the other components and co-ordinates the backup
  • One or more ‘File Daemons’ – which ‘pull’ the data from the host they are installed on.
  • One or more ‘Storage Daemons’ – which ‘push’ the data taken from the file daemons into a type of archival storage, e.g. backup tapes, a backup hard disc, etc

I found it extremely versatile yet very complicated to configure. Before you configure it you have to decide on a backup strategy: what you want to back up, why you want to back it up, how often you want to back it up, and how you are going to off-site/preserve the backups.

I can’t cover everything about Bacula here, so I thought I’d concentrate on scheduling. You will need to understand quite a lot about the basics of Bacula before you’ll be able to understand scheduling, so I recommend reading up on the basics first.

I had the most problems with the scheduling. In the end I chose to adopt this schedule:

  • Monthly Full backup, with a retention period of 6 months, and a maximum number of backup volumes of 6.
  • Weekly Differential backup against the monthly full backup, with a retention period of 1 month, and a maximum number of backup volumes of 4.
  • Daily Incremental backup against the differential backup, with a retention period of 2 weeks, and a maximum number of backup volumes of 14.

This means that there will always be 2 weeks of incremental backups to fall back on which depend on a weekly differential which depends on a monthly full. This strategy aims to save as much space as possible – there is no redundancy. This means that if a backup fails, especially a monthly or weekly, it will have to be re-run immediately.
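
To sanity-check a cycle like this, I find it helps to write the rule down as code. Here is a minimal Python 3 sketch, assuming exactly one job per night: a Full on the first Monday of the month, a Differential on every other Monday, and an Incremental on all other days. The function name is my own and this is only a model of the schedule, not anything Bacula itself runs.

```python
import datetime

def backup_level(day):
    """Return the backup level that runs on the given date."""
    if day.weekday() == 0:       # Monday
        if day.day <= 7:         # the first Monday always falls on day 1-7
            return "Full"
        return "Differential"
    return "Incremental"

print(backup_level(datetime.date(2009, 3, 2)))   # a first Monday
print(backup_level(datetime.date(2009, 3, 9)))   # a later Monday
print(backup_level(datetime.date(2009, 3, 10)))  # a Tuesday
```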

The backup volumes will cycle using this method; they will be reused once the maximum volume limits are hit. Also, if you run a backup job from the console, it will revert to the ‘Default’ pool, so you will have to explicitly select either the daily incremental, weekly differential or the monthly full pool.

Here is my director configuration:

Job {
  Name = "Backup def"
  Type = Backup
  Client = localhost-fd
  FileSet = "Full Set"
  Storage = localhost-sd
  Schedule = "BackupCycle"
  Messages = Standard
  Pool = Default
  Full Backup Pool = Full-Pool
  Incremental Backup Pool = Inc-Pool
  Differential Backup Pool = Diff-Pool
  Write Bootstrap = "/var/bacula/working/Client1.bsr"
  Priority = 10
}

# Full on the first Monday, Differential on the other Mondays,
# Incremental on every other day, so only one level runs per night.
Schedule {
  Name = "BackupCycle"
  Run = Level=Full Pool=Full-Pool 1st mon at 1:05
  Run = Level=Differential Pool=Diff-Pool 2nd-5th mon at 1:05
  Run = Level=Incremental Pool=Inc-Pool tue-sun at 1:05
}

# This is the default backup stanza, which always gets overridden by one of the other Pools, except when a manual backup is performed via the console.
Pool {
  Name = Default
  Pool Type = Backup
  Recycle = yes                     # Bacula can automatically recycle Volumes
  AutoPrune = yes                   # Prune expired volumes
  Volume Retention = 1 week         # one week
}

# Monthly full backups: 6 months retention, 6 volumes
Pool {
  Name = Full-Pool
  Pool Type = Backup
  Recycle = yes           # automatically recycle Volumes
  AutoPrune = yes         # Prune expired volumes
  Volume Retention = 6 months
  Maximum Volume Jobs = 1
  Label Format = "Monthly-Full-${Year}-${Month:p/2/0/r}-${Day:p/2/0/r}-${Hour:p/2/0/r}-${Minute:p/2/0/r}"
  Maximum Volumes = 6
}

# Daily incremental backups: 2 weeks retention, 14 volumes
Pool {
  Name = Inc-Pool
  Pool Type = Backup
  Recycle = yes           # automatically recycle Volumes
  AutoPrune = yes         # Prune expired volumes
  Volume Retention = 14 days
  Maximum Volume Jobs = 1
  Label Format = "Daily-Inc-${Year}-${Month:p/2/0/r}-${Day:p/2/0/r}-${Hour:p/2/0/r}-${Minute:p/2/0/r}"
  Maximum Volumes = 14
}

# Weekly differential backups: 1 month retention, 4 volumes
Pool {
  Name = Diff-Pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 1 month
  Maximum Volume Jobs = 1
  Label Format = "Weekly-Diff-${Year}-${Month:p/2/0/r}-${Day:p/2/0/r}-${Hour:p/2/0/r}-${Minute:p/2/0/r}"
  Maximum Volumes = 4
}

I also recommend the excellent O’Reilly book – “Backup and Recovery” which comprehensively covers backups, and has a chapter on Bacula.

Linux under Hyper-V

This is an overview of current Linux support under Hyper-V, the free Windows Server 2008 virtualisation product.

As you probably know, virtual servers allow the emulation of hardware in software. So you have a single physical ‘virtual server’. This virtual server emulates the physical hardware for several ‘virtual machines’ which sit on top of the virtual server. As far as the operating system on the virtual machine is concerned, it doesn’t notice anything different at all – it thinks it is running on a full set of dedicated hardware. However in reality, the virtual server is sharing its real physical resources amongst the collection of virtual machines, assigning for example 3GB of its memory to virtual machine A, and 1GB to virtual machine B.

Hyper-V requires a package called the ‘integration components’ to be installed on each virtual machine. This makes it easier for the virtual machine operating system kernel to talk with the virtual server, and it speeds up and increases the reliability of the emulation of virtual machine hardware by the virtual server.

Hyper-V supports the Xen virtualisation layer. Xen is a Linux-only virtualisation platform that requires a patched kernel on the virtual machine in order for the virtual server to communicate with the virtual machine.

The only officially supported distribution is SuSE Enterprise Linux 10, Service Packs 1 + 2. However, because of the way Hyper-V works, any Xen kernel can in theory be patched to run under Hyper-V.

In fact, this is exactly what you have to do in order to get any Linux distro fully working under the Hyper-V virtual server. So:

  1. Download the kernel source for the Xen kernel for your distro
  2. Patch it with the Hyper-V integration services patch
  3. Compile the kernel

.. and bingo – you should have Linux ‘fully supported’ by Hyper-V.

Now it remains to be seen whether there are any problems with running Linux under Hyper-V in a live environment. I know that we encountered multiple problems with Linux under VMWare ESX Server – and VMWare is the most mature virtualisation product available.

Those problems included network interfaces dropping packets and the virtual machine system clock ‘drifting’ – running too fast or slow. These are NOT problems you want in a live environment, and so it remains to be seen whether Hyper-V can support Linux with the same ease as it supports Windows OSs.

Stanford Engineering for Everyone

The Stanford engineering department, often regarded as the best in the world for computer science education, has made its core CS curriculum free for anyone with an internet connection. There are some catches, e.g. you don’t get your assignments marked and you have no contact with the lecturer, but all the same it is a great resource. The material is very high quality: professionally filmed lectures and a full complement of handouts and course notes. It does not even assume knowledge of programming; it teaches you right from the basics.

If I were trying to teach myself CS again, these courses would be an ideal place to start:

Stanford Engineering Everywhere

Automated Emails on Committing to a Subversion Repository Using Python

At work I’ve written a couple of scripts that send out emails to the appropriate project team when someone checks in a commit to the project subversion repository. Here are the details.

Firstly, you will need a subversion hook set up on post-commit. The post-commit hook needs to be located in SVNROOT/YOURPROJECT/hooks where YOURPROJECT is your svn project name, and SVNROOT is the root directory where you are storing the data files for your subversion repository.

Substitute projectmember1,projectmember2 etc.. in the post-commit script below for the email addresses of the people to be notified when someone commits a change to the project.


# The post-commit hook is invoked after a commit.  Subversion runs
# this hook by invoking a program (script, executable, binary, etc.)
# named 'post-commit' (for which this file is a template) with the
# following ordered arguments:
#   [1] REPOS-PATH   (the path to this repository)
#   [2] REV          (the number of the revision just committed)
# The default working directory for the invocation is undefined, so
# the program should set one explicitly if it cares.
# Because the commit has already completed and cannot be undone,
# the exit code of the hook program is ignored.  The hook program
# can use the 'svnlook' utility to help it examine the
# newly-committed tree.
# On a Unix system, the normal procedure is to have 'post-commit'
# invoke other programs to do the real work, though it may do the
# work itself too.
# Note that 'post-commit' must be executable by the user(s) who will
# invoke it (typically the user httpd runs as), and that user must
# have filesystem-level permission to access the repository.
# On a Windows system, you should name the hook program
# 'post-commit.bat' or 'post-commit.exe',
# but the basic idea is the same.
# The hook program typically does not inherit the environment of
# its parent process.  For example, a common problem is for the
# PATH environment variable to not be set to its usual value, so
# that subprograms fail to launch unless invoked via absolute path.
# If you're having unexpected problems with a hook program, the
# culprit may be unusual (or missing) environment variables.
# Here is an example hook script, for a Unix /bin/sh interpreter.
# For more examples and pre-written hooks, see those in
# the Subversion repository at
# and

REPOS="$1"
REV="$2"
EMAILS="projectmember1,projectmember2"

LOG="svnlook log $REPOS -r$REV"
/usr/bin/python /usr/local/scripts/ $EMAILS $REPOS `whoami` $REV "`$LOG`" > /tmp/svncommitemail.log

Secondly you will need this python script:

Edit the ‘fromaddr’ variable to equal your configuration manager’s email address (probably your own!).

#!/usr/bin/env python

import smtplib
import sys
import datetime

def sendMail(subject, body, TO, FROM):
    # Build a minimal message: headers, a blank line, then the body.
    HOST = "localhost"
    BODY = "\r\n".join((
        "From: %s" % FROM,
        "To: %s" % TO,
        "Subject: %s" % subject,
        "",
        body,
        ))
    server = smtplib.SMTP(HOST)
    server.sendmail(FROM, [TO], BODY)
    server.quit()

def send(alias,rev,username,repo,changelog):

    today = datetime.date.today()
    fromaddr = 'Configuration.Management@YOURCOMPANY.COM'
    subject = """Subversion repository """+repo+""" changed by """+username+""" on """+str(today)

    # 'alias' is a comma-separated list of addresses; send one email to each.
    aliases = alias.split(',')
    for address in aliases:
        body = """
This is an automated email to let you know that subversion user: '"""+username+"""' has updated repository """+repo+""" to version """+rev+""". The changelog (might be empty) is recorded as:

"""+changelog+"""

Please contact subversion user: '"""+username+"""' in the first instance if you have any questions about this commit.

Configuration Management
"""
        sendMail(subject, body, address, fromaddr)

argv = sys.argv
argc = len(sys.argv)
if argc == 6:
    alias = argv[1]
    repo = argv[2]
    username = argv[3]
    rev = argv[4]
    changelog = argv[5]
    send(alias, rev, username, repo, changelog)
else:
    print("Usage: "+argv[0]+" <emails> <repo> <username> <rev> <changelog>")

Now once you have this all in place, test it by creating a test file in the repository, and committing it. If you issue a “tail -f /tmp/svncommitemail.log” on the box where your subversion project repository is located, you should be able to see what happens when people commit to the project repository.

If it is set up correctly, you will see emails being fired off to all interested parties with information about the svn commit.
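
If you are on Python 3, the standard email package can build the headers for you instead of joining strings by hand. Here is a rough sketch of that variant. The function name and the example addresses are placeholders of mine, and the wording of the message is shortened compared to the script above.

```python
from email.message import EmailMessage

def build_commit_mail(sender, recipient, repo, username, rev, changelog):
    """Return an EmailMessage describing one commit."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Subversion repository %s changed by %s" % (repo, username)
    msg.set_content(
        "User '%s' has updated repository %s to version %s.\n\n"
        "Changelog (might be empty):\n%s\n" % (username, repo, rev, changelog))
    return msg

mail = build_commit_mail("cm@example.com", "dev@example.com",
                         "myproject", "alice", "42", "Fixed the build")
print(mail["Subject"])
```

To actually send it, you would pass the message to smtplib.SMTP("localhost").send_message(mail), assuming a mail server is listening on localhost as in the script above.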

Scraping Wikipedia Information for music artists, Part 2

I’ve abandoned the previous Wikipedia scraping approach for, as it was unreliable and didn’t pinpoint the right Wikipedia entry – ie: a band called ‘Horses’ would pull up a Wikipedia bio on the animal – which doesn’t look very professional. So instead, I have used the Musicbrainz API to retrieve some information on the artist; the homepage URL, the correct Wikipedia entry, and any genres/terms the artist has been tagged with.

It would be simple to extend this to fetch the actual bio from a site like (which provides XML-tagged Wikipedia data), now that you always have the correct Wikipedia page reference to fetch the data from.

(You will need to download the Musicbrainz python library to use this code):

import sys
import logging
from musicbrainz2.webservice import Query, ArtistFilter, WebServiceError
import musicbrainz2.webservice as ws
import musicbrainz2.model as m

class scrapewiki2(object):

  def __init__(self):
    pass

  def getbio(self,artist):

    art = artist
    logger = logging.getLogger()

    q = Query()

    try:
      # Search for artists matching the given name. Limit the results
      # to the single best match. The offset parameter could be used to page
      # through the results.
      f = ArtistFilter(name=art, limit=1)
      artistResults = q.getArtists(f)
    except WebServiceError, e:
      print 'Error:', e
      return ''

    # No error occurred, so display the results of the search. It consists of
    # ArtistResult objects, where each contains an artist.

    if not artistResults:
      print "WIKI SCRAPE - Couldn't find a single match!"
      return ''

    for result in artistResults:
      artist = result.artist
      print "Score     :", result.score
      print "Id        :", artist.id
      try:
        print "Name      :", artist.name.encode('ascii')
      except Exception, e:
        print 'Error:', e

    try:
      # Fetch the full artist record. The result should include all official
      # albums, plus the URL relations and tags that are read below.
      inc = ws.ArtistIncludes(
        releases=(m.Release.TYPE_OFFICIAL, m.Release.TYPE_ALBUM),
        urlRelations=True, tags=True)
      artist = q.getArtistById(artist.id, inc)
    except ws.WebServiceError, e:
      print 'Error:', e
      return ''

    # Get the artist's relations to URLs (m.Relation.TO_URL) having the
    # Wikipedia relation type. Note that there could
    # be more than one relation per type. We just use the first one.
    wiki = ''
    urls = artist.getRelationTargets(m.Relation.TO_URL, m.NS_REL_1+'Wikipedia')
    if len(urls) > 0:
      print 'Wikipedia:', urls[0]
      wiki = urls[0]

    # List discography pages for an artist.
    disco = ''
    for rel in artist.getRelations(m.Relation.TO_URL, m.NS_REL_1+'Discography'):
      disco = rel.targetId
      print disco

    tags = artist.tags

    # Build a small HTML snippet. The link markup is an assumption, and the
    # discography URL stands in for the artist's main site.
    toret = ''
    if wiki:
      toret = '<a href="'+wiki+'">'+art+' Wikipedia Article</a>\n'
    if disco:
      toret = toret + '<a href="'+disco+'">'+art+' Main Site</a>\n'
    toret = toret + '<br/>Tags: '+(','.join(t.value for t in tags))+'\n'
    return toret

sw2 = scrapewiki2()
# unit test
print sw2.getbio('Blur')
print sw2.getbio('fatboy slim')

Apologies to the person that left several comments on the previous wikipedia scraping post, I have disabled comments temporarily for now due to heavy amounts of spam, but you can contact me using the following address: (substitute first two @s for ‘.’s ). I also hope this post answers your question.

Character encoding fix with PHP, MySQL 5 and ubuntu-server

For some reason, under ubuntu-server, my default MySQL 5 character encoding was latin1. This caused no end of problems with grabbing data from the web, which was not necessarily in latin1 characterset.

If you are ever in this situation, I suggest you handle everything as UTF-8. That means setting the following lines in my.cnf:


If you already have tables in your database that you have created, and they have defaulted to the latin1 charset, you’ll be able to tell by looking at the mysqldump SQL:

.. some col declarations..

See here this artists table has been set to default charset of latin1 by mysql. This is bad. So what I recommend is:

1. Dump the full database structure + data to a file using mysqldump
2. Substitute ‘latin1’ for ‘utf8’ universally on that file using your favourite text editor
3. Import the resultant file into mysql using the mysql -uroot -p -Dyourdb < dump.sql method

Then everything will be in utf8, and your character encoding issues will be solved 🙂
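
If you would rather script step 2 than do it in a text editor, here is a minimal Python 3 sketch. The function name and file names are my own. It works on raw bytes so it makes no assumption about the encoding of the dump, and like the text-editor approach it replaces every occurrence of ‘latin1’, including any that happen to appear inside your data.

```python
def convert_dump(src_path, dst_path):
    """Copy a mysqldump file, replacing every 'latin1' with 'utf8'."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(b"latin1", b"utf8"))
```

You would call it as convert_dump("dump.sql", "dump-utf8.sql") and then import the second file in step 3.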

Scraping artists bios off of Wikipedia

I’ve been hacking away at and I’ve been looking for a way of automatically sourcing biographical information from artists, so that visitors are presented with more information on the event.

The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab artist bios, event listings, youtube videos and flickr pictures of the currently playing artist. I was reading through the mashTape code, and then found this posting by its developer, which helpfully provided the exact method I needed.

I then hacked up two versions of the code, a PHP version using simpleXML:

    if($ar[2] == ''){
      $wikikey = $ar[4]; // more than likely to be the wikipedia page
    }else{
      return ""; // nothing on wikipedia
    }
    $url = "$wikikey";
    $x = file_get_contents($url);
    $s = new SimpleXMLElement($x);
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
    return $b[0];

and a pythonic version using the amara XML library (has to be installed separately):

import amara
import urllib2
from urllib import urlencode

def getwikikey(band):
  url = ""+band+"%22&";
  print url
  f = urllib2.urlopen(url)
  doc = amara.parse(f)
  url = str(doc.ResultSet.Result[0].Url)
  return url.split('/')[4]

def uurlencode(text):
   """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
   blah = urlencode({'u':text})
   blah = blah[2:]
   return blah

def getwikibio(key):
  url = ""+str(key);
  print url
  try:
    f = urllib2.urlopen(url)
  except Exception, e:
    return ''
  doc = amara.parse(f)
  b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
  try:
    r = str(b[0])
  except Exception, e:
    return ''
  return r

def scrapewiki(band):
  try:
    key = getwikikey(uurlencode(band))
  except Exception, e:
    return ''
  return getwikibio(key)

#unit test
#print scrapewiki('guns n bombs')
#print scrapewiki('diana ross')

There we go, artist bio scraping from wikipedia.
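
If you don't have amara to hand, the key step (picking out the English abstract) can be sketched in Python 3 with just the standard library. The namespace URI, the sample XML and the function name here are made up for illustration; the real feed will use its own namespace for the p: prefix.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Hypothetical namespace standing in for whatever the p: prefix maps to.
P_NS = "http://example.org/property/"

def english_abstract(xml_text):
    """Return the text of the first abstract tagged xml:lang='en', or ''."""
    root = ET.fromstring(xml_text)
    for node in root.iter("{%s}abstract" % P_NS):
        if node.get(XML_LANG) == "en":
            return node.text or ""
    return ""

sample = """<doc xmlns:p="http://example.org/property/">
  <p:abstract xml:lang="de">Eine Band.</p:abstract>
  <p:abstract xml:lang="en">A band.</p:abstract>
</doc>"""
print(english_abstract(sample))  # prints A band.
```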

adExcellence Exam passed

I passed the adExcellence exam first time.. woo! It wasn’t that difficult really.

“David Craddock of iCrossing is accredited as an official Microsoft adExcellence Member. A Microsoft adExcellence Member has completed comprehensive online training on managing Microsoft adCenter search engine marketing campaigns and has demonstrated expert knowledge by passing the Microsoft adExcellence accreditation exam.”

As of 21/3/08, I’m somehow also now #1 on for the keyword “adExcellence exam”.. if that’s what you googled for, you probably want the adExcellence main site instead. Or use Live Search.

Yahoo! Pipes

I have just seen Yahoo! Pipes, and am convinced this is going to change the web. For real.

Data source sites will become ‘content providers’, data will be aggregated and filtered from multiple content providers, either by the user or by ‘intermediary’ sites. The user will be able to choose his ‘data view’ of the content on the internet, just as Google is currently doing.

This is fascinating stuff if you’re involved in the web industry.

A poor man’s VMWare Workstation: VMWare Server under Ubuntu 7.10 + VMWare Player under Windows XP

I finally set up my Dell Latitude D630 laptop the way I wanted it last night, and thought I’d do a quick writeup about it. Here is the partition table:

  1. A 40GB Windows XP partition, with VMWare Player installed, which I will be using for Windows applications that don’t play well in virtualised mode (eg media applications). I will also be using it as the main platform for running VMs.
  2. A basic 5GB root + 1.4GB swap 7.10 Ubuntu server partition, with VMWare Server installed (for creating, advanced editing and performing network testing on VMs). I used these VMWare server on Ubuntu 7.10 tutorials.
  3. A 36GB NTFS partition for storing VMs
  4. A 26GB NTFS media partition for media I want to share between VMs and the two operating systems on the disc.

We use VMWare servers at work to host our infrastructure, so this setup will be very useful for me. I can now:

  1. Take images off the servers at work and bring them up, edit them and test their network interactions under my local VMWare Server running on my Linux install.
  2. From within my windows install, I can bring up a Linux VM and use Windows and Linux side by side.

Brighton Barcamp2

I will be attending Brighton Barcamp 2 on the weekend of the 14th March, and presenting on a new web project I’ve been working on.

See: and for more info.

Update: Brighton Barcamp 2 is now over.

This was really interesting, and I learned a huge amount in a very short amount of time. Thanks to everyone who talked to me. I’ll definitely be attending future Barcamps.