JSoup Method for Page Scraping


I’m currently writing a web scraper for the forums on Gaia Online. I used to develop web scrapers in Python with the very handy BeautifulSoup library. Java has an equivalent called JSoup.

Here I have written a class that is extended by each class in my project that needs to scrape HTML. This ‘Scraper’ class fetches the HTML and converts it into a JSoup tree that can be navigated to pick out the data. It advertises itself as a ‘web spider’ type of web agent, and it waits a random 0-7 seconds before fetching each page so that it can’t be used to overload a web server. It also converts the entire page to ASCII, which may not be the best approach for multi-language web pages, but it has certainly made scraping the English-language site Gaia Online much easier.

Here it is:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.text.Normalizer;
import java.util.Random;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
* Generic scraper object that contains the basic methods required to fetch
* and parse HTML content. Extended by other classes that need to scrape.
*
* @author David
*/
public class Scraper {

        public String pageHTML = ""; // the HTML for the page
        public Document pageSoup; // the JSoup scraped hierarchy for the page


        public String fetchPageHTML(String URL) throws IOException{

            // this makes sure we don't scrape the same page twice
            if(!this.pageHTML.isEmpty()){ // compare contents, not references
                return this.pageHTML;
            }

            System.getProperties().setProperty("httpclient.useragent", "spider");

            Random randomGenerator = new Random();
            int sleepTime = randomGenerator.nextInt(7000);
            try{
                Thread.sleep(sleepTime); //sleep for x milliseconds
            }catch(InterruptedException e){
                // only fires if this thread is interrupted by another thread, should never happen
            }

            String pageHTML = "";

            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(URL);

                HttpResponse response = httpclient.execute(httpget);
                HttpEntity entity = response.getEntity();

                if (entity != null) {
                    InputStream instream = entity.getContent();
                    String encoding = "UTF-8";

                    StringWriter writer = new StringWriter();
                    IOUtils.copy(instream, writer, encoding);

                    pageHTML = writer.toString();
                    
                    // convert entire page scrape to ASCII-safe string
                    pageHTML = Normalizer.normalize(pageHTML, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");

                }

                return pageHTML;
        }

        public Document fetchPageSoup(String pageHTML) throws FetchSoupException{
            
            // this makes sure we don't soupify the same page twice
            if(this.pageSoup != null){
                return this.pageSoup;
            }
            
            if(pageHTML.equalsIgnoreCase("")){
                throw new FetchSoupException("We have no supplied HTML to soupify.");
            }

            Document pageSoup = Jsoup.parse(pageHTML);

            return pageSoup;
        }
}

Each class then subclasses this Scraper class and adds the actual drilling down through the JSoup hierarchy tree to get what is required:

...
this.pageHTML = this.fetchPageHTML(this.rootURL);
this.pageSoup = this.fetchPageSoup(this.pageHTML);

// get the first matching section on the page
Element forumPageLinkSection = this.pageSoup.getElementsByAttributeValue("id","forum_hd_topic_pagelinks").first();
// get all the links in the above section
Elements forumPageLinks = forumPageLinkSection.getElementsByAttribute("href");
...

I’ve found that this method provides a simple and effective way of scraping pages and using the resultant JSoup tree to pick out important data.
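The ASCII conversion step in the Scraper class can be tried out in isolation. Here is a minimal Python sketch of the same idea (decompose with NFD, then drop whatever is not ASCII). The function name to_ascii is my own and is not part of the Java class:

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters (NFD), then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # After decomposition an accented letter is a base letter plus a
    # combining mark; encoding with errors="ignore" discards the mark.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café déjà vu"))  # prints "Cafe deja vu"
```

Characters that have no ASCII base letter (Japanese text, for example) are removed entirely, which is why this approach suits an English-language site better than a multi-language one.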

Disabling Control-Enter and Control-B shortcut keys in Outlook 2003

At work, I still have to use Windows XP and Outlook 2003. I don’t particularly mind this, except when I am drafting an email and accidentally press Control-B instead of Control-V. Control-B will go ahead and send your partially composed email, resulting in some embarrassment as you have to tell everyone to disregard it.

So I wanted to remove the ‘send email’ shortcut keys in Outlook 2003. There are two ways of doing this. One involves editing your group policy, which is something only my IT administration team can do, and I didn’t want to have to involve them. The other is to make a change to your registry, which I will describe here.

  1. Open up regedit, and browse to the following registry key: HKEY_CURRENT_USER -> Software -> Policies -> Microsoft -> office -> 11.0 -> outlook
  2. Then create a new key called: “DisabledShortcutKeysCheckBoxes”.
  3. Under that key, create two new String Values:
    Name: CtrlB Data: 66,8
    Name: CtrlEnter Data: 13,8
  4. Then restart Outlook and those keys will be disabled.
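
The steps above could also be captured as a .reg file, so the change can be reapplied without clicking through regedit. This is an untested sketch in Python that only builds the text of such a file from the key path and the two values listed above. The function name build_reg_file is my own, and the printed output would need saving as a .reg file and importing by hand:

```python
# Key from steps 1 and 2, and the two string values from step 3.
KEY_PATH = (
    r"HKEY_CURRENT_USER\Software\Policies\Microsoft\office"
    r"\11.0\outlook\DisabledShortcutKeysCheckBoxes"
)
DISABLED_SHORTCUTS = {"CtrlB": "66,8", "CtrlEnter": "13,8"}


def build_reg_file(key_path, values):
    """Return the text of a .reg file that sets string values under key_path."""
    lines = ["Windows Registry Editor Version 5.00", "", "[%s]" % key_path]
    for name, data in values.items():
        lines.append('"%s"="%s"' % (name, data))
    return "\r\n".join(lines) + "\r\n"


print(build_reg_file(KEY_PATH, DISABLED_SHORTCUTS))
```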


Directory names not visible under ls? Change your colours.

There is a problem I frequently encounter on Red Hat/Fedora/CentOS systems with the output of the ls command. Under those distributions, the default setup displays directories in a very dark colour. If you usually use a white foreground and a black background in your terminal client (such as PuTTY), you will struggle to read the names of directories on Red Hat-based distributions.

There are two solutions that I have used:

1. Change the colour settings in PuTTY

If you use PuTTY, ticking ‘Use System Colours’ in its colour settings changes the “white foreground, black background” default into a “white background, black foreground”. This way you can at least read the console properly, which is good for a quick fix. You can also save these settings in PuTTY as the default for the host that you are connecting to, or even for all hosts.

2. Change the LS_COLORS directive temporarily in the shell.

Alternatively, you can ask the ls command to display directories and other entries in colours that you specify. You could add these lines to the bottom of your .bashrc to make the changes permanent. If you are using a shared machine, just copy and paste the following lines into the terminal and they will change the colours to a reddish, more visible set until you log out:

alias ls='ls --color' # just to make sure we are using coloured ls
LS_COLORS='di=94:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90'
export LS_COLORS

(Original source for this particular LS_COLORS combo: http://linux-sxs.org/housekeeping/lscolors.html)
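
To see what that LS_COLORS string actually sets, it helps to split it into its entries. This is a small Python sketch that parses the value above; the parse_ls_colors helper and the KEY_NAMES table are my own additions for illustration:

```python
LS_COLORS = "di=94:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90"

# What the two-letter keys used above stand for.
KEY_NAMES = {
    "di": "directory", "fi": "regular file", "ln": "symbolic link",
    "pi": "named pipe", "so": "socket", "bd": "block device",
    "cd": "character device", "or": "orphaned symlink",
    "mi": "missing file", "ex": "executable",
}


def parse_ls_colors(value):
    """Split an LS_COLORS string into a {key: colour code} dictionary."""
    entries = {}
    for item in value.split(":"):
        if not item:
            continue  # tolerate a trailing or doubled colon
        key, _, code = item.partition("=")
        entries[key] = code
    return entries


for key, code in parse_ls_colors(LS_COLORS).items():
    print("%-18s -> colour code %s" % (KEY_NAMES.get(key, key), code))
```

Keys that start with *. (such as *.rpm) match file extensions rather than file types, so they are printed as-is.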

Find large files by using the OSX commandline

To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal:

sudo find / -size +500000 -print

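This prints the paths of every file larger than 500000 blocks. find counts -size in 512-byte blocks by default, so that is about 256MB; use +1000000 if you only want files over roughly 500MB. You can then go through them and delete them individually by typing rm "<file path>", although there is no undelete, so make sure you know you won’t miss them.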
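
If you would rather give the threshold in bytes and see each file's size alongside its path, the same search can be sketched in Python. The find_large_files function is my own illustration, not a replacement for find, and it skips symbolic links and anything it cannot read:

```python
import os


def find_large_files(root, min_bytes):
    """Yield (path, size) for regular files under root larger than min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # don't follow symbolic links
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable, or deleted while we were scanning
            if size > min_bytes:
                yield path, size
```

For example, looping over find_large_files(os.path.expanduser("~"), 500 * 1000 * 1000) would list everything over 500MB in your home directory.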

Finding files in Linux modified between two dates

Use the ‘touch’ command to create two blank files with last-modified dates that you specify: one dated at the start of the range you want, and the second dated at the end. Then reference those two files in your find command:

touch -t 200604141130 /tmp/temp
touch -t 200604261630 /tmp/ntemp
find /data/ -newer /tmp/temp -and ! -newer /tmp/ntemp
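
The same date-range check can be done without the temporary marker files. This is a minimal Python sketch; the modified_between function is my own, and unlike find it only reports files, not directories:

```python
import os
from datetime import datetime


def modified_between(root, start, end):
    """Return paths of files under root with start < mtime <= end."""
    lo, hi = start.timestamp(), end.timestamp()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # unreadable, or deleted while we were scanning
            if lo < mtime <= hi:
                matches.append(path)
    return matches
```

Calling modified_between("/data/", datetime(2006, 4, 14, 11, 30), datetime(2006, 4, 26, 16, 30)) covers the same range as the two touch timestamps above.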

Writing simple email alerts in PHP with MagpieRSS

I wrote an email alerter that sends me an email whenever the upcoming temperature may dip below freezing. It uses the Magpie RSS reader to pull down a three-day weather forecast that the BBC weather site provides for my area in RSS form. It then parses this forecast and determines whether either today’s or tomorrow’s temperature may dip below freezing. If it might, it emails me a warning.

I scheduled this script to run every day by adding it as a daily cron job on my web host. You can set this up on any web host that supports cron jobs.

...
        if(count($rss->items) != 3){
                $message .= 'Error: problem parsing BBC weather feed';
        }
        $i=0;
        foreach ($rss->items as $item) {
                $href = $item['link'];
                $title = $item['title'];
                preg_match('/Min Temp:.+?-*\d*/',$title,$mintemp);
                preg_match('/Max Temp:.+?-*\d*/',$title,$maxtemp);
                $mintemp[0] = str_replace('Min Temp: ','',$mintemp[0]);
                $maxtemp[0] = str_replace('Max Temp: ','',$maxtemp[0]);
                $mins[$i] = (int)$mintemp[0];
                $maxs[$i] = (int)$maxtemp[0];
                $i++;
        }

        // freezing warnings

        if($mins[0] < 0){
                $message .= "Today's temperature in W3 may go below freezing, anything down to ".$mins[0];
        }
        if($mins[1] 

You can right click on this link and ‘save as’ to download the script.
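
The parsing and warning logic is easy to try on its own. Here is a rough Python sketch of the same idea. The sample titles are made up to resemble the "Min Temp: ..." format the script matches against, not copied from the real feed, and the function names are my own:

```python
import re


def min_temp(title):
    """Pull the minimum temperature out of a forecast title, or None."""
    match = re.search(r"Min Temp:\s*(-?\d+)", title)
    return int(match.group(1)) if match else None


def freezing_warning(titles):
    """Build a warning message from today's and tomorrow's titles."""
    lines = []
    for label, title in zip(["Today", "Tomorrow"], titles):
        low = min_temp(title)
        if low is not None and low < 0:
            lines.append(
                "%s's temperature may go below freezing, anything down to %d"
                % (label, low)
            )
    return "\n".join(lines)


# Hypothetical feed titles, for illustration only.
sample = [
    "Saturday: sunny, Max Temp: 3°C (37°F), Min Temp: -2°C (28°F)",
    "Sunday: cloudy, Max Temp: 5°C (41°F), Min Temp: 1°C (34°F)",
    "Monday: rain, Max Temp: 7°C (45°F), Min Temp: 4°C (39°F)",
]
print(freezing_warning(sample))
```

An empty result means no warning is needed, so the email can be skipped.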

Converting week numbers to dates

Here is some Python code I adapted from this Stack Overflow post to get the first day of a week specified by a week number. This method includes leap year and summer time differences.

import time
def weeknum(num,year):
    # %W counts weeks from 0 with Monday as the first day, hence num-1; %w of 1 is Monday
    instr = str(year)+" "+str(num-1)+" 1"
    print(time.asctime(time.strptime(instr,'%Y %W %w')))

Executing the code in Python’s IDLE shell shows that the first week of 2009 actually started in 2008, but by the end of that week we are in 2009.
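
On Python 3.8 or later, the standard library can do a similar lookup directly for ISO 8601 week numbers. This is a minimal sketch; the function name is my own, and ISO week numbering is not guaranteed to agree with the %W-based numbering above in every year:

```python
from datetime import date


def first_day_of_iso_week(year, week):
    """Return the Monday that starts the given ISO 8601 week."""
    return date.fromisocalendar(year, week, 1)  # 1 = Monday


print(first_day_of_iso_week(2009, 1))  # prints 2008-12-29
```

This agrees with the 2009 example: ISO week 1 of 2009 begins on Monday 29 December 2008.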

MediaMonkey allows you to transfer music from any computer onto your guest iPhone

MediaMonkey is a popular free media player for Windows. It has a great feature that allows you to transfer music to and from an iPhone that is not registered with your computer. Normally only one iTunes install can be associated with your iPhone, but MediaMonkey gives you another way to transfer music and audio files with a ‘guest’ iPhone. Check it out, it works:

MediaMonkey

Recording Game Videos on Windows 7

This is just a quick note to remind myself how I did this.

  • Hypercam2 is a good, free video recorder that can cope with recording game videos. It’s freely available from http://www.hyperionics.com/hc/downloads.asp; just make sure when you install it that you don’t tick the spyware toolbar installation options.
  • My motherboard has a 5.1 digital soundcard built in. However, the only way I can record off the soundcard is to plug in a standard audio cable from the speaker out (green) to the microphone in (orange).
  • The soundcard switches off the headphone output when it detects a speaker attached to the speaker out, so you have to go to the recording options in Windows 7 and right click on the microphone in. It will give you an option to ‘Monitor this input using the headphones’, which allows you to listen to anything coming into the microphone socket through the headphone socket on the front of my PC.
  • In Hypercam, set the sound to record from the default input device, and set the frame rate to 10/10.
  • Record using the ‘select window to record from’ option, select the game window, and use the F2 button to start and stop the recording.
  • The video will be output in AVI format, but you can transcode or convert it into a QuickTime MOV file for editing in iMovie, or you can use Windows movie editor, which is free and quite good.