Downloading the latest version of Nutch is easy, but I had a few issues getting it set up on my Red Hat server. There are a couple of gotchas along the way, and it doesn’t work exactly as specified in the tutorial.

  1. Download & install Java – super simple: yum install java
  2. wget & unzip the latest version of Nutch
  3. I had issues with the JAVA_HOME environment variable when trying to follow the tutorial’s example on crawling a website. The error I got was “/usr/loca/jdk/bin/java: No such file or directory”. The problem here was twofold: 1) I didn’t have JAVA_HOME set in my environment variables, and 2) around line 118 in the bin/nutch file there’s a reference to $JAVA_HOME/bin/java – the bold part of which seems unnecessary and should be deleted
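
For the first half of that problem, here is a minimal sketch of one way to set JAVA_HOME before running bin/nutch. It isn't from the tutorial: the helper name java_home_from_bin is made up for illustration, and it assumes GNU readlink (standard on Red Hat) and that java is already on your PATH. Since bin/nutch appends /bin/java to $JAVA_HOME, the variable needs to point at the directory above bin:

```shell
#!/bin/sh
# Hypothetical helper: derive a JAVA_HOME candidate from the full path of a
# java binary by stripping the trailing /bin/java.
java_home_from_bin() {
    printf '%s\n' "${1%/bin/java}"
}

# Only attempt detection when java is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    # readlink -f follows symlinks such as /usr/bin/java -> /etc/alternatives/java;
    # fall back to the unresolved path if readlink -f is unavailable.
    java_bin="$(readlink -f "$(command -v java)" 2>/dev/null || command -v java)"
    JAVA_HOME="$(java_home_from_bin "$java_bin")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

Run it with a leading dot (source it) in the shell you'll launch Nutch from, or copy the export into your profile, so the variable is still set when bin/nutch starts.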

 

The below is an internal email sent out today at OMD which I thought would be nice to share with the wider SEO community.

————

Today marks one year since we lost a highly valued colleague and friend on the OMD SEO team, Jaamit Durrani. The dedication, passion, humour & intelligence he showed while he was with us have had a lasting effect on us personally, and have also been crucial to the SEO team’s growth from a very small team last year to a highly successful team of 15+ this year; a year in which we’ve won our first ever major SEO-only client, scored 100% in client feedback and continued to build a better offering month on month.

As a small tribute, and hopefully an insight for those who weren’t lucky enough to meet or work with him, this week’s links highlight some of his best blog posts, as well as some of the tributes posted online: Read more »

Get HTML returned from HTML::TreeBuilder::XPath

Perl’s HTML::TreeBuilder::XPath is a great module for parsing HTML documents without regular expressions. However, it returns text content by default, which is not always what you want when you’re doing advanced HTML processing. The documentation on CPAN doesn’t mention this, but if you want to get the HTML content out, just use “findnodes” and “->shift->as_HTML” in the way illustrated below:

my $value = $tree->findnodes(q{//div[@class='crumbs']})->shift->as_HTML;

This post purely reflects my own personal opinion, and does not reflect those of my employer or colleagues. For those faint of heart, warning: contains traces of black hat material.

So Google has taken our keyword data away, probably for good. What next for SEO?

I think we have to assume this rollout will eventually happen everywhere, for all users. Why? What’s in it for Google?

  1. Less competition – the data provided on search by firms such as Experian Hitwise, Comscore and Quantcast becomes far less valuable, meaning the only media company that can authoritatively provide keyword data is, you guessed it, Google.
  2. Less spam – from a purely objective point of view, I do think this will result in much less spam. Sites such as Mahalo, Experts Exchange, etc. will all suffer as pages generated purely on the basis of search volume die a death, and Google’s results will get better for it.
  3. Fewer SEOs – I don’t think it will kill the industry, but measurement and keyword research become a lot harder. The focus will probably get more technical on-site and more social off-site. Google’s not known for its love of the SEO industry so this is probably a nice side benefit for them.

It’s easy to despair and mourn the death of SEO (again) but as long as organic results exist that’s pap. Read more »

Excel’s built-in web features are pretty frustrating when you want to do more with the web than import a static HTML table to a predefined set of cells.

I’ve often wanted to be able to update the contents of a cell based on dynamic parameters passed into a URL, but have never found a decent, easy way of doing this. The official Office website shows you how to do this the Microsoft way, but lo and behold that doesn’t actually translate to real-world uses very well.

Say, for example, you want to fill a column of cells with the ranking for a given list of keywords. A function similar to that shown below (where the URL could potentially be defined in the E column) would be very useful: Read more »