One of the truly great things about Perl is CPAN (the Comprehensive Perl Archive Network), which is an immense resource for almost all of the common (and not so common) programming functions you could ever dream of – from the web to graphics and operating system interfaces. Although Python and Ruby are gaining in popularity these days, CPAN is a huge asset to Perl that (as far as I’m aware) has few equals in other languages.

I’ve collected below some of the most useful modules I’ve found from an SEO’s point of view:

1. WWW::Mechanize

WWW::Mechanize is described as “handy web browsing in a Perl object”. It’s an immensely powerful scraping, crawling and HTML parsing tool, and supports cookies, browsing history, proxies, custom headers and more. It’s a subclass of LWP::UserAgent, so many of the functions in that module will also work here. There’s a great FAQ available on CPAN, as well as examples of what you might use it for.

It’s very simple to knock up fairly advanced tools in several lines of code – for example the snippet below will print all of the on-page links from a list of URLs:

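#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my @urls = qw(http://www.bbc.co.uk/ http://searchtalk.co.uk/);

# One browser object is enough; autocheck => 0 stops a failed fetch killing the script
my $mech = WWW::Mechanize->new( autocheck => 0 );

foreach my $url (@urls) {
  $mech->get($url);
  unless ( $mech->success ) {
    warn "Couldn't fetch $url: ", $mech->response->status_line, "\n";
    next;
  }
  # Print every link found on the page
  $mech->dump_links();
}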

The number of things you can do with this module is pretty much limitless – aside from rendering JavaScript, Flash etc., anything you do in your browser can be automated with it. For example, you can create your own APIs into services such as Google Webmaster Tools or Google Insights where the current API options are limited, and there are many other awesome applications that others have built off the back of this module. For further reading I’d recommend the book Spidering Hacks – most of the examples are out of date now, but the concepts are pretty easy to adapt to other websites.

2. HTML::TokeParser

HTML::TokeParser essentially treats an HTML page as a series of “tokens” (start tags, end tags, text and comments), rather than plain text that you run regular expressions over. This makes it a lot more robust in handling invalid or inconsistently formatted HTML, and is closer in concept to how search engines treat HTML pages. A mistake many people make is to push valid HTML as an SEO recommendation, when in reality search engines don’t care: they don’t treat HTML as well-formed XML, and so don’t break when a quotation mark is out of place.
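As a rough sketch of the token approach, the snippet below pulls links and anchor text out of some deliberately sloppy markup. The HTML string and the extract_links helper are made up for illustration; get_tag and get_trimmed_text are the module's own methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Deliberately sloppy markup: upper-case tag, unquoted attribute, unclosed <p>
my $html = <<'HTML';
<html><body>
<p>Some text <A HREF=about.html>About us</a>
<p><a href="http://example.com/contact" rel="nofollow">  Contact
   the team</a>
</body></html>
HTML

# Hypothetical helper: returns a list of [href, anchor text] pairs
sub extract_links {
  my ($markup) = @_;
  # A reference to a scalar is parsed as the literal document
  my $parser = HTML::TokeParser->new(\$markup);
  my @links;
  # get_tag skips ahead to the next <a> start tag; $tag->[1] is its attribute hash
  while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href};
    next unless defined $href;
    # Collect the text up to the closing </a>, with whitespace tidied up
    my $text = $parser->get_trimmed_text('/a');
    push @links, [$href, $text];
  }
  return @links;
}

printf "%s => %s\n", @$_ for extract_links($html);
```

Tag and attribute names come back lower-cased, so the same code copes with whatever capitalisation the page author used.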

There are a bunch of newer modules out that use XPath selectors to parse HTML, which in my experience are a bit easier to use, though perhaps not quite as powerful.

3. URI

URI is an essential module for manipulating URLs: converting relative URLs to absolute, normalising them, pulling out the host or query string, etc.
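A quick sketch of the sort of thing it handles. The example.com URLs are placeholders, not anything from a real crawl:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# The page the links were found on (placeholder URL)
my $base = 'http://www.example.com/blog/2011/post.html';

# Resolve each href against the base, as a crawler would
for my $href ('../about.html', '/contact', 'page2.html?ref=nav', 'http://other.example.org/') {
  my $abs = URI->new_abs($href, $base);
  print "$href -> $abs\n";
}

# Normalise a messy URL (lower-cases scheme and host, drops the default port)
my $uri = URI->new('HTTP://WWW.Example.COM:80/search?q=perl+seo#top')->canonical;
print "canonical: $uri\n";
print "host:      ", $uri->host,  "\n";
print "path:      ", $uri->path,  "\n";
print "query:     ", $uri->query, "\n";
```

Running everything through canonical before storing it is a cheap way to avoid crawling the same page twice under slightly different spellings.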

4. Scrappy

[Screenshot: finding the XPath/CSS3 selector is easy with Firebug]

Scrappy is a truly awesome module that integrates the WWW::Mechanize and Web::Scraper modules to make scraping and crawling even easier. One of the best features is that you can use XPath or CSS3 selectors to extract info from a webpage rather than labouring over increasingly complex regular expressions. It makes crawling a cinch as well, and supports multi-threaded crawling for speeding up your scripts. Writing a very basic multi-threaded crawler is as simple as:

crawlers 10, $starting_url, {
  'a' => sub {
    # find all links and add them to the queue to be crawled
    queue shift->href;
  }
};

5. Net::Whois::Raw

Quick & easy whois data gathering – for example:

#!/usr/bin/perl -w
use strict;
use Net::Whois::Raw;

print "Enter domain: ";
my $dom = <STDIN>;
chomp($dom);
my $dominfo = whois($dom);

print $dominfo;

6. WWW::Google::PageRank

WWW::Google::PageRank is a great little PageRank pinger that does exactly what it says on the tin – programmatically fetches the PageRank of any URL passed to it :)

7. Geo::IP

Geo::IP is another simple tool that looks up an IP’s country location – useful for all sorts of SEO tools.

8. Spreadsheet::XLS

Tired of exporting plain old CSV files from your tools? Want to export your shiny new SEO reports in an Excel format? Easy, just use Spreadsheet::XLS – it’s surprisingly simple to generate spreadsheets with multiple tabs, rich formatting and more. There’s also a module in development for the newer XLSX file format.

9. Parallel::ForkManager

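Parallel::ForkManager is a simple parallel processing module. Strictly speaking it forks child processes rather than using threads, and it caps how many run at once, which means you can add parallelism to your code and speed up your scripts and scrapers with only a few extra lines.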
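The usual pattern looks something like the sketch below. The "work" here is a stand-in (it just measures the URL string) so the example runs without touching the network; in a real scraper you would fetch and parse the page in the child instead. The %length_of hash is my own naming.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my @urls = qw(http://www.bbc.co.uk/ http://searchtalk.co.uk/ http://www.cpan.org/);
my %length_of;

# Allow at most 3 child processes at any one time
my $pm = Parallel::ForkManager->new(3);

# Runs in the parent each time a child exits, and receives
# whatever data reference the child handed to finish()
$pm->run_on_finish(sub {
  my ($pid, $exit_code, $ident, $signal, $core_dump, $data) = @_;
  $length_of{$ident} = $data->{length} if $data;
});

foreach my $url (@urls) {
  # start() forks: it returns the child's PID in the parent (so the
  # parent moves on to the next URL) and 0 in the child
  $pm->start($url) and next;

  # Child process: stand-in for the real fetching and parsing
  my $result = { length => length $url };

  # Exit the child, passing the result back to the parent
  $pm->finish(0, $result);
}

# Block until every child has finished
$pm->wait_all_children;

print "$_ => $length_of{$_}\n" for sort keys %length_of;
```

Because each child is a separate process, variables changed in a child are not visible to the parent; passing a reference to finish() is how the results get back.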

10. LWP

LWP (Library for WWW in Perl) is perhaps the most well established interface to the web in Perl, with the most used module within it being LWP::UserAgent. However it is perhaps not quite as “plug & play” to use as some of the alternatives like WWW::Mechanize or Scrappy. There is a whole book dedicated to this set of modules – if you’re interested in learning more about scraping and crawling in Perl I’d definitely recommend it.

11. DBI

If you’re going to build SEO tools, you’ll need to interact with a database once you reach a certain level of complexity. DBI is an essential module for interacting with different databases. Most SEOs are probably more familiar with MySQL than other DB types, which DBI handles easily & securely.
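Here is a minimal sketch of the basic DBI workflow. To keep it runnable without a database server I've assumed DBD::SQLite with an in-memory database; the rankings table and its data are invented for the example. For MySQL only the connection string and credentials should need to change (something like "dbi:mysql:database=seo;host=localhost").

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::SQLite is installed; :memory: means nothing is written to disk
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
  { RaiseError => 1, AutoCommit => 1 });

# Hypothetical table for storing rank-tracking results
$dbh->do('CREATE TABLE rankings (keyword TEXT, url TEXT, position INTEGER)');

# Placeholders (?) let DBI handle the quoting, so an apostrophe in a
# keyword can't break (or hijack) the SQL
my $insert = $dbh->prepare(
  'INSERT INTO rankings (keyword, url, position) VALUES (?, ?, ?)');
$insert->execute(@$_) for (
  ['perl seo',      'http://example.com/perl',  3],
  ["o'reilly perl", 'http://example.com/books', 7],
);

# Fetch everything ranking in the top 5
my $rows = $dbh->selectall_arrayref(
  'SELECT keyword, position FROM rankings WHERE position <= ? ORDER BY position',
  undef, 5);
print "$_->[0]: $_->[1]\n" for @$rows;

$dbh->disconnect;
```

The placeholders are where the "securely" part comes from: values never get pasted into the SQL string by hand.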

APIs

I haven’t mentioned any API modules in this list, although there are around 5,000 API modules listed on CPAN, including ones for Facebook, Twitter, Flickr and other less well-known services such as Wordnik.

Have you got any favourites I’ve missed? Share them below!

Comments

  1. Compact Flash Recovery

    Rob,
    I had to bookmark your site just now. I was not aware of the CPAN module WWW::Google::PageRank. I am also taking a quick look at the Facebook APIs. Thanks for sharing the info on these modules.