Does anyone need yet another website crawler?

There’s no shortage of crawlers out there, and for most people, my web-based SEO crawler, Screaming Frog or URL Profiler is more than enough.

There is a gap though – no two websites are alike, yet we look at the same metrics and cookie-cutter recommendations again and again, or at vast tables of complex text data that Excel was never built to analyse or manipulate.

Introducing SECrawl

SECrawl is a search engine, and an enterprise-level website analysis tool that offers: Read more »

In the age of Google’s Panda algorithm, fixing issues with thin and duplicate content is more important than ever. However, this can be a daunting task when faced with a big site that has millions of pages and potentially thousands of duplicate pages, or pages low on unique content – just sifting through and identifying this content can be a big job before you even start addressing it.

In order to address some of these challenges you can now use my free duplicate & thin content checker tool.

The tool is pretty simple to use: Read more »

SEMalt is well known as a major referral spammer that clogs up the analytics of a lot of small and medium-sized websites with fake traffic. Annoyingly, Google Analytics seems to be pretty bad at filtering it out.

There are quite a few posts from miffed webmasters about SEMalt’s activities detailing how to block them, and even a couple of WordPress plugins, but here’s a quick way of blocking them (and others) in nginx and getting a tiny bit of revenge. Read more »

It used to be that writing page titles & meta descriptions was pretty easy – just keep an eye on the character count and you’re good.

However, thanks to recent changes in Google’s SERP layout, and the proliferation of share buttons that grab the same meta data, it’s increasingly hard to strike the balance of a good headline and a good description.

As can be seen from the Facebook post pictured, getting truncated by even one word can strip a headline of its meaning and power. The missing word is ‘terrifying’, which would have made the headline far more compelling and shareable – without it, both the headline and description are pretty boring.

In order to combat this, I’ve built a new real-time optimization tool that allows you to see an approximation of what your page title & description might look like in Google, Facebook & Twitter.
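To illustrate the kind of truncation involved, here is a minimal Python sketch that cuts a title at a word boundary once it exceeds a character limit. It is not how the tool itself works: the function name, the headline and the limit of 55 characters are all made up for the example, and real platforms may truncate by rendered width rather than character count.

```python
def truncate_at_word(text: str, limit: int, ellipsis: str = "...") -> str:
    """Cut text to at most `limit` characters, backing up to a word boundary."""
    if len(text) <= limit:
        return text
    # Reserve room for the ellipsis before cutting.
    room = max(limit - len(ellipsis), 0)
    cut = text[:room]
    # If the cut landed in the middle of a word, drop the partial word.
    if text[room] != " " and " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip() + ellipsis


# Hypothetical headline and limit, chosen only to show a lost final word.
headline = "The Information Google Is Gathering About Us Is Terrifying"
preview = truncate_at_word(headline, limit=55)
print(preview)
```

Running the same function with a different limit for each platform gives a rough side-by-side preview of where each one would cut the title.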

Check out the Page Title optimization preview tool. Read more »

photo by mastahanky

It hit the news last week that the UK Conservative Party deleted a whole host of historical content from their website, all of it dating (conveniently) from before the May 2010 election.

Although the original Computer Weekly article arguably overblew this somewhat by conflating adding lines to a robots.txt file with criminality, it did highlight an important action taken by the webmasters. On the face of it, that action seems fairly reasonable – the public doesn’t own the website’s content or pay for its hosting, so the party is perfectly within its rights to remove content.

However, when the website in question is that of a political party, there should be an implied responsibility to keep content live for historical purposes, rather than attempting to hide content that may prove uncomfortable during elections. Of course, trying to remove historical evidence of political dialogue has disturbing implications.

One of the wonderful (or challenging) things about the world wide web, however, is that simply deleting content from it isn’t that easy; data has a nasty habit of persistence. For a start, search engines keep a local ‘cached’ copy of most pages that they visit, nefarious webmasters scrape content to repurpose on their own sites, websites get mirrored, and content gets shared.

Sadly, the changes on the Tory party website were identified too late to grab any meaningful cached copies of these pages from the major search engines (i.e. before the robots.txt changes were picked up by ‘polite’ search engines such as Google and Bing, and they removed their cached copies of the pages).

One good point raised by Computer Weekly was that adding these (now removed) lines to the robots.txt file also had the effect of retrospectively wiping the pages from the Internet Archive’s database, which attempts to keep a historical record of important sites on the web.
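For anyone unfamiliar with how a ‘polite’ crawler reads robots.txt, here is a rough Python sketch using the standard library’s urllib.robotparser. The rules, the bot name and the example.com URLs are invented for illustration; they are not the lines that were actually added to the party’s file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the real lines from the site are not reproduced here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news/speeches/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that honours the file skips the disallowed path but may fetch others.
speech_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/news/speeches/2009.html")
about_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/about")
print(speech_ok, about_ok)
```

Note that the file only asks crawlers to stay away; anything that ignores it, or that copied the pages before the lines were added, is unaffected.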

However, after a bit of digging in some other places, I have managed to recover all of the speeches and some other content. I’ve put this up on Dropbox (400+MB d/l) for now – I encourage you to download it and do your own analysis; here are a few things I found… Read more »