SEMalt is well known as a major referral spammer that clogs up the analytics of many small and medium-sized websites with fake traffic. Annoyingly, Google Analytics seems to be pretty bad at filtering it out.

There are quite a few posts from miffed webmasters about SEMalt’s activities detailing how to block them, and even a couple of WordPress plugins, but here’s a quick way of blocking them (and others) in nginx and getting a tiny bit of revenge. Read more »
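As a very rough sketch of the technique (the post’s exact rule may differ, and the domains below are just illustrative examples of known referral spammers), the usual nginx approach is to match the Referer header and refuse the request:

```nginx
# Flag requests whose Referer header matches a spam domain.
# Note: the map block belongs in the http {} context of nginx.conf;
# patterns prefixed with ~* are case-insensitive regexes.
map $http_referer $bad_referer {
    default                      0;
    ~*semalt\.com                1;
    ~*buttons-for-website\.com   1;
}

server {
    listen 80;
    server_name example.com;  # placeholder

    # Refuse spam referrals outright. (One 'revenge' variant redirects
    # these requests back at the spammer's own site, which may be what
    # the post has in mind.)
    if ($bad_referer) {
        return 403;
    }
}
```

Since SEMalt’s bot actually crawls sites (rather than injecting hits straight into Analytics), a server-level block like this should keep it out of your stats entirely.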

It used to be that writing page titles & meta descriptions was pretty easy – just keep an eye on the character count and you’re good.

However, thanks to recent changes in Google’s SERP layout, and the proliferation of share buttons that grab the same metadata, it’s increasingly hard to strike a balance between a good headline and a good description.

As can be seen from the Facebook post pictured, being truncated by even one word can strip a headline of its meaning and power. The missing word here is ‘terrifying’, which would make it far more compelling and shareable; without it, both the headline and description are pretty boring.

In order to combat this, I’ve built a new real-time optimization tool that allows you to see an approximation of what your page title & description might look like in Google, Facebook & Twitter.

Check out the Page Title optimization preview tool. Read more »
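The underlying idea is easy to sketch in Python; note that the 50-character limit below is a made-up illustration, since real platforms truncate by pixel width and each one’s limits differ:

```python
# Illustrative only: real platforms truncate by pixel width, and their
# limits change over time, so any tool can only approximate the cut-off.
def truncate(text: str, limit: int) -> str:
    """Cut text to roughly `limit` characters at a word boundary,
    as a share card or SERP snippet would."""
    if len(text) <= limit:
        return text
    cut = text[:limit].rsplit(" ", 1)[0]  # drop any partial final word
    return cut + "…"

headline = "The Information Google Is Gathering About Us Is Terrifying"
# With a hypothetical 50-character card limit, the crucial last word goes:
print(truncate(headline, 50))
# -> The Information Google Is Gathering About Us Is…
```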

photo by mastahanky

It hit the news last week that the UK Conservative Party had deleted a whole host of historical content from their website, all of it (conveniently) dating from before the May 2010 election.

Although the original Computer Weekly article arguably overblew this somewhat by conflating adding lines to a robots.txt file with criminality, it did highlight an important action taken by the webmasters. On the face of it, that action seems fairly reasonable: the public doesn’t own website content or pay for its hosting, so the party is perfectly within its rights to remove content.

However, when the website in question is that of a political party, there should be an implied responsibility to keep content live for the historical record, rather than hiding material that may prove uncomfortable during elections. Trying to erase the historical evidence of political dialogue has disturbing implications.

One of the wonderful (or challenging) things about the world wide web, however, is that simply deleting content isn’t that easy; data has a nasty habit of persisting. For a start, search engines keep a local ‘cached’ copy of most pages they visit; nefarious webmasters scrape content to repurpose on their own sites; websites get mirrored; and content gets shared.

Sadly, the changes on the Tory party website were identified too late to grab meaningful cached copies of the pages from the major search engines, i.e. before the robots.txt changes were picked up by ‘polite’ search engines such as Google and Bing, which then removed their cached copies.

One good point raised by Computer Weekly was that adding these (now removed) lines to the robots.txt file also had the effect of retrospectively wiping the affected pages from the Internet Archive’s database, which attempts to keep a historical record of important sites on the web.
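To be clear about the mechanism: at the time, the Wayback Machine applied a site’s current robots.txt retroactively, so a directive like the hypothetical one below (the party’s actual lines aren’t reproduced here) hides already-archived copies as well as blocking future crawls:

```
# Hypothetical illustration only, not the Conservative Party's actual
# robots.txt entries.
User-agent: *
Disallow: /some-archived-section/
```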

However, after a bit of digging in some other places, I have managed to recover all of the speeches and some other content. I’ve put this up on Dropbox (400+MB d/l) for now – I encourage you to download it and do your own analysis; here are a few things I found… Read more »

One of the many great things about Mozilla’s MozFest is that it is centred on making, interacting and hands-on learning, rather than sitting back and being ‘entertained’ as is the way with most one-directional conferences I’ve been to.

During the session on the Journalist’s Toolbox on Sunday, we were asked to get into groups and brainstorm how we’d use these tools to help tell a news story. Together with @MarcusAsplund and @TomWills we decided to focus on the big storm that was due to hit the UK on Monday. After a bit of cursory research, we found the hashtag #ukstorm was being used frequently to refer to it on Twitter. Read more »

This week the Telegraph published a blog post about why we shouldn’t teach kids to code, because coding is a boring hobby for dull weirdos who make dull & boring things (I paraphrase, only slightly). I don’t feel the need to fuel the flames by responding to the article (just read the comments/look on Twitter), but you couldn’t get a much clearer illustration of how wrong this is than by going to Mozilla’s MozFest, which kicked off yesterday.

The opening “Science Fair” (with free beer & everything) is a chance for coders/hackers who are passionate about the (open) web to show off their pet projects. It’s a truly impressive display of innovation and imagination, and of the real, practical applications that even a little bit of coding knowledge can be put to. Below is a brief run-down of just some of the projects on show.

BBC News Labs (@bbc_news_labs)

The BBC was showing off its “Juicer” tool, which utilises Natural Language Processing (NLP) – I didn’t catch which system, though Open Calais was mentioned – to scrape newly published content and extract the entities and events it mentions. The aim is to bring the BBC’s antiquated publishing system into the 21st century by building a semantic catalogue of topics for journalists to use when writing articles, and ideally for readers to use when browsing or searching for content. Update: found more info on this project here and here. Read more »
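To give a flavour of the underlying technique, here is generic named-entity extraction in Python; this is emphatically not the BBC’s pipeline, and spaCy is used purely as a convenient stand-in for whatever NLP system Juicer relies on:

```python
# Generic named-entity extraction, in the spirit of what Juicer does.
# spaCy is a stand-in here; the BBC's actual system (possibly Open
# Calais) and its output schema will differ.
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

article = ("BBC executives met government officials in London on Monday "
           "to discuss coverage of the upcoming election.")

doc = nlp(article)

# Group the recognised entities by type, sketching one entry in a
# 'semantic catalogue' of topics.
catalogue: dict[str, set[str]] = {}
for ent in doc.ents:
    catalogue.setdefault(ent.label_, set()).add(ent.text)

for label, names in sorted(catalogue.items()):
    print(label, sorted(names))
# e.g. DATE ['Monday'], GPE ['London'], ORG ['BBC']
```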