These days most crawling and scraping tools use XPath or, increasingly, CSS3 selectors to parse HTML rather than the lengthy and abstruse regular expressions of old.
However, unlike regexes, there’s not much in the way of online testing tools for CSS3 selectors (i.e. none that I could find), so using the excellent CSS selector support in Mojolicious I knocked up a simple CSS3 selector tester (say that 3 times fast) in about an hour. Read more »
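As a rough illustration of the selector-versus-regex point above, here is a minimal sketch in Python using only the standard library. It is not the tester mentioned in the post (that one is built on Mojolicious); the markup, URLs and function names are made up for the example, and `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so treat it as a toy rather than a scraping recipe.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet invented for this example.
# Real-world HTML is rarely this tidy and usually needs a forgiving HTML parser.
SNIPPET = """
<div id="posts">
  <div class="post"><h2><a href="/css3-tester">CSS3 selector tester</a></h2></div>
  <div class="post"><h2><a href="/bkmrx">bkmrx</a></h2></div>
  <p><a href="/about">About</a></p>
</div>
"""


def links_by_path(markup):
    """Pick out post links with a path query (ElementTree's XPath subset)."""
    root = ET.fromstring(markup.strip())
    # Roughly what the CSS selector "div.post > h2 > a" would match.
    # Note: [@class='post'] is an exact string match, unlike CSS ".post".
    anchors = root.findall(".//div[@class='post']/h2/a")
    return [(a.get("href"), a.text) for a in anchors]


def links_by_regex(markup):
    """The same extraction as a regex: it works here, but is tied to the exact markup."""
    pattern = re.compile(r'<div class="post"><h2><a href="([^"]+)">([^<]+)</a>')
    return pattern.findall(markup)


if __name__ == "__main__":
    for href, text in links_by_path(SNIPPET):
        print(href, text)
```

Both functions return the same pairs for this snippet, but the regex stops matching as soon as an attribute is added or the whitespace between tags changes, while the path query keeps working.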
I just noticed that my blog’s been dropped by Google for the past few weeks:
Slightly odd, since I haven’t changed anything or even logged in for a month or so. My first instinct was to check Google Webmaster Tools for messages or outages of the blog, since I have been doing some crawl testing on the site. The only thing I found was a rather weird notice about a few pages showing “soft 404s”. That seems an unlikely cause: it doesn’t appear to be an issue now and, given it appeared ~20 days before the penalty/filter, it probably isn’t the culprit. Read more »
I’ve mentioned before that having a pet project is a key incentive to learn more in the world of coding.
I wanted a bookmarking service that was better than delicious.com and allowed full text search, and there didn’t seem to be any out there that I liked, so I decided to try and build one in my spare time. When I started on bkmrx.com, beyond that vague notion, I had no idea how all the pieces would slot together. I had to do a lot of research, planning and scoping when working on it – below are 20(ish) cool projects I found in the course of building it: Read more »
TL;DR: check out bkmrx.com
Online bookmarking is a pain. Most services out there treat bookmarks like their offline equivalents – linear and added once. Once you’ve added enough of them, there’s no way of recovering the page or information you suddenly want without going through all of them one by one.
To a certain extent, Google solved this problem for a while, but with personalisation, user profiling, constantly shifting algorithms and +ification it’s arguably now harder to find what you’re looking for instead of what Google wants you to be finding. You certainly can’t be confident that the same search you perform today will return the same results tomorrow, and recalling the right combination of keywords you used to find something can be difficult. For certain topics, such as academic or niche searches, Google seems to return too much noise and not enough signal these days to be useful, even when you know the right result is out there. Read more »
As briefly touched upon in a previous post straight after the launch of Firefox 14 (i.e. the advent of Google’s SSL search being on by default), the amount of (not provided) traffic has increased in aggregate across a number of our sites.
Now a couple of weeks after its launch Read more »