A test to verify whether pages are blocked from access by search engines.
Robots Exclusion Protocol
The robots exclusion protocol is a voluntary system that webmasters use to tell search engine crawlers which pages on their site they do or do not want indexed.
Because these messages are, by design, only meant to be seen by the automated crawlers that search engines send out, it can be difficult to debug whether a page is visible to search engines, especially since there are several methods of communicating these preferences:
robots.txt files
A robots.txt file must sit at the root of a domain or sub-domain.
Google offers a robots.txt analysis tool within its Webmaster Tools, but it is limited in a few ways:
- You are limited to 5,000 characters' worth of URLs per request (i.e. not that many)
- It only checks against robots.txt, leaving two other potential search engine blocking avenues unchecked
- There's no easy way to download results to a spreadsheet
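If you want to check URLs against a robots.txt file yourself, without those limits, Python's standard library includes a parser for the format. A minimal sketch (the robots.txt content and example.com URLs here are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow that
# crawler to fetch that URL
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```

In a real check you would fetch the live file first (e.g. with `RobotFileParser(url)` plus `read()`), then run each URL on your list through `can_fetch` and write the results out to a spreadsheet yourself.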
robots meta tags
robots meta tags are an HTML-level option that can be set per page, and they don't advertise the pages you don't want crawled as publicly as a robots.txt file does.
You can find one in the <head> section of a page by looking for something similar to this:
<meta name="robots" content="noindex">
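Checking for this tag across many pages is easy to script. A minimal sketch using Python's standard `html.parser` (the page source below is hypothetical; in practice you would feed it the fetched HTML of each URL you want to test):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

# Hypothetical page source for illustration
html = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.directives)  # ['noindex']
```

Any page whose collected directives include "noindex" is asking search engines not to index it, regardless of what robots.txt says.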
robots HTTP headers
Robots HTTP headers are often overlooked when troubleshooting because they are invisible to users viewing the page, and they are left out of most SEO tools because they are not as commonly used as the two options above.
A robots exclusion HTTP header looks like the following:
X-Robots-Tag: noindex
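Because the X-Robots-Tag header travels with the HTTP response rather than the HTML, you need to inspect response headers to find it. A minimal sketch of interpreting the header value, assuming you have already fetched the response headers as a dict (the sample headers below are hypothetical):

```python
def is_blocked_by_header(headers: dict) -> bool:
    """True if an X-Robots-Tag header contains a noindex directive.

    The "none" directive is shorthand for "noindex, nofollow", so it
    also counts as blocking.
    """
    value = headers.get("X-Robots-Tag", "")
    directives = [d.strip().lower() for d in value.split(",")]
    return "noindex" in directives or "none" in directives

# Hypothetical response headers for illustration
print(is_blocked_by_header({"X-Robots-Tag": "noindex, nofollow"}))  # True
print(is_blocked_by_header({"Content-Type": "text/html"}))          # False
```

To get the live headers for a URL you could issue a HEAD request (e.g. with `urllib.request.Request(url, method="HEAD")`) and pass the response's headers to this check.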