Google Webmaster Tools Revamps Crawl Errors, But Is It For The Better?
Mar 12, 2012 at 10:04pm ET by Vanessa Fox
Google has just revamped the crawl errors data available in webmaster tools. Crawl errors are issues Googlebot encountered while crawling your site, so useful stuff!
I originally started this article by writing that in most cases, these changes are for the better and in only a few (really maddening) cases, useful functionality has been removed. But now that I’ve gone through the changes, I unfortunately need to revise my summary. This update is mostly about removing super useful data, masked by a few user interface changes. (And I hate to write that, because webmaster tools is near and dear to my heart.)
So what’s changed?
Site vs. URL Errors
Crawl errors have been organized into two categories: site errors and URL errors. Site errors are those which are likely site-wide, as opposed to URL-specific.
Site errors are categorized as:
- DNS – These errors include things like DNS lookup timeout, domain name not found, and DNS error. (Although these specifics are no longer listed, as described more below.)
- Server Connectivity – These errors include things like network unreachable, no response, connection refused, and connection reset. (These specifics are also no longer listed.)
- Robots.txt Fetch – These errors are specific to the robots.txt file. If Googlebot receives a server error when trying to access this file, they have no way of knowing if a robots.txt file exists, and if so, what pages it blocks, so they stop the crawl until they no longer get an error when attempting to fetch it.
URL errors are categorized as:
- Server error – These are 5xx errors (such as a 503 returned during server maintenance).
- Soft 404 – These are URLs that are detected as returning an error page but don’t return a 404 response code (they typically have a response code of 200 or 301/302). Error pages that don’t return a 404 can hurt crawl efficiency as Googlebot can end up crawling these pages instead of valid pages you want indexed. In addition, these pages can end up in search results, which is not an ideal searcher experience.
- Access denied – These are URLs that returned a 401 response code. Often this simply means that the URLs prompt for a login, which is likely not an error. You may, however, want to block these URLs from crawling to improve crawl efficiency.
- Not found – Typically, these are URLs that return a 404 or 410.
- Not followed – These appear to be 301 and 302 status codes. This report used to list URLs that weren’t followed for reasons such as too many redirects or redirect loops. A 301 shouldn’t be classified as “not followed” since it’s followed just fine. (So is this report really a list of 301s or is it a list of 301s with problems? No idea.) See “what’s gone missing” below for more on this.
- Other – This is a catch-all that includes status codes such as 403.
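As a rough mental model, the URL-error buckets above map onto status codes something like this. (A Python sketch: the bucket names come from the report itself, but the actual classification logic is Google's, and this is only my approximation.)

```python
def classify_crawl_error(status, looks_like_error_page=False):
    """Approximate the revamped report's URL-error buckets from an
    HTTP status code. `looks_like_error_page` stands in for Google's
    content-based soft-404 detection, which isn't public."""
    if looks_like_error_page and status not in (404, 410):
        return "Soft 404"        # error page served with a 200 or 301/302
    if status in (301, 302):
        return "Not followed"    # per the new report, oddly
    if status == 401:
        return "Access denied"
    if status in (404, 410):
        return "Not found"
    if 500 <= status <= 599:
        return "Server error"    # includes 503 during maintenance
    return "Other"               # catch-all, e.g. 403
```

Note the oddity the sketch reproduces: a plain 301 lands in "Not followed" even though nothing actually failed.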
Trends Over Time
Google now shows trends over the last 90 days for each error type. The daily count seems to be the aggregate count of how many URLs with that error type Google knows about, not the number crawled that particular day. As Google recrawls a URL and no longer gets the error, it’s removed from the list (and the count).
In addition, Google still lists the date Googlebot first encountered the error, but now when you click the URL to see the details, you can see the last time Googlebot tried to access the URL as well.
Priorities and Fixed Status
Google says they are now listing URLs in priority order, based on a “multitude” of factors, including whether or not you can fix the problem, if the URL is listed in your Sitemap, if it gets a lot of traffic, and how many links it has. You can mark a URL as fixed and remove it from the list. However, once Google recrawls that page, if the error still exists, it will return to the list.
Google suggests using the Fetch as Googlebot feature to test your fix (and in fact now has a button right on the details page to do so), but since you are allowed only 500 fetches per account (not per site) each week (which I believe has increased from the previous limit), you should use this functionality judiciously.
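You can also spot-check a fix yourself before spending one of those 500 weekly fetches, by requesting the URL with a Googlebot-style User-Agent. A minimal sketch (this only mimics Googlebot's User-Agent string; it is not the real Fetch as Googlebot, and sites that vary responses by user agent may behave differently):

```python
import urllib.request
import urllib.error

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def make_googlebot_request(url):
    # Build a request that presents a Googlebot-style User-Agent.
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

def check_fix(url, timeout=10):
    # Return the status code the URL serves to this User-Agent,
    # including 4xx/5xx codes rather than raising on them.
    try:
        with urllib.request.urlopen(make_googlebot_request(url),
                                    timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as e:
        return e.code
```

Once `check_fix` returns a 200 on a previously erroring URL, it's probably worth spending a real Fetch as Googlebot request to confirm.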
What’s Gone Missing?
Unfortunately, several pieces of important functionality have been lost with this change.
- Ability to download all crawl error sources. Previously, you could download a CSV file that listed URLs that returned an error along with the pages that linked to those URLs. You could then sort that CSV by linking source to find broken links within your site, and you had an easy list of sites to contact to fix links to important pages of your site. Now, the only way to access this information is to click an individual URL to view its details, then click the Linked From tab. There seems to be no way to download this data, even at the individual URL level.
- 100K URLs of each type. Previously, you could download up to 100,000 URLs with each type of error. Now, both the display and download are limited to 1,000. Google says “less is more” and “there was no realistic way to view all 100,000 errors—no way to sort, search, or mark your progress.” Google is wrong. There were absolutely realistic ways to view, sort, search, and mark your progress. The CSV download made all of this easy using Excel. And more data is always better for seeing patterns, especially for large-scale sites with multiple servers, content management systems, and page templates. A lot has been lost here.
- Redirect errors – Inexplicably, the “not followed” errors no longer seem to list errors like redirect loop and too many redirects. Instead, the report simply lists the response code returned (301 or 302). This seems weird to me (not to mention far less useful), as 301s are followed just fine and typically aren’t an error at all (and 302s are only sometimes problematic), but all the redirect errors that used to be listed are critical to know about and fix. Listing URLs that return a 301 status code as “not followed” is misleading and alarming for no reason. And if this list of URLs is actually those with redirect errors, then omitting what that error is (such as too many redirects) makes this data nearly useless.
- Specifics about soft 404s. The soft 404 report used to specify whether the URLs listed returned a 200 status code or redirected to an error page. But the status code column appears to be empty now.
- URLs blocked by robots.txt. Google says they removed this report because “while these can sometimes be useful for diagnosing a problem with your robots.txt file, they are frequently pages you intentionally blocked”. They say that similar information will soon be available in the crawler access section of webmaster tools. Why remove data you’re planning to replace before replacing it? Couldn’t they have just moved this report to the crawler access section? I get the feeling that they won’t be replacing this report as is, but providing less granular data in its place. While it’s true that this report didn’t necessarily list errors, it was very useful. You could skim the CSV to see if any sections of pages you expected to be indexed were blocked. And it was critical for diagnosis: why aren’t certain pages indexed? You could check this report before spending extensive time debugging the issue. But now you can’t do either of those things.
- Specifics about site level errors. The previous version of these reports listed the specific problem (such as DNS lookup timeout or domain name not found). That was very helpful in digging into what was going on. Now, you only get the count for the general category, not the specifics of what kind of error it was within that category.
- Specific URLs with “site” level errors. Google says you don’t need to know the URL if the issue was at the site level. Mostly, this is likely true. But I’ve definitely encountered cases, particularly with DNS errors, where the error only happened with specific URLs, not the entire site. Knowing the URL that triggered the error would help track down issues in these cases.
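To illustrate the kind of analysis the old CSV download made trivial, here’s a sketch that groups crawl-error rows by linking page, so you can see at a glance which pages host the most broken links. (The column names `url` and `linked_from` are illustrative, not Google’s actual export format.)

```python
from collections import defaultdict

def broken_links_by_source(rows):
    """Group crawl-error rows by the page that links to each broken
    URL; pages with the most broken outbound links sort first."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["linked_from"]].append(row["url"])
    return sorted(by_source.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Feed it the rows from a `csv.DictReader` over whatever export you still have, and the top of the result is your fix-first list.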
I have questions in to Google about the removed functionality (particularly the confusing changes, such as the Not Followed errors) and I’ll update this story as I hear back. If, in fact, the data still exists and I’m just overlooking it, I’ll be happy to update this story to say that I’m wrong.
I get the sense that many of these recent changes are designed to make the data easier for small site owners to use, and don’t really have the large enterprise-level site (or agency) in mind. For these latter organizations, more data is better, as we have systems to parse and crunch the data.
Of course, in part, I’m sad to see features that I worked hard on launching when I was product manager for webmaster central be dismantled and made less useful. But mostly, as a frequent user of the product, I don’t want to lose useful functionality.
About The Author: Vanessa Fox is a Contributing Editor at Search Engine Land. She shares her perspectives on how search impacts marketing, user experience, and web development at ninebyblue.com. Her book, Marketing in the Age of Google, provides a blueprint for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.