During the Q&A portion of John Mueller’s 09/17/2021 hangout, one webmaster asked about a steady increase of random 404s that were not a part of their website.
The pages don’t exist in their sitemap, nor were they generated by the site’s internal search. Because of this, the webmaster believed that Google searches were being appended to their URLs and that Google was trying to crawl the resulting pages.
The webmaster wanted to know how to make sure these URLs don’t impact their overall crawlability and indexability.
John explained that Google doesn’t make up URLs. It’s likely that these are random links they found on the web—probably from some scraper site that’s scraping things in a bad way.
When they find these links, they crawl them, see that they return a 404, and start ignoring them.
John said that it’s not something that the webmaster actively has to take care of. If the URLs don’t exist on the actual website, that’s fine.
This conversation occurs at the 24:40 mark in the video.
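The key detail in John’s answer is that these stray URLs genuinely return a 404 status, which is what lets Google learn to ignore them. If you want to confirm that a randomly appended URL on your own site really does return 404 rather than a “200 OK” error page (a soft 404), a minimal sketch using only Python’s standard library is shown below; the URL is a placeholder, not one of the webmaster’s actual pages.

```python
import urllib.request
import urllib.error

def response_status(url):
    """Return the HTTP status code the server sends for a given URL."""
    request = urllib.request.Request(url, headers={"User-Agent": "status-check"})
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Non-2xx responses (including 404) are raised as HTTPError exceptions.
        return error.code

# A made-up URL of the kind a scraper might invent; ideally it comes back
# as 404 so Google can crawl it once and then start ignoring it.
print(response_status("https://example.com/some-random-appended-search-string"))
```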
John Mueller 09/17/2021 Hangout Transcript
Our Search Console crawl stats report is showing a steady increase of 404 pages that are not a part of our site. They don’t exist in our sitemap, nor are they generated by internal search. They appear to be Google searches that are being appended to our URLs and that Google is trying to crawl. Under the crawl response breakdown, these 404s make up over 40% of the crawl responses. How do we make sure this doesn’t negatively affect our crawlability and indexability?
John 25:08 (Answer)
So I think, first of all, we don’t make up URLs. So it’s not that we will take Google searches and then make up URLs on your website. My guess is that these are just random links that we found on the web, maybe from some scraper site that is scraping things in a bad way, or something like that. So that’s something that happens all the time.
And we find these links, and then we crawl them, we see that they return 404, and then we start ignoring them. So in practice, this is not something that you have to take care of. If they’re 404, if they don’t exist, then that’s fine. That’s the way it should be. And usually, what happens with these kinds of links is, we try to figure out overall, for your website, which URLs we need to be crawling and at which frequency. And then we take into account, after we’ve worked out what we absolutely need to do, what we can do kind of additionally.
And in that additional bucket, which is also like a very, I think, like a graded set of URLs, essentially, that would also include things like random links from scraper sites, for example. So if you’re seeing that we’re crawling a lot of URLs on your site that come from these random links, essentially, you can assume that we’ve already finished with the crawling of the things that we care about, that we think your site is important for. And we just have time and capacity on your server. And we’re just going to try other things as well. So from that point of view, it’s not that these 404s would be causing issues with the crawling of your website, it’s almost more a sign that, well, we have enough capacity for your website.
And if you happen to have more content that you actually link within your website, we would probably crawl and index that too. So essentially, it’s almost like a good sign. And you definitely don’t need to block these by robots.txt, it’s not something that you need to suppress. It’s just super common on the web that sites link in a random way, or maybe they have broken HTML, and then we end up discovering a bunch of URLs like this.
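If you want an independent view of the “over 40% of the crawl response” figure the webmaster mentioned, one option is to look at your own server access logs rather than relying solely on the Search Console report. The sketch below is a rough, hedged example: it assumes an Apache/Nginx combined-format access log at a hypothetical path (access.log), and it filters by a simple "Googlebot" substring in the user agent, which does not verify that the requests genuinely come from Google (scrapers can spoof it).

```python
import re
from collections import Counter

# Hypothetical path; point this at wherever your server writes its access log.
LOG_PATH = "access.log"

# Rough pattern for combined log format lines such as:
# ... "GET /some/path HTTP/1.1" 404 1234 "referer" "user agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)

status_counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            status_counts[match.group("status")] += 1

total = sum(status_counts.values())
if total:
    share_404 = 100 * status_counts.get("404", 0) / total
    print(f"Googlebot requests: {total}, 404 share: {share_404:.1f}%")
```

As John’s answer suggests, a high 404 share from these random links is not by itself a problem; the more useful check is whether the URLs your site actually links to are being crawled and returning the responses you expect.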