An SEO professional was concerned about crawling and their crawl budget.
They have a WordPress website that automatically generates multiple unwanted URLs.
They want to know whether there is a way to stop Google's crawler from crawling these URLs.
They are aware they can noindex the URLs. But then, they still see them in Google Search Console under the Excluded URLs report.
John explained that they shouldn't worry about crawl budget on a site with around 10,000 URLs. Google can crawl that many URLs fairly quickly, typically within a day.
Also, Google has to crawl a page in order to see the noindex meta tag. So if you want Google to see the noindex, you can't avoid having them crawl the page, and they will need to check these pages from time to time.
However, a side effect of the noindex tag is that, over time, Google will crawl those pages less often. They will still double-check every now and then, but not as often as a normal page that is indexed.
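For reference, the noindex rule John is describing is normally added as a robots meta tag in the page's head; the snippet below is just a typical example, and WordPress SEO plugins usually add it for you:

    <meta name="robots" content="noindex">

For non-HTML files, the same rule can be sent as an X-Robots-Tag: noindex HTTP response header. Either way, Googlebot only sees the rule after it has fetched the page, which is why the crawling itself can't be avoided with noindex.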
One other approach is to use robots.txt and just block the crawling of the pages completely.
The main disadvantage of doing this is that the URL itself can sometimes still be indexed in the search results, without any of the page's content, just the URL alone.
Also, if the content is similar to other pages on the site, and some pages are blocked in robots.txt while others are not, people searching normally will simply find the crawlable versions. Only someone explicitly searching for the blocked URLs, such as with a site: query, would come across them.
Practically speaking, though, a noindex directive and a robots.txt block are roughly equivalent, in the sense that the content won't appear in the search results. But keep in mind that the two don't combine: if a page is blocked from crawling in robots.txt, Google cannot fetch it, and therefore will never see a noindex tag on that page.
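As an illustration, a robots.txt block for this kind of situation could look something like the lines below. The /example-unwanted-path/ pattern is purely hypothetical; the real rule would need to match whatever URLs WordPress is generating on the site in question:

    User-agent: *
    Disallow: /example-unwanted-path/

Note that, per John's point above, any noindex tag on pages matched by this rule would never be seen, because Googlebot is not allowed to fetch those pages at all.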
This happens at approximately the 46:04 mark in the video.
John Mueller Hangout Transcript
SEO Professional 7 46:04
Yeah, I hope I'm audible and my voice is clear. Okay, so it's a good question about crawling. The website I'm talking about is a WordPress website; it automatically generates multiple URLs, unwanted URLs. What I'm wanting to understand: is there a way that I can stop the crawler to find out what these URLs are?
I know I can noindex, you know, those are all noindexed URLs. But then I can see them on the Search Console under the excluded part. I mean, is this going to affect the crawl budget of the website? Because it's a news website, we have 1000s of URLs. And if the Google crawler is crawling unwanted URLs, don't they?
It’s not getting indexed. But is it? Is there a way to stop it? And is it gonna affect the crawling budget?
John 46:51
How big is your website overall?
SEO Professional 7 46:56
Um, more than 5,000 to 10,000 URLs…
John 47:00
Okay. Okay, so I think at that size of a website, on the one hand, I would not worry about the crawling budget, because we can crawl that many pages fairly quickly, usually within a day. Yeah. But the other thing, I think, with regards to noindex is that the noindex is a meta tag on the page. So we have to crawl the page to see the meta tag, which means you can't avoid that we check the noindex pages. We have to check them from time to time.
However, if we see that there's a noindex on the page, then usually over time, we crawl those pages less often. So we will still double-check every now and then, but we won't check as much as a normal page that is otherwise indexed. The other approach is to use robots.txt; with the robots.txt file, you can block the crawling of those pages completely. The disadvantage is that sometimes the URL itself can be indexed in the search results, not the content on the page, but just the URL alone.
And if the content is otherwise similar to things that you have on your website, so if you, I don’t know, for example, have a football news website, and you have some articles that are blocked and some articles that are allowed for crawling, then if someone is searching for football news, they will find the indexable versions of your pages.
And it won't matter that there are other pages that are blocked by robots.txt. However, if someone explicitly does a site: query for those blocked pages, then you would be able to see those URLs in search. But again, for the normal user, they're basically not visible. So from that point of view, in a situation like yours, on the one hand, I would not worry about the crawl budget.
And from a practical point of view, both the noindex and the robots.txt would be kind of equivalent, in the sense that this content would probably not appear in the search results. And like, we would still need to crawl it if there's a noindex, but the numbers are so small that they don't really matter. We might still index it with just the URL if it's blocked by robots.txt.
But again, if you have other content on your website that's similar, it doesn't really matter. So which one should you choose? I, I don't know. I would choose the one that is easier to implement on your side. And if, for example, you have WordPress and you can just have a checkbox on the posts that says noindex this page, maybe that's the easiest approach.
It kind of depends on what kind of setup that you have.