An SEO professional was concerned about crawling and their crawl budget.
They have a WordPress website that automatically generates multiple unwanted URLs.
They are wondering whether there is a way to stop the crawler from fetching these URLs once they work out exactly which ones they are.
They are aware they can noindex the URLs, but then they still see them in Google Search Console under the Excluded URLs report.
John explained that they shouldn't be worrying about crawl budget on a site with around 10,000 URLs. Google can crawl that many URLs fairly quickly, typically within a day.
Also, Google has to crawl a page in order to see the noindex meta tag, so you can't avoid having the page crawled if you want Google to see that tag. Google will need to check these pages from time to time.
However, a side effect of the noindex tag is that, over time, Google will crawl those pages less often. They will still double-check every now and then, but not as often as a normal page that is indexed.
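As a quick illustration (this snippet is generic, not something shown in the video), applying noindex just means placing a standard robots meta tag in the head of each unwanted page:

<meta name="robots" content="noindex">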
One other approach is to use robots.txt and just block the crawling of the pages completely.
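As a rough sketch, a robots.txt rule for this could look like the following, where /example-unwanted-path/ is only a placeholder for whatever pattern the auto-generated URLs share:

User-agent: *
Disallow: /example-unwanted-path/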
The main disadvantage of doing this is that the URL itself can still be indexed in the search results: not any of the content of the page, just the URL on its own.
Also, if the content is quite similar to other pages you have created in the past, and some of those pages are blocked in robots.txt while others are not, then people searching for those pages directly could still come across the blocked URLs.
Practically speaking, though, a noindex directive and a robots.txt block are roughly equivalent in the sense that the content won't appear in the search results. But make no mistake: if a page is blocked from crawling in robots.txt and also has a noindex on it, Google can only honour one or the other. They will never process the noindex, because they cannot crawl the page while it is blocked in robots.txt.
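In other words, if the hypothetical Disallow: /example-unwanted-path/ rule from above is in place, a <meta name="robots" content="noindex"> on those pages is never fetched, so the noindex has no effect. Choose one mechanism per URL rather than stacking both.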
This happens at approximately the 46:04 mark in the video.