During a recent hangout, an SEO professional asked John Mueller, in the submitted question-and-answer segment, why Google indexes parameter URLs.
The question was: why do parameter URLs end up in Google's index even though they have been excluded from crawling with the robots.txt file and with the parameter settings in Google Search Console?
How do they get parameter URLs out of the index again, without endangering the canonical URLs?
John explained that there’s likely a general assumption that parameter URLs are bad for a website.
This is not the case, and you definitely don't need to fix your site's indexed URLs to get rid of all parameter URLs.
From this perspective, John sees it as something where you're basically polishing the site a bit to make it better, but it's not critical.
Regarding the robots.txt file and the parameter handling tool: the parameter handling tool is usually the place where you can manage this.
However, John thinks the parameter handling tool is a bit hard to find and hard for most people to use.
So personally, he would try to avoid it.
Instead, he suggests the more scalable approach of the robots.txt file, though you're welcome to use the parameter tool in Google Search Console.
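For example, if the URL variations in question were created by query parameters such as sort and sessionid (hypothetical names used only for illustration), a robots.txt sketch along these lines would block them from crawling:

    User-agent: *
    # Block crawling of URL variations that carry these query parameters
    Disallow: /*?*sort=
    Disallow: /*?*sessionid=

The exact patterns depend on how the parameters actually appear in your URLs, so rules like these should be tested before they go live.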
With the robots.txt file, you're preventing crawling of these URLs, not indexing of these URLs. This means that if you do something like a site: query for those URLs, it's highly likely that you will still find them in the index, even without the content itself being indexed.
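As an illustration, a search like the following (example.com and the parameter name are placeholders) can still surface robots.txt-blocked URLs, even though Google never crawled their content:

    site:example.com inurl:sessionid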
This exchange happens at approximately the 30:31 mark in the video.
John Mueller Hangout Transcript
John (Submitted Question) 30:31
Then a question about parameter URLs. Why do parameter URLs end up in Google’s index, even though we’ve excluded them from crawling with the robots.txt file, and with the parameter settings in Search Console? How do we get parameter URLs out of the index again, without endangering the canonical URLs?
John (Answer) 30:49
So I think there’s a general assumption here that parameter URLs are bad for a website. And that’s not the case. So it’s definitely not the case that you need to kind of like fix the indexed URLs of your website to get rid of all parameter URLs. So from that point of view, it is like I would see this as something where you’re kind of like polishing the website a little bit to make it a little bit better. But it’s not something that I would consider to be critical.
With regards to the robots.txt file, and the parameter handling tool, usually the parameter handling tool is the place where you could do these things. My feeling is, the parameter handling tool is a little bit hard to find and hard to use by people. So personally, I would try to avoid that and instead use a more scalable approach in the robots.txt file. But you’re welcome to use it in Search Console.
With the robots.txt file, you essentially prevent crawling of those URLs, you don’t prevent indexing of those URLs. And that means that if you do something like a site query for those specific URLs, it’s very likely that you’ll still find those URLs in the index, even without the content itself being indexed. And I took a look at the forum thread that you started there, which is great. But there, you also do kind of this fancy site query where you pull out these specific parameter URLs.
And that's something where, if you're looking at URLs that you're blocking by robots.txt, then I feel that is a little bit misleading, because you can find them if you look for them. But it doesn't mean that they cause any problems, or that there's any kind of issue that a normal user would see in the search results. So just to elaborate a little bit: if there's some kind of term on those pages that you want to be found for, and you have one version of those pages that is indexable and crawlable, and another version of the page that is not crawlable, where we just have that URL indexed by itself, then if someone searches for that term, we would pretty much always show the page that we actually have crawled and indexed.
The page that we theoretically also have indexed, because it's blocked by robots.txt, could theoretically also have that term in there. But that's something where it wouldn't really make sense to show it in the search results, because we don't have as much confirmation that it matches that specific query. So from that point of view, for normal queries, people are not going to see those roboted URLs. It's more that if someone searches for that exact URL, or does a specific site: query for those parameters, then they could see those pages.
If it's a problem that these pages are findable in the search results, then I would use the URL removal tool for that if you can, or you would need to allow crawling and then use the noindex directive, the robots meta tag, to tell us that you don't want these pages indexed. But again, for the most part, I wouldn't see that as a problem. It's not something where you need to fix that with regards to indexing. It's not that we have a cap on the number of pages that we index for a website. It's essentially that we've seen a link to these, we don't know what is there, but we've indexed that URL should someone search specifically for that.
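As a rough sketch of that second option, the parameter URLs would need to be crawlable (not disallowed in robots.txt) and return a noindex signal, either as a robots meta tag in the HTML or as an X-Robots-Tag response header; how you apply this per URL pattern depends on your CMS or server setup:

    <!-- robots meta tag in the <head> of each parameter URL -->
    <meta name="robots" content="noindex">

    # or, equivalently, sent as an HTTP response header
    X-Robots-Tag: noindex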