During the question-and-answer segment of a hangout, an SEO professional asked John Mueller about adding meta tags to pages that are blocked by robots.txt.
Their main question was: should they add a noindex tag to a page even though the page is blocked by robots.txt and already has a canonical tag?
John’s answer: probably not. He explained that if a URL is blocked by robots.txt, Google will not see any of the meta tags on the page.
Google won’t see the rel=canonical tag on the page, because if it’s blocked by robots.txt, then Google won’t crawl that page at all.
If you want Google to take into consideration the rel=canonical or noindex that you add to a page, you have to make sure that Google can crawl the page itself.
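The point can be illustrated with a minimal Python sketch. It is only a model of the behavior John describes, not Google's actual crawler: the robots.txt rules, the example.com URLs, the page markup, and the helper names (`is_crawlable`, `visible_directives`) are all hypothetical, and Python's standard-library robots.txt parser may not match Google's own rule matching in every case.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page markup, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filters/
"""

PAGE_HTML = """\
<html><head>
<meta name="robots" content="noindex">
<link rel="canonical" href="https://example.com/shoes/">
</head><body></body></html>
"""


class DirectiveParser(HTMLParser):
    """Collects the meta robots content and the rel=canonical href."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives["robots"] = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.directives["canonical"] = attrs.get("href")


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def visible_directives(robots_txt, url, html):
    """Return the directives a crawler could read, or {} if the URL is blocked.

    A blocked URL is never fetched, so nothing in its HTML is ever seen.
    """
    if not is_crawlable(robots_txt, url):
        return {}
    parser = DirectiveParser()
    parser.feed(html)
    return parser.directives


blocked = visible_directives(
    ROBOTS_TXT, "https://example.com/filters/?color=red", PAGE_HTML
)
allowed = visible_directives(ROBOTS_TXT, "https://example.com/shoes/", PAGE_HTML)
print(blocked)
print(allowed)
```

In this sketch the blocked filter URL yields no directives at all, even though the markup contains both a noindex and a canonical tag, while the same markup on a crawlable URL yields both.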
The other aspect here is that pages blocked by robots.txt may often still get indexed, but they’re indexed without any of their content, because Google can’t crawl them.
And usually this means that these pages don’t show up in the search results anyway. If someone is searching for some kind of product that you sell on your site, Google is not going to dig around to see whether a page blocked by robots.txt might also be relevant, because it already has really good pages from the website that it can crawl, index normally, and show.
On the other hand, if you do a site: query for that specific URL, you may still see that URL in the search results, but without any content.
This happens at approximately the 44:23 mark in the video.
John Mueller Hangout Transcript
John (Submitted Question) 44:23
We are having a problem with e-commerce assets, filters that are getting indexed, even though they’re blocked by robots.txt and have a canonical tag that points…Is there a point in adding noindex tags too?
John (Answer) 44:38
So probably not. The short answer, I guess, is if the URL is blocked by robots.txt, then we don’t see any of those meta tags on the page. We don’t see the rel=canonical on the page because we don’t crawl that page at all. So if you want us to kind of take into account the rel=canonical or take into account a noindex that you put on a page, you need to make sure that we can crawl the actual page itself.
The other aspect here is that oftentimes these pages may get indexed, if they’re blocked by robots.txt, but they’re indexed without any of the content, because we can’t crawl it. And usually, that means that these pages don’t show up in the search results anyway. So if someone is searching for some kind of product that you sell on your website, then we’re not going to kind of dig and see if there’s also a page that is blocked by robots.txt, which would be relevant, because we already have really good pages from your website that we can crawl and index normally and that we can show.
On the other hand, if you do a site query for that specific URL, then maybe you’ll still see that URL in the search results without any content. So a lot of times, what I noticed is that this is more of a theoretical problem than a practical problem. And that theoretically, these URLs can get indexed without content, but in practice, they’re not going to cause any problems in search. And if you do see them showing up for practical queries on your website, then most of the time, that’s more a sign that well, the rest of your website is really hard to understand.
So if someone searches for one of your product types, and we show one of these roboted kind of category or asset pages, then from my point of view, that would be a sign that well, actually, the visible content on your website is not sufficient for us to understand that the normal pages that you could have indexed are actually relevant here.
So that would be kind of my first step there is to try to figure out, do normal users see these pages when they search normally? And if they don’t see them, then that’s fine. You can just ignore them. If they do see these pages when they search normally, then that’s a sign that maybe you should be focusing on other things on the rest of your website.