In this episode of the Google Search Off The Record Podcast, Googlers John Mueller, Gary Illyes, and Lizzi Sassman discuss how one can block content from Google Search.
They talk about the following:
- Using robots.txt,
- The standard meta tags,
- How to block your site from Google in different scenarios,
- Considerations about your CMS,
- Hosting platform considerations,
- What types of pages you would want to have indexed,
- When you would use the noarchive meta tag,
- How to use nosnippet,
- How to use noindex/nofollow,
- How to use max-image-preview
Listen To the Podcast:
Search Off The Record Podcast Transcript
Hello, hello, and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team, discussing all things search and maybe having some fun along the way. My name is Lizzi. And I’m joined today by some other folks on the Google Search team, Gary and John. Hi, Gary. No. Hi, John.
Hi, Lizzi. Great to be here.
Thank you. Wow, what a welcome. So today, I thought that we could talk about blocking content from search. Because you know what, I published this website, you may have heard about it, maybe last episode, matchaguy.com. I commandeered it from another person here. And there’s…
Hey, wait wait, wait, what? That’s my website.
No, it’s my website.
No, I’m most certain. That’s my website.
I’m pretty sure I’m the matcha guy. Yes. Because I–you know what, I searched for this recipe that I published. And it’s showing up in search. And I actually don’t want other people to see it. And I was wondering what I should do.
What, why did you publish it?
Well, I wanted some people to see it. Like maybe just the super followers, like a small set of people because it’s the secret recipe and I don’t want the word to get out broadly. Maybe just 20 people should see it.
What, but I am your super follower. Well, I am super.
Oh, but you’re not my follower.
I like soup.
This is going well.
Okay, okay. Okay. So have you considered removing the page?
Well, I don’t want to remove it. It’s an important recipe about how to make a really delicious cup of matcha. So I don’t really want to delete it altogether. Is there something else I could do?
Pig Latin, write it in Pig Latin. Write the recipe in Pig Latin and so…?
Write the recipe in Pig Latin? But my super followers don’t know Pig Latin, sadly.
Wait, how do you know that?
Oh, because I ran a survey.
You have interesting surveys. How come I didn’t get that survey?
Well, maybe you’re not my super follower.
Well, we already established that. I’m super.
That’s true. But not my follower, necessarily.
Oh, that. Yeah, that’s right. I forgot about that part.
That’s complicated, man. Right? I don’t know.
How about how, wait, wait, wait, how about if they are so super your followers? I mean, maybe they can remember a password. And then you can just put the whole thing behind a password.
That does sound like a good idea. But what if they can’t remember the password? Or I don’t really have that page, like a login page kind of setup. Is there another option that I could consider?
Passwords are super annoying, like someone could be phishing the super follower and then secretly look at your matcha recipe.
Also, I’m okay with, like, if my super followers send it to their family, that would be okay. So they should be able to share the link without having their family members log in. So maybe there’s a way that it’s not password protected, but it’s not showing up in search. I don’t want just anyone.
Wait, wait, wait, wait, I had an idea. Okay. How about robots.txt? What’s that? We had an episode about that. Oh, you’re right…
I was on that episode. I don’t know why I asked that question.
Jeez. Okay. So tell us more. Gary, what do you, do you just like robots.txt the whole site, and then that page is gone?
What are you two talking about, what’s happening? Like one is writing documentation. The other one was writing documentation about robots.txt. And now you’re asking me these weird questions. I don’t like to be here. I want to be gone.
Wow, we’re lucky that you’re here.
Well, so what do you do with robots.txt, you limit what crawlers can do on your website, right?
So you could put something like, hello, Googlebot, please don’t crawl this URL. And then Googlebot will not crawl that URL. So you could definitely do that in your robots.txt file, like disallow, colon, and then the path of the URL. Like the path of the URL where you published that recipe that you stole from me. Yeah, it’s a true story. It was 1897. And I made this recipe, this matcha latte.
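For reference, the disallow rule Gary describes would look something like this in a robots.txt file at the site root. The path here is a made-up placeholder, not a real URL on the site:

```
# Block Googlebot from crawling one page (placeholder path)
User-agent: Googlebot
Disallow: /recipes/secret-matcha-latte

# Or apply the same rule to all well-behaved crawlers
User-agent: *
Disallow: /recipes/secret-matcha-latte
```

Note that, as the conversation goes on to explain, this limits crawling, not indexing.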
Maybe that’s why I don’t want it found in search. I didn’t want you to find out that I stole your recipe, right?
Yes. And you just admitted it. Anyway, so robots.txt.
Okay. Okay, so I would just upload robots.txt to my site and disallow the crawling for this webpage for the recipe on my site, and that would be perfectly fine. Is there any scenario where I would not want to do this, like am I missing something?
So I mean, with robots.txt you limit crawling, you don’t limit indexing. But in the very vast majority of the cases that will suffice. But then if your recipe became very popular, for example, and many people linked to it, then we might still index the URL, but not the content of the page. And in those cases, that recipe might still show up in search without description, for example, like without the web snippet. Google would infer a title link from the anchor texts that are pointing to your page, for example.
I see. Okay, so if my page became super popular, the result will still show up. Like it doesn’t mean that it’s completely blocked from Google Search. If I disallow this recipe in my robots.txt, it’s not a failsafe.
It’s not a complete failsafe, but again, in the very vast majority of the cases, it will just work.
Okay. So if I wanted a failsafe method, is there something else that I could consider?
Robots meta tags? Yeah.
Yeah. Is there a meta tag that I could put on my page?
None, actually. I’m so sorry.
Wow, that was so nerdy. Oh, my God. Oh, that was great.
So essentially, you could use the noindex robots meta tag, but there is also the none robots meta tag, which is basically noindex plus nofollow. So…
Oh, so you can use none. And that is the same as doing noindex. And nofollow. Interesting. Okay. So why would I want to do both noindex and nofollow?
I don’t know. But like that’s just the abbreviation they came up with in the beginning, because at some point, they must have thought that HTML is such a concise markup language that you have to save every character. So instead of writing out noindex and nofollow, you could just write none. And it would be the same thing.
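In markup, the two forms Gary mentions are interchangeable; a minimal sketch:

```html
<!-- These two robots meta tags mean the same thing: -->
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="none">
```

Either one goes in the page’s head section.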
Interesting. Why wouldn’t they do something like ninth? Ninth ninth?
I don’t know. Maybe we should ask Larry or Sergey, or, like whoever made that up?
I’ve…probably Jeff, Jeff Dean, early engineer at Google.
I mean, it’s kind of weird, because HTML pages are just so messy and big overall, that saving a couple of characters, what is that going to change? But anyway.
Do people still use the none tag today?
Probably? I have no idea. Do you have any numbers, Gary? More than seven?
I mean, I do have numbers. Seven is one of them. But I doubt that most of them would be related to this question.
Okay. So let me pose another question for you. Let’s say I have a lot of recipes on matchaguy.com. And I want to make sure that they’re all noindex. Is there a way to do this, like at scale, like a generator or something?
Okay, first and foremost. Have you considered that you’re publishing wrong?
Potentially, but this is a very real scenario for the podcast.
So most of your pages, you don’t want them shown in search? Or…
I guess this is like a bad example. You think yes.
Or maybe just not? So all right. We know our–
Alright, you told me that you wanted me to ask you about generators? In what case would people want to generate meta tags for them? Oh. I guess.
Okay. Yeah. So especially in the early days, when people wrote HTML themselves, in many cases you wouldn’t really know what meta tags to use. It’s something like, these weird search engines didn’t have fantastic documentation at the time. And various blogs were interpreting the possible meta tags that were out there. And everyone was kind of copying things from other people. There used to be, probably still are, these sites that would generate meta tags for you. So basically, you go there, you enter the keywords that are on your page, and then it would create a set of meta tags that you can just copy and paste onto your pages. So even if you didn’t really know what all of these meta tags did, you could put a bunch of them on your pages. And I think the weird part is that probably most of the meta tags mentioned there could be technically correct, but they don’t have any functionality. So things like, you could mention the location of your web server in a meta tag, or you could mention the name of the author of the page in a meta tag, but there’s like nothing useful that comes out of it. So you would go to these sites and the meta tag generators would be like, oh, yeah, I have 27 meta tags I can generate for you. And of course, that would be better than the meta tag generator that just generated like five meta tags. So you copy those 27 meta tags to your pages, and hope that they don’t cause any problems, because you don’t really know what they’re doing.
Yeah, I do think that in those cases where you want a large chunk of your site not getting indexed, you really want to go down the route of password protecting those directories, for example, or organizing your site into a more logical structure where you can put stuff that you want noindexed in a specific directory. And then you can use, for example, Apache modules or nginx modules. I don’t think they are modules, I don’t know what they are called. To craft these configurations that will apply noindex to every single URL under a pattern, or under a prefix, like a URL prefix. That is very technical, though, like way more technical than HTML or robots.txt, both of which are, well, especially robots.txt, which is extremely simple syntax. But in the vast majority of the cases, you probably can get help from your host, like contacting support and just asking them how to do that. Plus, if you’re using a hosting platform that uses server manager software like cPanel, then you can set these settings there.
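What Gary is gesturing at is usually done by sending an X-Robots-Tag HTTP header from the web server configuration. A sketch for Apache with mod_headers enabled; the /members/ prefix is a placeholder:

```apache
# In the server or virtual host config (requires mod_headers):
# apply noindex to every URL under the /members/ prefix.
<LocationMatch "^/members/">
  Header set X-Robots-Tag "noindex"
</LocationMatch>
```

The nginx equivalent would be an `add_header X-Robots-Tag "noindex";` line inside a matching `location` block.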
That sounds very complicated. It’s like… It does. You just do all of this, I don’t know, complicated physics, which is quite easy, actually.
Compared to quantum mechanics, for example, it’s super simple.
Well, for you, maybe. Alright, everyone, if you have questions about blocking half of your website, just contact Gary.
Yeah, I’m on Twitter at methode. I guess.
For many sites nowadays, you would use your CMS, like WordPress, or whatever hosting platform that you have. And my guess is you just have an option there to password protect individual pages. Maybe that would be enough.
And then CMS is making sure that those login pages aren’t showing up in search? Or how would that work? Do you still need to worry about the actual login page not showing up?
I haven’t seen any problems from that. So my guess is it probably just works. But no idea. I mean, one thing you can do to double check is open it in an incognito window, and then look at the source code. And it’s not 100% exactly what Googlebot might see in some cases, but at least you can double check to see, is there a robots meta tag there or not? And does it say noindex or none?
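If you’d rather not eyeball the source by hand, a small script can scan a saved page for robots meta tags. This is a sketch using only the Python standard library; the sample markup is invented for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag it sees."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                self.directives.append(attrs.get("content", "").lower())

def robots_directives(html_source):
    """Return the robots meta directives found in an HTML string."""
    finder = RobotsMetaFinder()
    finder.feed(html_source)
    return finder.directives

# Sample page source, e.g. copied from the browser's view-source window
sample = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
print(robots_directives(sample))  # ['noindex, nofollow']
```

As John notes, what you fetch in a browser isn’t always byte-for-byte what Googlebot sees, so treat this as a quick sanity check rather than a guarantee.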
So on the login page topic, sometimes you actually want a login page indexed. Because, for example, if you have an online banking portal, and you have a page where clients can log in, it’s very likely that people will actually look for that page, like how to log into my ebanking or whatever. And in those cases, you definitely don’t want the login page noindexed. You want it to be indexed, so people can find the login page. I remember a few years ago, a local bank, or the bank that I use, for whatever reason, they decided that they want to put a noindex on their login page. And then you had to jump through these hoops to get to the login page, like going to the homepage, and then there click another link that takes you to the ebanking portal. And then from the ebanking portal, you click another thing that will bring up the login modal dialog. And that felt just wrong. Because really, I just want to log into my banking account. And I don’t necessarily want to see half of your website to do that. So in those cases, you probably want to allow the login page to be indexed. Also, if you have any of those, what are they called, followers, then they might want the same thing, basically, how or where can I log in, so I can go to my member profile or whatever.
So I guess kind of if most of your content is behind a login, then something should at least be indexed. Like it would be awkward if everything for matchaguy were noindexed, because then people wouldn’t be able to find it at all in search.
Well, I do have another scenario that may be more relevant, I guess. We have on one Z, our Google Search Central site, that’s the name that we call that site, one Z. We’ve got a lot of blog posts there, they go back to like 2007. And I thought maybe it would be a good idea to noindex some of the pages, like selectively, if I thought that, hey, this page is outdated, we don’t really want people to be finding that in search when they look up, say, duplicate content, or something like “what does Google Search say about duplicate content?” And we might have 15 pages about this topic. Would it be a good idea to selectively noindex those really old pages, that maybe should be able to be found for historical purposes, but are not necessarily the most up to date answer on a certain topic?
I think this brings back nightmares. I kind of recall our discussion about this, a very long one, way too long one. So I never want… no, that’s too harsh. I rarely want content out of search. Like, if we had a blog, for example, and in 2001, we published good content there, then I would expect that to be around pretty much forever. And this comes in handy when, like for this podcast, for example, we do quite a bit of internet archaeology, where we are trying to find, like, when was the first mention of X tool or whatever. And for that purpose, I think it’s excellent. What we would want to do on one Z with the blog is using rel canonical, to basically create a cluster and point to a canonical page, like managing duplicate content or whatever, from the blog posts that are not that relevant anymore, perhaps. Even in those cases, it’s sometimes tricky, because with a blog post, we can be way more verbose and way more chatty than with search documentation. And sometimes, that also means that we can give more examples, we can be less corporate, even. Especially in the early days, we could be way less corporate, which means that some people might understand things better or easier. And for that purpose, I would really want things to be in the index, but maybe point with a banner to the canonical page.
So we could leverage the synergies of the archive content.
Spoken like a true manager!
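The clustering Gary describes would be a link element on each of the older, superseded posts pointing at the preferred page; the URL below is a placeholder:

```html
<!-- In the <head> of an outdated blog post, pointing at the current canonical page -->
<link rel="canonical" href="https://example.com/blog/managing-duplicate-content">
```

Unlike noindex, rel=canonical is a hint rather than a directive: it consolidates signals onto the preferred page while the old post stays crawlable.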
Archive, that’s another word that’s in the meta tag documentation, noarchive. When would I use that meta tag?
Yeah. John, when? Wow, archive?
Yeah, I don’t really know where the name came from specifically. Because it doesn’t really refer to the archive itself or an archiving process, it’s more that we wouldn’t show a cached page in the search results. And I imagine the other search engines also called it something similar, that’s like the cached page that they would show with that little cached link in the search results sometimes. And with the noarchive meta tag, you basically block that from appearing. Because search engines, in order to actually index the content, they kind of have to keep an archive of the page internally, it’s just they don’t show it externally in cases like that. So it’s almost like, I don’t know, kind of like a snippet control meta tag, you block a specific part of the snippet from appearing.
Interesting, would this be for pages that are super fresh, like we only want the most up to date version of this page, like the archive is like not going to be helpful to people? So only the fresh one? Or is that not right?
I think it’s just for whatever people want to do, whether it’s like, if they don’t want to have a cached page shown, like for whatever reason, they can kind of block that. I’ve seen cases of that happening when they just really, really want people to go to the website itself for whatever reason, or when perhaps the content itself is behind a paywall or login wall or something like that, where traditionally, users would have to go through kind of the website login process to actually get the content. And they just want to make sure that people actually follow through on that and don’t have kind of this workaround by looking at the cached page.
You mentioned, also nosnippet, which would be, I guess, another way of blocking content when you might not necessarily want to block the entire page. But maybe you want to have some other level of fine tune control.
Yeah, I guess, like in a case with a cooking website or recipe website, you could, for example, have something on your page which is really unique to a recipe that you just don’t want to tell people ahead of time. It’s like you have this super secret ingredient that you put into matcha, which might be, I don’t know, ginger, or pepper or something. I don’t know what secret ingredients people put into matcha. But you could have something like that. And then you could still have, kind of like, my secret matcha recipe as a title of a page. And within the text of the page, you could mention the secret ingredients. And if you block a snippet from being shown, then you can kind of prevent Google from showing part of the page as a snippet in the search results, which might help to keep your secret under the covers until people click through to your website. I don’t know, it depends on what you want to do with it.
And then you have more control with snippets as well, because you have the, I forgot what it’s called, but like how many characters you want. The max-snippet. Also for the image, I think there’s something, max-image something.
Yeah, max-image-preview or something, like how large the image can be in the preview. That’s sort of interesting, because it’s not like it’s blocking Google from indexing that image. It’s more like blocking Google from serving that size of the image or something.
It’s, I think it’s more something in Google Discover where you would see that, where sometimes you have entries that have this giant image on top. And sometimes you have the smaller thumbnails. And if, for whatever reason, you’re like, I don’t want to have a big image shown because people should go to my page to see my big images, then that’s, that’s something you can set there.
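Taken together, the snippet controls from this stretch of the conversation can be combined in a single robots meta tag; the values here are just illustrative:

```html
<!-- Block the cached copy and the text snippet entirely -->
<meta name="robots" content="noarchive, nosnippet">

<!-- Or limit rather than block: at most 50 snippet characters,
     and only standard-size image previews -->
<meta name="robots" content="max-snippet:50, max-image-preview:standard">
```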
What else do we have? We have nofollow? Of course, because reasons.
What does nofollow do? Right, where would you use that on matchaguy, Gary?
Yeah. Tell us about the links that you bought. What? I know, I saw, I saw the exchange, there was this shady person, and you were exchanging currency with that person. And I know that they gave you links in exchange for that currency.
Tell us about it.
So then I would nofollow that person. It’s like instead of a super follow, it would nofollow the person.
Oh, will follow.
Oh, yeah, yeah.
Like you can sign up.
Obliterate follow. No way.
So nofollow, just for the links that go outside my site?
It’s for all links on a page. Like, I guess the tricky part is the nofollow robots meta tag would be for everything on the page. And on an individual level, you could nofollow individual links.
Okay. But what if I like linked to other recipes or something like within my matcha recipe?
That’s fine, that’s the way the web is. Like you link through other recipes to other people’s sites. I think the problem would just be if someone came to you and said, I would like to buy an ad on your website, and you link to their website in exchange, as an ad. Then you would add a nofollow there, and that’s kind of that shady situation that Gary was hinting at.
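John’s page-wide versus per-link distinction looks like this in markup; the advertiser URL is made up:

```html
<!-- Page-wide: don't follow any link on this page -->
<meta name="robots" content="nofollow">

<!-- Per-link: qualify just the one paid link -->
<a href="https://example.com/advertiser" rel="nofollow">Visit our sponsor</a>
```

For paid placements specifically, rel="sponsored" is the more precise link qualifier these days.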
That I would never take part in.
Yes, I was not hinting.
He was accusing. Outright, accusing!
I was just alluding to it.
Okay, so you’re selling links? So it’s…
Okay, wait, wait, wait, wait, wait, wait, wait, wait. So can I combine noindex with nofollow? And then add a nosnippet as well? And a noarchive? And max-image-preview?
And max-snippet size? Seven?
Of course you can. But I mean, you’re not going to see anything. So…
Can I do nofollow, max-image-preview, noarchive, and nosnippet?
Probably. I mean, you can do whatever you want on your last day.
This sounds like a quiz.
You can do whatever you want on your website. It’s just like the question like what search engines do?
Wait. Now I know why you scheduled that one-on-one with me. I’m nervous. I will shut up. How many combinations do you think we can have, John?
All robots meta tags? I would say more than seven.
More than 100?
I know the answer.
You know the answer. This is a trick question.
I know that I know the answer. Because John forgot to delete that line from our planning doc.
He also put the answer like directly in text as well. You didn’t even have to click the link to see the answer John…
Did I put the answer in?
And basically, it’s burned into my retina. 869 plus one plus one.
Okay. Combinations of robots meta tags. So, like, if you wanted to use all robots meta tag combinations on a website, you would have to make at least 870 pages. Okay. What do you have planned for the weekend?
Oh, I have access to GPT-3. So it takes literally minutes.
Literally minutes to create so many meta tags. Cool.
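Out of curiosity, the counting itself is easy to sketch. The list below is a small illustrative subset of directives, not the exact set behind the episode’s figure of roughly 870 combinations:

```python
from itertools import combinations

# Illustrative subset of robots meta directives (not the full set
# the ~870 figure in the episode was computed from).
directives = ["noindex", "nofollow", "noarchive", "nosnippet", "max-image-preview:none"]

def all_combinations(tags):
    """Every non-empty combination, rendered as a meta tag content string."""
    result = []
    for r in range(1, len(tags) + 1):
        for combo in combinations(tags, r):
            result.append(", ".join(combo))
    return result

combos = all_combinations(directives)
print(len(combos))  # 2**5 - 1 = 31 for five directives
```

With n directives there are 2^n − 1 non-empty combinations, so the count grows quickly as directives are added.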
But why? What if we create new meta tags? Or what if there are new ones that are created? Is this in the future?
Yeah, John, tell us about the future.
Will there be new meta tags being added?
I think that’s always tricky, because then we would have to have more combinations. And then Gary would have to create more pages for his test site.
No but that’s easy. We covered that. A few more minutes? Yes.
I think so. At least in the past what I’ve seen when talking with the leads at Google search, is they don’t really like to have new meta tags, because there’s just so much overhead with everything around meta tags. So as soon as we add something new, we kind of have to promise to support it for a reasonable amount of time. We have to do all the documentation, we have to do all the implementations internally. And if it’s something that is tied to a specific feature, where we don’t know how long that feature will be around, then it’s very much… there’s so many dependencies there that we kind of say, well, we prefer not to have new robots meta tags.
Hey, John, John John. Yes. Do you remember rel=author?
Rel=author? Well, that’s not a robots meta tag. What about rel=next? In general, all of these control mechanisms are super useful when they’re relevant and when they remain valid for a longer period of time. But as soon as you can’t guarantee that, or if you do something that is tied to a very, very small feature set, then it’s very complicated to justify kind of all of the effort of doing the work on our side, the documentation, explaining it to everyone externally. Everyone externally then going off and saying, like, oh, yeah, maybe we should implement this. And they plan the implementation for months and months. And then if after a couple of years, we come back and say, well, actually, we turned that feature off, we can turn the robots meta tag off now too, then there’s just so much time and effort invested into doing something that ends up having no long term value. And that’s kind of something we try to avoid, we really want to help people to make something for the long run. And obviously, nothing is forever. So it’s not like we can guarantee that things will stick around forever. But it should be kind of self sustaining and valid for a longer period of time.
Okay, that makes sense. So we might, but like, maybe not, doors open, if the use case proves that it will be useful for a long period of time.
Yeah, I guess we try to avoid it. But it’s not that we will guarantee that we’ll never make new ones, because I’m sure there will be new ones. Kind of like the recent one, which I forgot, like indexifembedded, I think, where it’s kind of a special use case. But it’s a very kind of important use case and one that we’ve seen a lot. So it’s reasonable to do something for that.
Yeah, in general, we, as in the Search Relations team, try to push back on new meta tags, but every now and then we get surprises like indexifembedded, because they made sense for something very specific, but important. Also, we don’t see, or we are not going to see, a new way to implement videos, for example. In that specific case, in the case of indexifembedded, there are at least two ways to provide videos. But when it comes to embedding, then it’s always an iframe. And we don’t see that changing. Like it’s been around since forever. There’s no new way to do that. So basically, we just had to come up with something that made sense for that particular purpose.
So what about other file types? Like a PDF? What if we put all the recipes into a PDF? Is there a way to block those?
I mean, you have the robots header, the X-Robots-Tag HTTP header.
Oh, yeah. Okay.
But what if I upload the PDF, like on my CMS site, and I don’t have access to that? Tough. Oh.
You can use the removal tool. But really, in those cases, if you don’t have access to the headers, like you can’t change the headers, then you probably just don’t want to upload that thing. The PDF.
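Since a PDF has no head section for a meta tag, the noindex travels in the HTTP response instead. If you do control the server, an Apache sketch (mod_headers) might look like:

```apache
# Send X-Robots-Tag: noindex with every PDF response
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

The response for the PDF then carries an `X-Robots-Tag: noindex` header, which Google treats like a robots meta tag.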
I guess, robots.txt would also work. But it’s, then you have the difficulty that perhaps if someone linked to it, it could be indexed as a URL.
I mean, robots.txt definitely works for images, for example, or video.
Why does it work there?
Would it work for PDF though? So it would work for an image or video, but not necessarily a PDF?
Right. So we index video and images quite differently than web content. For example, even in the case of PDF, the first step that we take when we try to index it is to convert it to HTML. And then from there on we treat it as HTML. So in those cases, we index it for web. Like for the web tab. It’s not called web tab anymore, but All tab or whatever. While images and videos are indexed for a different tab, or mode, or whatever you want to call it. Maybe we should standardize the names for this. And yeah, it’s just completely different, like different content types.
So why would robots.txt work for images?
Good question. Maybe it shouldn’t.
Is it because in the image search, we would show the image? Like what would we show with a robotted image?
Oh, yeah, exactly. Exactly. Yeah.
John, coming to save the day.
Yeah, we wouldn’t have anything to show as a snippet. Well, virtual snippet. Pseudo snippet.
I think we could find the end to this episode, maybe right now.
And that’s it for this episode. Next time on Search Off the Record we’ll be continuing our In the Spotlight series, where we talk with someone who inspires us in the SEO community: Barry Schwartz. We’ve been having fun with this podcast, and I hope you, the listener, have found it both entertaining and insightful, too. Feel free to drop us a note on Twitter at GoogleSearchC, or chat with us at one of the next events we go to if you have any thoughts. And of course don’t forget to like and subscribe. Thank you and goodbye.