One SEO professional asked John Mueller during a hangout about indexing images, and robots.txt requirements for doing so.
This SEO pro runs a recipe website with hundreds of thousands of indexed recipes that appear as rich recipe results.
The site received a lot of traffic from the recipe gallery until, over a period of time, that traffic stopped.
All of the metadata checked out. Google Search Console also said that, “Yep. This is all rich recipe content. It’s all good. It can be shown.”
They finally noticed, when previewing a result, that the image was missing. And it appears to be due to a change at Google.
The change appeared to be that a robots.txt file was now effectively required for images to be retrieved, yet none of Google's tools reported anything as invalid.
And so, it’s a bit awkward, right?
When you ask the tool, "Is this a valid recipe result?" it answers "Yeah," and everything looks absolutely fine.
However, it turns out that behind the scenes, there was a new requirement that you have to have a robots.txt file.
John asked the SEO pro what they meant by having to have a robots.txt file.
The SEO pro explained that requesting robots.txt from their CDN returned a 500 error.
As soon as they placed a robots.txt file on the CDN, the previews started appearing correctly. Operationally, adding the robots.txt file is what repaired the indexing issue.
John explained that from Google’s perspective, it’s not that a robots.txt file is required, but that the robots.txt URL has to return the proper result code.
If you don’t have one, it should return a 404 error.
If you do have one, then Google can obviously read it.
But if you return a server error for the robots.txt file, then Google's systems assume there may be an issue with the server, and they won’t crawl.
That behavior has been in place since the beginning.
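The status-code rules John describes can be sketched as a simple check. This is a hedged illustration, not Google's actual logic; the helper names and the mapping of status codes to crawl behavior simply restate the rules above.

```python
import urllib.request
import urllib.error


def robots_status(host):
    """Return the HTTP status code for https://<host>/robots.txt.

    Hypothetical helper for auditing a site and its CDN hostname.
    """
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code


def crawl_verdict(status):
    # Restates the behavior described above:
    # 200 -> Google reads the file and follows its rules
    # 404 -> treated as "no robots.txt"; crawling proceeds
    # 5xx -> Google assumes a server problem and holds off crawling
    if status == 200:
        return "robots.txt read; crawling per its rules"
    if status == 404:
        return "no robots.txt; crawling allowed"
    if 500 <= status < 600:
        return "server error; Google stops crawling this host"
    return "other status; behavior may vary"
```

Running `crawl_verdict(robots_status("cdn.example.com"))` against each hostname that serves your images would have surfaced the 500 error the SEO pro eventually found by hand.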
However, these types of issues are really hard to spot, especially when the images are served from a CDN on a separate hostname.
And John imagines the rich results test, as far as he knows, focuses on the content that’s on the HTML page.
So while it validates the JSON-LD markup, it probably doesn’t check whether the images are actually fetchable from the server. If they can’t be fetched, then of course Google cannot use them in the carousel.
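To make the gap concrete, here is a minimal sketch of the step a validator could add: pull the image URLs out of a Recipe's JSON-LD and flag them for a fetchability check. The markup below is an invented example, not the site's actual data.

```python
import json

# Hypothetical Recipe JSON-LD, following the schema.org "image" field,
# which may be a single URL string or a list of URLs.
jsonld = """
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Example Soup",
  "image": ["https://cdn.example.com/images/soup.jpg"]
}
"""


def image_urls(markup):
    """Extract the image URL(s) from a Recipe JSON-LD block."""
    data = json.loads(markup)
    images = data.get("image", [])
    if isinstance(images, str):
        images = [images]
    return images


for url in image_urls(jsonld):
    # A validator could now issue a request for each URL (and for
    # robots.txt on that host) to confirm the image is fetchable --
    # the step the rich results test apparently skips.
    print(url)
```

The rich results test validates that the markup itself is well-formed; actually fetching each image URL, including robots.txt on the image host, is the missing check that would have caught this CDN issue.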
So this is something Google may need to handle better in its tooling.
This happens at approximately the 51:45 mark in the video.