One SEO professional asked John Mueller during a hangout about controlling how Google crawls their site via API requests.
Their question: they run a live stream shopping platform, and the site currently spends about 20 percent of its crawl budget on the API subdomain and another 20 percent on image thumbnails of videos.
Neither subdomain has content that is part of their SEO strategy. Should they disallow these subdomains from further crawling?
How are the API endpoints discovered and/or used?
John explained that API endpoints usually end up being used by the JavaScript on a website, and Google renders those pages.
If that JavaScript accesses an API on the site, Google will try to load the content from that API and use it when rendering the page.
Furthermore, depending on how the API and the JavaScript are set up, it may be hard for Google to cache those API results. This means Google may crawl many of these API requests to get a rendered version of your pages so that they can be used for indexing.
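As a rough illustration (the domain and endpoint below are hypothetical), a page script like the following is all it takes for Googlebot to discover and request the API URL while rendering the page:

```typescript
// Hypothetical page script: Googlebot's renderer executes this JavaScript,
// so the fetch below turns into a crawl request to the API subdomain.
async function loadStreamDetails(streamId: string): Promise<void> {
  const response = await fetch(`https://api.example.com/v1/streams/${streamId}`);
  if (!response.ok) return;

  const stream = await response.json();
  const title = document.querySelector("#stream-title");
  if (title) {
    title.textContent = stream.title; // content rendered from the API response
  }
}

loadStreamDetails("12345");
```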
This is usually how the API requests are discovered. You can help by making sure the API results can be cached, for example by not injecting timestamps into the URLs your JavaScript requests from the API.
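For example (a sketch with hypothetical URLs), a cache-busting timestamp forces every render to hit a unique API URL, while a stable URL lets the result be cached and reused:

```typescript
const streamId = "12345";

// Harder to cache: every request gets a unique URL, so Googlebot may
// re-fetch the endpoint each time it renders a page.
const bustedUrl = `https://api.example.com/v1/streams/${streamId}?t=${Date.now()}`;

// Cacheable: the URL is stable, so the rendered result can be reused.
const stableUrl = `https://api.example.com/v1/streams/${streamId}`;
```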
If you don't care about the content returned by these API endpoints, then of course you can block the subdomain from being crawled with robots.txt.
That will block all of those API requests from happening. So first of all, you need to figure out whether the API results are part of the primary content, or critical content that you want Google to index.
If so, then you probably should not block crawling. But if the API generates something that is almost secondary to your pages, or not critical for the pages themselves, then it might be worthwhile to double check what the pages look like when it is blocked.
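If you decide the API responses are not content you need indexed, a robots.txt file served at the root of the API subdomain could look like the sketch below (the subdomain name is hypothetical, and each subdomain needs its own robots.txt):

```
# https://api.example.com/robots.txt
User-agent: *
Disallow: /
```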
John Mueller Hangout Transcript
John (Question)
Okay, next question I have here is from Van Song. Our site is a live stream shopping platform. Our site currently spends about 20% of the crawl budget on the API subdomain, another 20% on image thumbnails of videos. Neither of these subdomains have content which is part of our SEO strategy. Should we disallow these subdomains from crawling? Or how are the API endpoints discovered or used?
John (Answer)
So maybe the last question there first. In many cases, API endpoints end up being used by JavaScript on our website, and we will render your pages. And if they access an API that is on your website, then we’ll try to kind of load the content from that API and use that for rendering of the page. And depending on how your API is set up, and how your JavaScript is set up, it might be that it’s hard for us to cache those API results, which means that maybe we crawl a lot of these API requests to try to get a rendered version of your pages so that we can use those for indexing. So that’s usually the place where this is discovered.
And that’s something you can help by making sure that the API results can also be cached well, that you don’t inject any timestamps into URLs. For example, when you’re using JavaScript for the API, all of those things there. If you don’t care about the content that’s returned with these API endpoints, then, of course, you can block this whole subdomain from being crawled with the robots.txt file. And that will essentially block all of those API requests from happening. So that’s something where you kind of first of all need to figure out are these API results actually, like part of the primary content or important critical content that I want to have indexed by Google?
And if so, then probably you should not block crawling. But if this is something where it’s essentially generating something that is almost secondary to your pages, or anything that’s not critical for your pages themselves, then it might be worthwhile to double check what it looks like when they’re blocked. And one way you could double check this is if you could create a separate test page that doesn’t call the API or that uses a broken URL for the API endpoint. And by that you can see like, how does this page actually render in my browser? How does it render for Google? So those are the things that I would look at there.
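One way to sketch that kind of test (all names and URLs here are hypothetical) is a copy of the page whose script points at a non-existent endpoint and falls back gracefully, so you can compare how the page renders in your browser and in Google's URL Inspection tool with and without the API-driven content:

```typescript
// Hypothetical test flag: point the fetch at a broken endpoint to see
// what the page looks like when the API content never arrives.
const USE_BROKEN_API = true;

const apiBase = USE_BROKEN_API
  ? "https://api.example.com/broken-endpoint" // deliberately invalid
  : "https://api.example.com/v1";

async function loadStreamDetails(streamId: string): Promise<void> {
  try {
    const response = await fetch(`${apiBase}/streams/${streamId}`);
    if (!response.ok) throw new Error(`API returned ${response.status}`);
    const stream = await response.json();
    document.querySelector("#stream-title")!.textContent = stream.title;
  } catch {
    // Fallback: the page still renders, just without the API-driven content.
    document.querySelector("#stream-title")!.textContent = "Live stream";
  }
}

loadStreamDetails("12345");
```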