Over on Twitter, an SEO professional asked John Mueller about predictive crawling, and whether or not Google uses it for crawling and indexing.
They asked whether Google uses predictive crawling in Google web search, adding that they have been hearing a lot about it since the problem with indexing of some pages.
They also asked whether it is currently used in production by Google to predict the quality of a page or site.
John replied that he doesn’t know what they would consider predictive crawling, but that Google does have documentation on crawling that may overlap with what they are looking for.
In short, the document doesn’t mention predictive crawling, but this is how it describes Google’s crawling of a site:
General theory of crawling
The web is a nearly infinite space, exceeding Google’s ability to explore and index every available URL. As a result, there are limits to how much time Googlebot can spend crawling any single site. The amount of time and resources that Google devotes to crawling a site is commonly called the site’s crawl budget. Note that not everything crawled on your site will necessarily be indexed; each page must be evaluated, consolidated, and assessed to determine whether it will be indexed after it has been crawled.
Crawl budget is determined by two main elements: crawl capacity limit and crawl demand.
Crawl capacity limit
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
The crawl capacity limit can go up and down based on a few factors:
- Crawl health: If the site responds quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
- Limit set by site owner in Search Console: Website owners can optionally reduce Googlebot’s crawling of their site. Note that setting higher limits won’t automatically increase crawling.
- Google’s crawling limits: Google has a lot of machines, but not infinite machines. We still need to make choices with the resources that we have.
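The factors above can be pictured as a simple feedback loop. The following is only a minimal sketch of that idea, not Google's implementation: the class name `CrawlCapacity`, the 0.5-second latency threshold, and the step sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlCapacity:
    """Hypothetical model of a per-site crawl capacity limit."""
    connections: int = 2              # simultaneous parallel connections
    delay_seconds: float = 1.0        # pause between fetches
    max_connections: int = 10         # stand-in for the crawler's own resource ceiling
    owner_cap: Optional[int] = None   # optional lower limit set by the site owner

    def record_response(self, status: int, latency_seconds: float) -> None:
        """Adjust the limit after one fetch, based on crawl health."""
        # Assumed definition of "healthy": no server error and a quick response.
        healthy = status < 500 and latency_seconds < 0.5
        if healthy:
            # Site is responding well: allow more connections, shorter delay.
            self.connections += 1
            self.delay_seconds = max(0.1, self.delay_seconds * 0.9)
        else:
            # Slow response or server error: back off.
            self.connections = max(1, self.connections // 2)
            self.delay_seconds = min(60.0, self.delay_seconds * 2)

        # An owner-set limit can only lower the ceiling, never raise it.
        ceiling = self.max_connections
        if self.owner_cap is not None:
            ceiling = min(ceiling, self.owner_cap)
        self.connections = min(self.connections, ceiling)
```

In this sketch, a run of fast responses slowly raises the connection count, a single server error halves it, and an owner-set cap only ever reduces the ceiling, which mirrors the note that setting higher limits won't automatically increase crawling.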
Crawl demand
Google typically spends as much time as necessary crawling a site, given its size, update frequency, page quality, and relevance, compared to other sites.
The factors that play a significant role in determining crawl demand are:
- Perceived inventory: Without guidance from you, Googlebot will try to crawl all or most of the URLs that it knows about on your site. If many of these URLs are duplicates, or you don’t want them crawled for some other reason (removed, unimportant, and so on), this wastes a lot of Google crawling time on your site. This is the factor that you can positively control the most.
- Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
- Staleness: Our systems want to recrawl documents frequently enough to pick up any changes.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
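One way to picture how these demand factors could combine is a scored crawl queue. This is a sketch under stated assumptions, not a description of Google's systems: the `KnownUrl` fields, the 0-to-1 popularity scale, and the equal weighting of popularity and staleness are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KnownUrl:
    """Hypothetical record for one URL the crawler knows about."""
    url: str
    popularity: float            # assumed signal, scaled 0..1
    days_since_crawl: float
    expected_change_days: float  # how often the page tends to change
    wanted: bool = True          # False for duplicates, removed or unimportant URLs


def demand_score(page: KnownUrl) -> float:
    """Combine popularity and staleness into one illustrative score."""
    # Staleness approaches 1.0 as the page becomes overdue for a recrawl.
    staleness = min(page.days_since_crawl / page.expected_change_days, 1.0)
    return 0.5 * page.popularity + 0.5 * staleness


def crawl_queue(pages: list[KnownUrl]) -> list[KnownUrl]:
    """Drop unwanted inventory, then order the rest by demand, highest first."""
    wanted = [p for p in pages if p.wanted]
    return sorted(wanted, key=demand_score, reverse=True)


pages = [
    KnownUrl("/old-post", popularity=0.1, days_since_crawl=2, expected_change_days=30),
    KnownUrl("/home", popularity=0.9, days_since_crawl=1, expected_change_days=1),
    KnownUrl("/home?sessionid=1", popularity=0.9, days_since_crawl=10,
             expected_change_days=1, wanted=False),
]
print([p.url for p in crawl_queue(pages)])  # ['/home', '/old-post']
```

The duplicate URL never enters the queue, which reflects why perceived inventory is described as the factor site owners can control the most.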
In sum
Taking crawl capacity and crawl demand together, Google defines a site’s crawl budget as the set of URLs that Googlebot can and wants to crawl. Even if the crawl capacity limit isn’t reached, if crawl demand is low, Googlebot will crawl your site less.
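The "can and wants to crawl" definition can be sketched as taking the demand-ordered list of URLs and cutting it off at the capacity limit. The function name `crawl_budget` and the idea of measuring capacity as a number of fetches are assumptions made for illustration only.

```python
def crawl_budget(wanted_urls: list[str], capacity: int) -> list[str]:
    """Return the URLs a crawler both wants to fetch and has capacity to fetch."""
    return wanted_urls[:max(capacity, 0)]


# High demand, limited capacity: capacity is the bottleneck.
busy_site = crawl_budget([f"/page-{i}" for i in range(100)], capacity=20)

# Low demand, spare capacity: demand is the bottleneck.
quiet_site = crawl_budget(["/home", "/about"], capacity=20)

print(len(busy_site), len(quiet_site))  # 20 2
```

The second case shows the point made above: even with unused capacity, a site with low crawl demand is crawled less.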
I don't know what you'd consider predictive crawling, but we do have this doc on some of the parts of crawling: https://t.co/32rirc1JJi — maybe it overlaps with what you're looking for?
— 🐝 johnmu.xml (personal) 🐝 (@JohnMu) June 1, 2022
We don't crawl all of the web, so I'm guessing / hoping that over the course of 20+ years, folks have worked on finding ways to focus the crawling on things that matter. The doc on "crawl budget" is essentially the same topic.
— 🐝 johnmu.xml (personal) 🐝 (@JohnMu) June 1, 2022