In this episode of Ask an SEO, Brian goes through the basics of robots.txt and how you can best use it for SEO.
From best practices to syntax and wildcards, Brian covers the most common errors you will run into while writing a robots.txt file, and how to solve them.
You can watch the video below:
You can also read the transcript below:
Ask an SEO Episode 18: What Is a Robots.txt File Transcript
That’s really its main function: robots.txt tells search engine crawlers which parts of your site you don’t want them to crawl. So if you have anything that you really don’t want Google crawling, then you would put that in the robots.txt file. Keep in mind, though, that blocking crawling isn’t the same thing as keeping a page out of the index; a disallowed URL can still show up in search results if other pages link to it, so robots.txt on its own won’t stop something from ranking. Now there are a few considerations that Google takes into account for this type of file, so let’s take a look at these and dive right in.
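To make that concrete, here’s a minimal sketch of a robots.txt file; the domain and directory are placeholders for illustration, not a recommendation for any particular site:

```
# Served from the site root: https://www.example.com/robots.txt
User-agent: *
Disallow: /private/
```

This tells every crawler (User-agent: *) not to crawl anything under /private/; everything else stays crawlable by default.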
All right. This is the Google developer documentation for robots.txt files, and we’re going to go through the specific examples line by line. All right. Let’s hop to it.
So, the first thing they show are examples of valid robots.txt URLs. This is pretty important to go over, because the last thing you want to do is serve your robots.txt from a URL that isn’t valid according to Google’s guidelines: the file has to live at the root of the host, and it only applies to the protocol, host, and port it’s served from.
It’s also important to note that when you write the rules inside robots.txt, you’re usually not going to include full URLs. You’re going to include a path, starting from the root, at the point where you want Google to stop crawling.
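As a hedged illustration (the domain and paths below are made up), the rules reference paths relative to the host the file is served from, not full URLs:

```
# Served at https://www.example.com/robots.txt
# These rules apply to https://www.example.com/ only,
# not to https://blog.example.com/ or http://www.example.com/
User-agent: *
Disallow: /checkout/     # /checkout/ and everything beneath it
Disallow: /search        # any path that starts with /search

# A full URL is not a valid rule:
# Disallow: https://www.example.com/checkout/   <- incorrect
```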
Next, they talk about the handling of errors and HTTP status codes. It’s also important to take this into consideration, because the status code that your robots.txt file returns affects how Google crawls the site. For example, Google generally treats a 404 (or other 4xx response) as if no robots.txt exists and crawls without restrictions, while a 5xx server error can cause Google to temporarily hold off on crawling the site altogether. So it’s important to take these into account; if you don’t, you run the risk of introducing crawling problems that shouldn’t be there.
For the syntax that Google actually wants you to follow: the user-agent, allow, disallow, and sitemap directives are the most common fields you’re going to be using in robots.txt. You’re almost always going to start by identifying the specific user agent.
That’s the crawler that the rules that follow apply to. Then you can explicitly allow a path to be crawled, but on its own that’s mostly redundant where Google is concerned, because crawling is Google’s default behavior anyway. Where allow really earns its place is when you want to carve out an exception inside a directory you’ve otherwise disallowed. But in practice, disallow is going to be the most used directive in your robots.txt.
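Here’s a hedged sketch of that exception pattern (the directory names are invented for illustration):

```
User-agent: Googlebot
# Block the whole /media/ directory...
Disallow: /media/
# ...except the press kit, which we still want crawled
Allow: /media/press-kit/
```

When rules conflict, Google applies the most specific (longest) matching rule, so the allow wins for anything under /media/press-kit/.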
And finally, you’re also going to want to declare where your sitemap lives. Unlike the path rules, the sitemap directive takes a full, absolute URL, and it lets Google find your sitemaps directly rather than relying on guesses as to where they might be.
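A quick sketch of what that looks like (the sitemap URLs are placeholders); the sitemap line sits outside any user-agent group and can be repeated if you have more than one sitemap:

```
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
```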
And it’s pretty important to follow the Google developer docs, because they cover a lot of situations like this, for example, when you have groupings of lines and rules. They show you the formatting, the specific rules, and how you want to group those lines in robots.txt.
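As a hedged example of grouping (the crawler tokens are real Google user agents, but the paths are placeholders), each group starts with one or more user-agent lines, and the rules beneath them apply only to those crawlers:

```
# Group 1: Googlebot only
User-agent: Googlebot
Disallow: /drafts/

# Group 2: Google's image and video crawlers
User-agent: Googlebot-Image
User-agent: Googlebot-Video
Disallow: /thumbnails/

# Group 3: everything else
User-agent: *
Disallow: /drafts/
Disallow: /thumbnails/
Disallow: /staging/
```

A crawler only obeys the most specific group that matches it, so Googlebot follows group 1 here and ignores the rules under User-agent: *.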
So this can be a very handy manual for troubleshooting any issues you might come across when you’re trying to figure out the root cause of a robots.txt problem. You also want to make sure that you follow the recommendations on URL matching, which is based on path values.
Among these path values, the asterisk designates zero or more instances of any valid character, and the dollar sign designates the end of the URL. These are wildcards within robots.txt that behave in a specific way when you use them.
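To illustrate both wildcards (the paths and file type are made up for the example):

```
User-agent: *
# * matches zero or more of any character:
# blocks /search, /search?q=shoes, /category/search/page/2, and so on
Disallow: /*search

# $ anchors the match to the end of the URL:
# blocks any URL ending in .pdf, but not /whitepaper.pdf?download=1
Disallow: /*.pdf$
```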
So if you don’t use them correctly, you can introduce crawling errors that way. All right, that’s it for today’s Ask an SEO, Episode 18. This is Brian Harnish signing off. Please be sure to like and subscribe to our YouTube channel for a brand-new episode every week.
Have a great day!