The work of an SEO optimizer covers a lot of ground. Beginners are advised to write down the optimization algorithm so as not to miss any steps; otherwise the promotion can hardly be called successful, since the site will keep running into failures and errors that take a long time to correct.
One of the optimization steps is working with the robots.txt file. Every resource should have this document, because without it optimization becomes noticeably harder. The file performs many functions that are worth understanding.
Robot Assistant
The robots.txt file is a plain text document that can be opened in the system's standard Notepad. When creating it, set the encoding to UTF-8 so that it can be read correctly. The file works over the HTTP, HTTPS and FTP protocols.
This document is an assistant to search robots. If you did not know, every search engine uses "spiders" that quickly crawl the World Wide Web in order to return sites relevant to users' queries. These robots need rules for accessing the resource's data, and that is exactly what robots.txt provides.
For the spiders to find their way, the robots.txt document must be placed in the root directory of the site. To check whether a site already has this file, enter "https://site.com.ua/robots.txt" into the browser's address bar, substituting your own domain for "site.com.ua".
Document functions
The robots.txt file provides crawlers with several kinds of instructions. It can grant partial access, so that a "spider" scans only specific parts of the resource. Full access lets it check every available page, while a complete ban prevents robots from crawling at all, so they leave the site.
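As a rough sketch, these three situations can be expressed in robots.txt as follows (the /private/ folder name is purely illustrative):

Full access:
User-agent: *
Disallow:

Partial access:
User-agent: *
Disallow: /private/

Complete ban:
User-agent: *
Disallow: /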
When a "spider" requests the file, the server returns a response code, and the robot's further behaviour depends on which one it receives. For example, if the request was successful, the robot receives a 2xx code and simply reads the file.
The site may also redirect from one address to another, in which case the robot receives a 3xx code. It will follow such redirects until it gets a different response, but as a rule it makes no more than five attempts; after that the file is treated as the familiar 404 error (not found).
If the answer is a 4xx code, the robot assumes it may crawl all of the site's content. In the case of a 5xx code, however, the check may stop completely, since such codes usually indicate temporary server errors.
What is robots.txt needed for?
As you may have guessed, this file sits in the root of the site and serves as a guide for robots. Nowadays it is used mainly to restrict access to content that should not end up in search results:
- pages with personal information of users;
- mirror sites;
- search results;
- data submission forms, etc.
If there is no robots.txt file in the site root, the robot will crawl absolutely all content. As a result, unwanted data may appear in the search results, which hurts both you and the site. If the robots.txt document contains special instructions, the "spider" will follow them and expose only the information the owner of the resource wants to show.
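For instance, a sketch of a file that hides the kinds of sections listed above might look like this (the folder names /user/, /search/ and /forms/ are assumptions for the sake of the example, not standard paths):

User-agent: *
Disallow: /user/
Disallow: /search/
Disallow: /forms/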
Working with the file
To close a site from indexing using robots.txt, you need to figure out how to create this file. To do this, follow the instructions:
- Create a document in Notepad or Notepad++.
- Name the file robots.txt and set the ".txt" extension.
- Enter the required data and commands.
- Save the document and upload it to the site root.
As you can see, at one of the stages you have to set commands for the robots. They come in two types: allowing (Allow) and prohibiting (Disallow). In addition, some optimizers specify the crawl rate (Crawl-delay), the host (Host) and a link to the resource's sitemap (Sitemap).
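A sketch of a file combining these directives might look like the one below; the folder names and the sitemap address are illustrative, and note that Crawl-delay and Host are not supported by every search engine:

User-agent: *
Disallow: /admin/
Allow: /admin/open-page.html
Crawl-delay: 2
Host: site.com.ua
Sitemap: https://site.com.ua/sitemap.xml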
To start working with robots.txt and close the site from indexing completely, you must also understand the symbols used. For example, the document uses "/", which selects the entire site, and "*", which stands for any sequence of characters. With these symbols you can point at a specific folder that either may or may not be scanned.
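A short sketch of how these symbols are combined (the folder name and the query parameter are invented for illustration):

User-agent: *
# everything inside the /drafts/ folder is closed
Disallow: /drafts/
# any address containing ?print= is closed as well
Disallow: /*?print=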
Features of bots
"Spiders" for search engines are different, so if you work for several search engines at once, then you will have to take this moment into account. Their names are different, which means that if you want to contact a specific robot, you will have to specify its name: “User Agent: Yandex” (without quotes).
If you want to set directives for all search engines, then you need to use the command: "User Agent: " (without quotes). In order to properly block the site from indexing using robots.txt, you need to know the specifics of popular search engines.
The fact is that the most popular search engines Yandex and Google have several bots. Each of them has its own tasks. For example, Yandex Bot and Googlebot are the main "spiders" that crawl the site. Knowing all the bots, it will be easier to fine-tune the indexing of your resource.
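For example, a single file can contain separate groups of rules for different "spiders" (the folder names here are made up for the sake of the example):

User-agent: Googlebot
Disallow: /no-google/

User-agent: YandexBot
Disallow: /no-yandex/

User-agent: *
Disallow: /private/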
Examples
So, with the help of robots.txt you can close the site from indexing with simple commands; the main thing is to understand exactly what you need. For example, if you want Googlebot to stay away from your resource, you give it the corresponding instruction, which consists of two lines: "User-agent: Googlebot" and "Disallow: /" (without quotes).
Now let's look at what this command contains and how it works. "User-agent" is used to address one of the bots directly; next we indicate which one, in our case Googlebot. The "Disallow" directive must start on a new line and forbids the robot from entering the site. The slash here means that all pages of the resource fall under the rule.
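Written out line by line, the command from this example looks like this:

User-agent: Googlebot
Disallow: /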
In robots.txt you can disable indexing for all search engines with an equally simple command: "User-agent: *" followed by "Disallow: /" (without quotes). The asterisk in this case stands for all search robots. Such a command is typically needed to pause indexing of the site while major work is carried out on it, work that could otherwise hurt the optimization.
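As a complete file, it looks like this:

User-agent: *
Disallow: /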
If the resource is large and has many pages, it often contains internal information that is either undesirable to disclose or can negatively affect promotion. In this case, you need to understand how to close individual pages from indexing in robots.txt.
You can hide either a folder or a file. In the first case, you again start by addressing a specific bot or all of them with the "User-agent" command, and below it you add a "Disallow" command for the specific folder. It will look like this: "Disallow: /folder/" (without quotes). That hides the entire folder. If the folder contains an important file that you still want to show, add the command "Allow: /folder/file.php" (without quotes) below.
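Put together (here all robots are addressed with "*", though a specific bot name could be used instead), the file from this example is:

User-agent: *
Disallow: /folder/
Allow: /folder/file.php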
Checking the file
If you have closed the site from indexing using robots.txt but are not sure that all of your directives work correctly, you can verify that everything is set up properly.
First, check the placement of the document once again: remember that it must sit in the root folder and nowhere else; if it ends up in any other folder, it will not work. Then open the browser and enter the address "https://yoursite.com/robots.txt" (without quotes). If your web browser returns an error, the file is not where it should be.
Directives can also be checked with special tools that almost all webmasters use, namely the Google and Yandex products. For example, Google Search Console has a robots.txt testing tool: open the "Crawl" section, launch the tool, copy all the data from your document into the window and start the check. Exactly the same check can be done in Yandex.Webmaster.