How to set up Robots.txt correctly?

A correct robots.txt file for an HTML site gives search engine bots a map of what they may check. The file is often referred to as the Robot Exclusion Protocol, and it is the first thing bots look for before crawling a website. It can point crawlers to the Sitemap or tell them not to check certain subdomains. If you want search engines to crawl everything they can find, robots.txt is not strictly required. What matters most in this process is that the file is formatted correctly and does not let crawlers index pages containing users' personal data.
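As a minimal sketch (the blocked path and the sitemap URL are placeholders, not requirements), such a file can be as short as:

  User-agent: *
  Disallow: /personal/
  Sitemap: https://www.example.com/sitemap.xml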

Robot scanning principle


When a search engine encounters the file and sees a banned URL, it does not crawl it, but it may still index it. This is because even if robots are not allowed to view the content, they can remember backlinks pointing to the forbidden URL. Since access to the link is blocked, the URL will appear in search engines, but without snippets. If the correct Robots txt for Bitrix is required for an inbound marketing strategy, scanners verify the site at the user's request.

On the other hand, if the file is not formatted properly, the site may not show up in search results and will not be found. Search engines cannot bypass this file. A programmer can view the robots.txt of any site by going to its domain and appending robots.txt, for example www.domain.com/robots.txt. A tool such as Unamo's SEO optimization section also works: enter any domain, and the service will show whether the file exists.

Restrictions for scanning:

  1. The user has outdated or sensitive content.
  2. Images on the site will not be included in image search results.
  3. The site is not yet ready to be indexed by the robot (for example, a demo version).

Keep in mind that the information in this file is available to anyone who enters the URL, so do not use the text file to hide sensitive data. If the domain returns a 404 (Not Found) or 410 (Gone) error, the search engine crawls the site despite the presence of robots.txt, treating the file as missing. For other errors, such as 500 (Internal Server Error), 403 (Forbidden), a timeout, or "not available", robots.txt instructions are respected, although crawling may be delayed until the file becomes available.

Creating a search file


Many CMS programs, such as WordPress, already have a robots.txt file. Before properly configuring Robots txt for WordPress, the user needs to become familiar with its capabilities in order to figure out how to access it. If the programmer creates the file himself, it must meet the following conditions:

  1. The file name must be in lower case.
  2. Use UTF-8 encoding.
  3. Save it in a text editor as a plain text (.txt) file.

If the user does not know where to place the file, they can contact their web server software vendor to find out how to access the root of the domain, or check it through the Google console. With this function, Google can also verify that the bot is functioning correctly and show the list of pages that have been blocked using the file.

The main format of the correct Robots txt for Bitrix:

  1. The robots.txt legend (notation).
  2. The # character adds comments, which are used as notes only.
  3. Scanners ignore these comments, along with any typos the user makes in them.
  4. User-agent - indicates which search engine the file's instructions are intended for.
  5. Adding an asterisk (*) tells scanners that the instructions are meant for everyone.

A specific bot can also be indicated, for example Googlebot, Baiduspider, or Applebot. Disallow tells crawlers which parts of the website should not be crawled. It looks like this: User-agent: *. The asterisk means "all bots". However, you can specify pages for specific bots; to do this, you need to know the name of the bot for which the recommendations are set.
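A brief sketch of these elements (the blocked folders are hypothetical) might be:

  # Rules for all bots
  User-agent: *
  Disallow: /tmp/

  # Rules for one specific bot
  User-agent: Baiduspider
  Disallow: /downloads/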

The correct robots txt for Yandex might look like this:

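For example, such a file might look like this (the blocked directories here are only assumptions):

  User-agent: Yandex
  Disallow: /admin/
  Disallow: /search/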

If a particular bot should not crawl the site, you can specify it; to find the names of user agents, it is recommended to consult useragentstring.com.

Page optimization


The following two lines are considered a complete robots.txt file, although a single robots file can contain multiple lines of user agents and directives that disallow or allow crawling. The main format of the correct Robots txt (a sketch follows the list):

  1. User-agent: [agent name].
  2. Disallow: [URL string that should not be crawled].
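Taken literally, the shortest complete file in this format is only two lines; this particular pair tells every bot not to crawl anything:

  User-agent: *
  Disallow: /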

In the file, each block of directives appears as a discrete group, separated by a blank line. Each rule placed under a User-agent directive applies only to that section. If the file contains a rule that applies to more than one agent, a robot will only follow the most specific group of instructions.

Technical syntax


It can be thought of as the "language" of robots.txt files. There are five terms that can appear in this format (a combined sketch follows this list):

  1. User-agent - the web crawler the crawl instructions are addressed to, usually a search engine.
  2. Disallow - the command used to tell the user agent to skip a specific URL. Only one Disallow line is allowed per URL.
  3. Allow - tells Googlebot that it may access a page or subfolder even if its parent page or subfolder is disallowed.
  4. Crawl-delay - specifies how many seconds the crawler should wait before crawling. If the bot does not acknowledge this command, the crawl rate is set in the Google console instead.
  5. Sitemap - used to point to any XML sitemaps associated with the URL.
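A sketch that combines these directives (the paths, agents, and sitemap URL are illustrative assumptions, not taken from the article) might look like this:

  User-agent: Googlebot
  Disallow: /archive/
  Allow: /archive/latest.html

  User-agent: Bingbot
  Crawl-delay: 10

  Sitemap: https://www.example.com/sitemap.xml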

Pattern Matches

When it comes to actually blocking or allowing URLs with a valid Robots txt, the operations can be quite tricky, since pattern matching can be used to cover a range of possible URL parameters. Google and Bing both honor two characters that identify pages or subfolders the SEO wants to exclude: the asterisk (*) and the dollar sign ($). The asterisk is a wildcard that represents any sequence of characters, and the dollar sign matches the end of the URL.
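For example, a sketch using both wildcard characters (the paths are hypothetical) could be:

  User-agent: *
  Disallow: /*?sessionid=
  Disallow: /*.pdf$

Here the first rule blocks any URL containing a ?sessionid= parameter, and the second blocks URLs that end in .pdf.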

Google offers a large list of possible template syntaxes that explain to the user how to properly set up a Robots txt file. Some common use cases include:

  1. Prevent duplicate content from appearing in search results.
  2. Keep all sections of the website private.
  3. Keeping internal search results pages out of public search results.
  4. Specifying the location of the sitemap.
  5. Preventing search engines from indexing certain files.
  6. Specifying a crawl delay so that servers are not overloaded when crawlers load multiple pieces of content at once.

Checking for the presence of a robot file

If there are no areas on the site where crawler access needs to be controlled, then robots.txt is not needed at all. If the user is not sure whether the file exists, they can enter the root domain and add /robots.txt to the end of the URL, something like this: moz.com/robots.txt. A number of search bots ignore these files; however, as a rule, these crawlers do not belong to reputable search engines. They are the kind of spammers, mail aggregators, and other automated bots found in abundance on the Internet.

It is very important to remember that the robot exclusion standard is not an effective security measure; in fact, some bots may start precisely with the pages the user has told them not to crawl. A standard exclusion file has several parts. Before telling a robot which pages it should not work on, you need to specify which robot you are addressing. In most cases, the user will use a simple declaration that means "all bots".

SEO optimization


Before optimizing, the user must make sure not to block any content or sections of the site that need to be crawled. Links on pages blocked by the correct Robots txt will not be followed. This means:

  1. If they are not linked from other pages available to search engines (i.e. pages not blocked by robots.txt or a robots meta tag), the linked resources will not be crawled and therefore cannot be indexed.
  2. No link equity can be passed from a blocked page to the link destination. If there is such a page, it is better to use a blocking mechanism other than robots.txt.

Because other pages may link directly to a page containing personal information, and you want to block this page from search results, use a different method, such as password protection or noindex meta data. Some search engines have multiple user agents; for example, Google uses Googlebot for organic searches and Googlebot-Image for image searches.

Most user agents from the same search engine follow the same rules, so there is no need to specify directives for each of several crawlers, but being able to do so lets you fine-tune how the site's content is crawled. The search engine caches the contents of the file and typically updates the cached copy at least once a day. If the user changes the file and wants it updated faster than usual, the robots.txt URL can be submitted to Google.
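As a sketch of this fine-tuning (the blocked folders are only assumptions), separate groups could be written for two Google agents:

  User-agent: Googlebot
  Disallow: /drafts/

  User-agent: Googlebot-Image
  Disallow: /images/private/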

Search engines


To understand how to set up Robots txt correctly, you need to know how search engines work. In short, they send out "scanners", programs that browse the Internet for information, and then store some of that information so it can later be passed on to the user.

For many people, Google already is the Internet. In a sense they are right, since search is perhaps its most important invention. And although search engines have changed a lot since their inception, the underlying principles are still the same. Crawlers, also known as "bots" or "spiders", find pages across billions of websites. Search engines give them directions on where to go, while individual sites can also communicate with the bots and tell them which specific pages they should look at.

Generally, site owners do not want certain pages to show up in search engines: admin pages, backend portals, categories and tags, and other information pages. The robots.txt file can be used to prevent search engines from checking such pages. In short, robots.txt tells web crawlers what to do.

Ban Pages

This is the main part of the robot exclusion file. With a simple declaration, the user tells a bot or group of bots not to crawl certain pages. The syntax is simple, for example, to deny access to everything in the site's "admin" directory, write: Disallow: /admin. This line will prevent bots from crawling yoursite.com/admin, yoursite.com/admin/login, yoursite.com/admin/files/secret.html, and anything else under the admin directory.

To disallow a single page, simply specify it in the disallow line: Disallow: /public/exception.html. Now the "exception" page will not be crawled, but everything else in the "public" folder will.

To include multiple pages, simply list them, as in the sketch below:

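For example (the directories listed are purely illustrative):

  User-agent: *
  Disallow: /admin
  Disallow: /private
  Disallow: /tmp
  Disallow: /public/exception.html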

Four lines like these in the correct Robots txt for Symphony will apply to any user agent listed at the top of that robots.txt section. Other commands in the same file tell web crawlers not to index cpresources/ or vendor/:

  Sitemap:
  User-agent: *
  Disallow: /cpresources/
  Disallow: /vendor/
  Disallow: /.env

Setting standards

The user can specify different pages for different bots by combining the two previous elements. An example of the correct Robots txt for all search engines is presented below.

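A sketch matching the description below (the directory names are assumptions) could be:

  User-agent: Googlebot
  Disallow: /admin/
  Disallow: /private/

  User-agent: Bingbot
  Disallow: /admin/
  Disallow: /private/
  Disallow: /secret/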

The "admin" and "private" sections will be invisible to Google and Bing, but Google will still see the "secret" directory, while Bing will not. You can specify general rules for all bots using the asterisk user agent, and then give specific instructions to the bots in the following sections. With the knowledge above, the user can write an example of the correct Robots txt for all search engines. Just fire up your favorite text editor and tell the bots they're not welcome in certain parts of the site.

Tips for improving server performance

SublimeText is a versatile text editor and the gold standard for many programmers; its appeal lies in efficient coding, and users also appreciate its keyboard shortcuts. To see an example of a robots.txt file, go to any site and add "/robots.txt" to the end of its address; the Giant Bicycles site, for instance, publishes one this way.

The program also makes it possible to create pages that users do not want to show in search engines, and it has a few features that few people know about. For example, while the robots.txt file tells bots where not to go, the sitemap file does the opposite and helps them find what they are looking for; and while search engines probably already know where the sitemap is located, pointing to it does not hurt.

There are two types of sitemap files: an HTML page or an XML file. An HTML page shows visitors all the available pages of a website. In robots.txt, a sitemap reference looks like this: Sitemap: //www.makeuseof.com/sitemap_index.xml. If the site is not being indexed by search engines even though it has been crawled several times by web robots, make sure the file is present and that its permissions are set correctly.

By default, this is the case for all SeoToaster installations, but if necessary the permissions can be reset like this: File robots.txt - 644. Depending on the PHP server, if this does not work for the user, it is recommended to try the following: File robots.txt - 666.

Setting the scan delay

The crawl delay directive tells certain search engines how often they may index a page on the site. It is measured in seconds, although some search engines interpret it slightly differently. Some see Crawl-delay: 5 as an instruction to wait five seconds after each scan before starting the next one.

Others interpret it as an instruction to scan only one page every five seconds. Because the robot cannot scan any faster, server bandwidth is conserved. If the server struggles to keep up with traffic, it can set a crawl delay. In general, in most cases users do not need to worry about this. An eight-second crawl delay is set like this: Crawl-delay: 8.

But not all search engines obey this directive, so when disallowing pages you can set different crawl delays for certain search engines. Once all the instructions in the file are set up, you can upload it to the site; first make sure that it is a plain text file named robots.txt and that it can be found at yoursite.com/robots.txt.
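A sketch with different delays for different engines (the agents and values are only illustrative) might be:

  User-agent: Bingbot
  Crawl-delay: 8

  User-agent: Yandex
  Crawl-delay: 4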

Best WordPress bot


There are some files and directories on a WordPress site that should always be blocked. The directories that users should disallow are the cgi-bin directory and the standard WP directories. Some servers do not allow access to the cgi-bin directory, but users should still include it in the Disallow directive before properly configuring Robots txt for WordPress.

The standard WordPress directories that should be blocked are wp-admin, wp-content, and wp-includes. These directories do not contain data that is useful to search engines, but there is an exception: the wp-content directory has a subdirectory named uploads. This subdirectory must be allowed in the robots.txt file, because it includes everything loaded using the WP media upload feature. WordPress uses tags or categories to structure content.

If categories are used, then, to make the correct Robots txt for WordPress as specified by the program's developers, the tag archives must be blocked from search. First, check the setup by going to the "Administration" panel > "Settings" > "Permalinks".

By default, the base is "tag" if the field is empty, so add: Disallow: /tag/. If tags are used instead of categories, then the category archives must be disabled in the robots.txt file: Disallow: /category/.

Files used primarily for displaying content will be blocked by the correct Robots txt file for WordPress; a sketch of such a file follows.

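Pulling together the directories named in this section, such a file might look like this (an assembled sketch, to be adapted to the specific site, not the article's original listing):

  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /wp-admin/
  Disallow: /wp-includes/
  Disallow: /wp-content/
  Allow: /wp-content/uploads/
  Disallow: /tag/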

Joomla basic setup

Once the user has installed Joomla, they need to review the correct Joomla Robots txt settings in the global configuration, which is located in the control panel. Some settings there are very important for SEO. First, find the name of the site and make sure that the short name of the site is used. Then find the group of settings to the right of the same screen, called SEO settings. The one that will definitely have to change is the second one: use rewrite URLs.

This sounds complicated, but it basically helps Joomla create cleaner URLs, most noticeably by removing the index.php line from them. If you change this later, the URLs will change and Google will not like it. When changing this setting, several steps must be taken at the same time to create the correct robots txt for Joomla (a sketch of a typical default file appears after the list):

  1. Find the htaccess.txt file in the Joomla root folder.
  2. Rename it to .htaccess (no extension).
  3. Include the site name in page titles.
  4. Find the metadata settings at the bottom of the global configuration screen.
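For reference, here is a sketch along the lines of the default robots.txt that ships with recent Joomla versions; the exact list of directories varies by version, so check it against your own installation:

  User-agent: *
  Disallow: /administrator/
  Disallow: /cache/
  Disallow: /includes/
  Disallow: /installation/
  Disallow: /language/
  Disallow: /libraries/
  Disallow: /logs/
  Disallow: /modules/
  Disallow: /plugins/
  Disallow: /tmp/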

Robot in the MODX Cloud


Previously, MODX Cloud let users control the robots.txt behavior for maintenance based on a toggle in the dashboard. While this was useful, it was possible to accidentally allow indexing on staging/dev sites by toggling the option in the Dashboard. Similarly, it was easy to accidentally disable indexing on the production site.

Today the service assumes the presence of robots.txt files in the file system, with the following exception: any domain that ends with modxcloud.com is served a Disallow: / directive for all user agents, regardless of whether the file is present. Production sites that receive real visitor traffic need to use their own domain if the user wants the site indexed.

Some organizations use the correct Robots txt for MODX to run multiple websites from a single installation using Contexts. A case where this could apply would be a public marketing site combined with landing-page microsites and possibly a non-public intranet.

Traditionally this has been difficult to do on multi-site installations, because they share the same web root. With MODX Cloud, it is easy: simply upload an extra file to the website called robots-intranet.example.com.txt with the following content, and it will block indexing by well-behaved robots, while all other hostnames fall back to the standard file unless there are other specific name files.
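Given that the purpose of that file is to block indexing entirely, it would presumably contain the universal block:

  User-agent: *
  Disallow: /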

Robots.txt is an important file that helps a site appear on Google, other major search engines, and other websites. Located at the root of a web server, the file instructs web robots how to crawl a site and sets which folders they should or should not index, using a set of instructions called the Robot Exclusion Protocol. A correct Robots txt for all search engines is especially easy to create with SeoToaster: a special menu has been created for it in the control panel, so the bot will never have to work hard to gain access.
