SEO – Robots.txt
A robots.txt file is a plain-text file that tells search engine crawlers which pages of a website they may or may not crawl.
The robots.txt file implements the Robots Exclusion Protocol (REP) for a website. It plays an important role when you want to keep search engine crawlers away from parts of your website content. It does not restrict human visitors, who can still open a blocked URL directly.
Most search engine crawlers (such as Googlebot and Bingbot) obey your robots.txt file, but some (bad) web robots may not. Therefore, you should not rely on robots.txt alone to protect private or confidential data on the website. Understanding how to use the robots.txt file is an important part of applying Search Engine Optimization (SEO) guidelines to your websites.
How does a robots.txt file work?
A search engine first performs crawling to discover web pages and then performs indexing to include them in search results. During crawling, search engine bots follow links from page to page to discover new content, which is how they are able to cover well over a billion websites.
Before crawling a website, search engine bots check its robots.txt file. If the file allows everything, the search engine may crawl all pages. If you disallow all or certain pages, compliant crawlers will not crawl those pages. The robots.txt file specifies crawling instructions only; it does not control indexing, so a disallowed URL can still appear in search results if other pages link to it.
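The check a well-behaved crawler performs before fetching a page can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration: the rules, the bot name `ExampleBot`, and the URLs are made-up assumptions, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as a crawler might have downloaded them from
# https://www.example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules let `user_agent` crawl `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an invented crawler name used only for this sketch.
print(allowed(ROBOTS_TXT, "ExampleBot", "https://www.example.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "https://www.example.com/private/report.html"))
```

The first call reports that crawling is permitted and the second that it is not. The answer only concerns crawling; it says nothing about whether a URL ends up in the index.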
Tips & Instructions while creating robots.txt:
- The robots.txt file must be placed in the website’s root (top-level) directory and be publicly accessible.
- Example 1: https://www.example.com/robots.txt
- Example 2: https://subdomain.example.com/robots.txt
- The file name robots.txt must be in lower case.
- Search engines may cache the robots.txt file for some time (Google generally caches it for up to one day).
- If you want Google to fetch the latest robots.txt file immediately, you can try the ‘Submit’ option of the robots.txt testing tool in Google Search Console.
- Google enforces a maximum robots.txt file size of 500 KiB and ignores content beyond that limit. This is enough for most websites. Still, if you have thousands of URLs, group them with folder-level rules instead of listing each one.
- If the robots.txt file does not exist or does not contain any Disallow rules, crawlers will treat the whole site as open to crawling.
- The paths in Allow and Disallow rules are case-sensitive. Therefore, specify the exact folder and file names.
- It is recommended to include the sitemap file location at the bottom of the robots.txt file. It helps crawlers locate all your website pages faster.
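Two of the tips above, the root location of the file and the size limit, can be illustrated with a short Python sketch. The function names `robots_txt_url` and `within_google_limit` are hypothetical helpers invented for this example, and the limit is assumed to be 500 KiB (500 × 1024 bytes).

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed size limit applied by Google: 500 KiB.
GOOGLE_MAX_BYTES = 500 * 1024


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves `page_url`.

    The file always lives at the root of the scheme + host, so each
    subdomain has its own robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def within_google_limit(content: bytes) -> bool:
    """Return True if the file body fits inside the assumed size limit."""
    return len(content) <= GOOGLE_MAX_BYTES


print(robots_txt_url("https://www.example.com/blog/post.html?page=2"))
print(robots_txt_url("https://subdomain.example.com/docs/"))
```

The two calls yield the same URLs as Example 1 and Example 2 in the list above.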
Syntax of the robots.txt file:
You can specify different elements such as User-agent, Allow, Disallow, Crawl-delay, and Sitemap in robots.txt. Here is the basic structure of the file, with one element per line:
User-agent: [web crawler or robot name]
Disallow: [file or folder URL path that must not be crawled]
Allow: [file or folder URL path that may be crawled]
|Element|Description|
|---|---|
|User-agent|Specifies the web crawler (bot) that the following rules apply to.|
|Allow|Specifies a page, file, or folder that the crawler is allowed to access.|
|Disallow|Specifies a page, file, or folder that the crawler is not allowed to access.|
|Crawl-delay|Tells the crawler how long (in seconds) to wait between requests. Google ignores this directive; its crawl rate is managed separately (for example, through Search Console settings where available).|
|Sitemap|Specifies the URL of your website's sitemap file. This is supported by a limited set of bots such as Google, Bing, Ask, and Yahoo.|
Note: By convention, write each element name in title case (first letter capital only), although crawlers treat the names as case-insensitive. A line starting with the hash (#) symbol is treated as a comment.
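To make the line format concrete, here is a minimal Python sketch that splits one robots.txt line into an element name and its value. It assumes that element names are matched case-insensitively and that everything after a hash (#) is a comment; `parse_line` is a hypothetical helper, not part of any library.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split a robots.txt line into (element, value).

    Returns None for blank lines, comment-only lines, and lines
    without a colon.
    """
    # Everything after '#' is a comment and is discarded.
    content = line.split("#", 1)[0].strip()
    if ":" not in content:
        return None
    # Split on the first colon only, so URLs in the value stay intact.
    name, value = content.split(":", 1)
    return name.strip().lower(), value.strip()


print(parse_line("User-agent: Googlebot"))
print(parse_line("Disallow: /images/  # keep crawlers out of images"))
print(parse_line("Sitemap: https://www.example.com/sitemap.xml"))
print(parse_line("# this line is only a comment"))
```

The element name is lower-cased for comparison, while the value keeps its original case because paths are case-sensitive.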
List of Popular User agents (Bots):
- Google: Googlebot
- Bing: Bingbot
- Google Images: Googlebot-Image
- Yahoo: Slurp
- Baidu: Baiduspider
You can find a list of popular web crawlers and user agents here.
Example 1: Allow all search engines to crawl everything.
User-agent: *
Disallow:
Here, the asterisk (*) is a wildcard that addresses all search engines. Because the Disallow value is empty, all search engines are allowed to crawl all content.
Example 2: Block all access for all search engines
If you want to keep your whole website from being crawled by search engines, you can block all of them.
User-agent: *
Disallow: /
Here, the slash (/) matches all folders, subfolders, and files.
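Examples 1 and 2 differ only in the Disallow value, and the effect of that difference can be checked with Python's standard `urllib.robotparser` module. This is a small sketch; the bot name `ExampleBot` and the URL are assumptions made for the illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # Example 1: empty Disallow
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # Example 2: Disallow everything


def can_crawl(robots_txt: str, url: str) -> bool:
    """Check `url` against the given rules for a hypothetical crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", url)


url = "https://www.example.com/products/index.html"
print(can_crawl(ALLOW_ALL, url))  # True: an empty Disallow blocks nothing
print(can_crawl(BLOCK_ALL, url))  # False: "/" is a prefix of every path
```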
Example 3: Block selected file/folder for all search engines
Suppose you do not want search engines to crawl a certain file (private.html) and folder (images). Use a separate Disallow line for each file and folder that needs to be excluded.
User-agent: *
Disallow: /images/
Disallow: /private.html
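The rules of Example 3 can be tried out against a few paths with Python's standard `urllib.robotparser` module. The sketch below assumes a made-up crawler name (`ExampleBot`) and host (`www.example.com`); it also shows that path matching is case-sensitive.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_3 = """\
User-agent: *
Disallow: /images/
Disallow: /private.html
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_3.splitlines())


def can_crawl(path: str) -> bool:
    """Check a path on a hypothetical host for a hypothetical crawler."""
    return parser.can_fetch("ExampleBot", "https://www.example.com" + path)


print(can_crawl("/images/logo.png"))  # False: inside the blocked folder
print(can_crawl("/private.html"))     # False: the blocked file
print(can_crawl("/about.html"))       # True: no rule matches
print(can_crawl("/Images/logo.png"))  # True: "/Images/" is not "/images/"
```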
Example 4: Multiple Rules
Here, we have blocked ‘my-folder’ for the Google search engine, blocked everything for Google Images, and allowed everything for the Bing search engine.
User-agent: googlebot
Disallow: /my-folder/
Allow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Bingbot
Allow: /

Sitemap: https://www.domain.com/sitemap.xml
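How a crawler might pick its own group and resolve the Allow/Disallow conflict in Example 4 can be sketched in Python. This is a simplified illustration rather than any search engine's real implementation: it assumes exact, case-insensitive matching of the user-agent name, a fallback to the `*` group, no path wildcards, and the "longest matching path wins, Allow wins ties" rule. `parse_groups` and `is_allowed` are hypothetical helpers.

```python
EXAMPLE_4 = """\
User-agent: googlebot
Disallow: /my-folder/
Allow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Bingbot
Allow: /

Sitemap: https://www.domain.com/sitemap.xml
"""


def parse_groups(text):
    """Return ({agent_lowercase: [(is_allow, path), ...]}, [sitemap_urls])."""
    groups, sitemaps = {}, []
    current_agents, seen_rule = [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif name in ("allow", "disallow"):
            seen_rule = True
            if value:  # an empty value matches nothing
                for agent in current_agents:
                    groups[agent].append((name == "allow", value))
        elif name == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps


def is_allowed(groups, user_agent, path):
    """Apply the longest matching rule; with no match, crawling is allowed."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = (-1, True)  # (length of matched path, verdict)
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            # Longer paths win; on equal length, True (Allow) sorts higher.
            best = max(best, (len(prefix), is_allow))
    return best[1]


groups, sitemaps = parse_groups(EXAMPLE_4)
print(is_allowed(groups, "Googlebot", "/my-folder/page.html"))  # False
print(is_allowed(groups, "Googlebot", "/about.html"))           # True
print(is_allowed(groups, "Googlebot-Image", "/photo.jpg"))      # False
print(is_allowed(groups, "Bingbot", "/my-folder/page.html"))    # True
print(sitemaps)
```

Under these assumptions the order of Allow and Disallow inside a group does not matter, because the most specific (longest) path decides. Parsers that apply the first matching rule instead can give different answers for the same file.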
- Google Help Document – https://support.google.com/webmasters/answer/6062596?hl=en
- Google Search Console – Robots file testing – https://www.google.com/webmasters/tools/robots-testing-tool
- Ryte Robots.txt Test Tool – https://en.ryte.com/free-tools/robots-txt/