What is a robots.txt file?

Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).


In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.

Basic format:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Together, these two lines are considered a complete robots.txt file — though one robots file can contain multiple lines of user agents and directives (i.e., disallows, allows, crawl-delays, etc.).
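For example, a minimal file built from just those two lines might look like this (the directory path here is purely hypothetical):

User-agent: *
Disallow: /example-directory/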

Within a robots.txt file, each set of user-agent directives appears as a discrete group, separated by a line break.

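A sketch of such a file, with one group per named crawler and a catch-all group at the end (the specific directives shown are illustrative, not prescriptive):

User-agent: msnbot
Crawl-delay: 5

User-agent: discobot
Disallow: /

User-agent: Slurp
Disallow: /example-subfolder/

User-agent: *
Disallow: /private/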

Msnbot, discobot, and Slurp are all called out specifically, so those user-agents will only pay attention to the directives in their sections of the robots.txt file. All other user-agents will follow the directives in the user-agent: * group.

Example robots.txt:

Here are a few examples of robots.txt in action for a www.example.com site:

Robots.txt file URL: www.example.com/robots.txt

Blocking all web crawlers from all content

User-agent: *
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.

Blocking a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string www.example.com/example-subfolder/.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.

How does robots.txt work?

Search engines have two main jobs:

Crawling the web to discover content;
Indexing that content so that it can be served up to searchers who are looking for information.

To crawl sites, search engines follow links to get from one site to another — ultimately crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.”

After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing through the page. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), it will proceed to crawl other information on the site.

Other quick robots.txt must-knows:

(discussed in more detail below)

In order to be found, a robots.txt file must be placed in a website’s top-level directory.

Robots.txt is case sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise).

Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with more nefarious crawlers like malware robots or email address scrapers.

The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website’s directives (if that site has a robots.txt file!). This means that anyone can see what pages you do or don’t want crawled, so don’t use it to hide private user information.

Each subdomain on a root domain uses its own robots.txt file. This means that both blog.example.com and example.com should have their own robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).


Technical robots.txt syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:

User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine).

Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.

Allow: The command used to tell a crawler it can access a page or subfolder even though its parent page or subfolder is disallowed. (Not every crawler honors this directive.)

Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. (Note that Googlebot does not acknowledge this directive.)

Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL.
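A minimal sketch that combines several of these directives (the paths and sitemap URL are hypothetical):

User-agent: *
Crawl-delay: 10
Disallow: /private/
Allow: /private/public-page.html

Sitemap: https://www.example.com/sitemap.xml

Here every crawler is asked to wait 10 seconds between requests and to stay out of the /private/ folder except for one page, and is pointed at the XML sitemap; crawlers that don’t support Crawl-delay or Allow will simply ignore those lines.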

Pattern-matching

When it comes to the actual URLs to block or allow, robots.txt files can get fairly complex, as they allow the use of pattern-matching to cover a range of possible URL options. Google and Bing both honor two regular expression characters that can be used to identify pages or subfolders that an SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).

* is a wildcard that represents any sequence of characters
$ matches the end of the URL
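For instance, a sketch using both characters (the patterns are hypothetical; the # lines are comments):

User-agent: *
# Block any URL that contains a question mark (e.g., parameterized internal search URLs)
Disallow: /*?
# Block any URL that ends in .pdf
Disallow: /*.pdf$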

Google offers a great list of possible pattern-matching syntax and examples here.

Where does robots.txt go on a site?

Whenever they come to a site, search engines and other web-crawling robots (like Facebook’s crawler, Facebot) know to look for a robots.txt file. But they’ll only look for that file in one specific place: the main directory (typically your root domain or homepage). If a user agent visits www.example.com/robots.txt and does not find a robots file there, it will assume the site does not have one and proceed with crawling everything on the page (and maybe even on the entire site). Even if the robots.txt page did exist at, say, example.com/index/robots.txt or www.example.com/homepage/robots.txt, it would not be discovered by user agents and thus the site would be treated as if it had no robots file at all.

In order to ensure your robots.txt file is found, always include it in your main directory or root domain.

Why vì you need robots.txt?

Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.

Some common use cases include:

Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
Keeping entire sections of a website private (for instance, your engineering team’s staging site)
Keeping internal search results pages from showing up on a public SERP
Specifying the location of sitemap(s)
Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
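For example, a sketch covering a few of these use cases (the folder names and sitemap URL are hypothetical):

User-agent: *
# Keep a staging area and internal search results pages out of crawlers' reach
Disallow: /staging/
Disallow: /search/

# Point crawlers at the XML sitemap
Sitemap: https://www.example.com/sitemap.xml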

If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.

Checking if you have a robots.txt file

Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL. For instance, zerovn.net’s robots file is located at zerovn.net/robots.txt.

If no .txt page appears, you do not currently have a (live) robots.txt page.

How khổng lồ create a robots.txt file

If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. This article from Google walks through the robots.txt file creation process, and this tool allows you to test whether your file is set up correctly.

Looking for some practice creating robots files? This blog post walks through some interactive examples.

SEO best practices

Make sure you’re not blocking any content or sections of your website you want crawled.

Some search engines have multiple user-agents. For instance, Google uses Googlebot for organic search and Googlebot-Image for image search. Most user agents from the same search engine follow the same rules, so there’s no need to specify directives for each of a search engine’s multiple crawlers, but having the ability to do so does allow you to fine-tune how your site content is crawled.
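For instance, a sketch that lets Googlebot crawl everything while keeping Googlebot-Image out of one directory (the directory name is hypothetical):

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /images/private/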


Robots.txt vs meta robots vs x-robots

So many robots! What’s the difference between these three types of robot instructions? First off, robots.txt is an actual text file, whereas meta and x-robots are meta directives. Beyond what they actually are, the three all serve different functions. Robots.txt dictates site- or directory-wide crawl behavior, whereas meta and x-robots can dictate indexation behavior at the individual page (or page element) level.
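As a rough illustration of where each one lives (the directive values shown are just examples, not recommendations):

# In robots.txt — crawl control for the whole site or a directory
Disallow: /example-page/

# In a page's HTML head — index control for that page
<meta name="robots" content="noindex">

# In an HTTP response header — index control, including for non-HTML files like PDFs
X-Robots-Tag: noindex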

Keep learning

Put your skills to work

zerovn.net Pro can identify whether your robots.txt file is blocking our access to your website. Try it >>

