What is a robots.txt file?

A robots.txt file is a set of instructions for bots located in a website's source files. It instructs good bots, like search engine web crawlers, on which parts of a website they are allowed to access and which they should avoid, helping to manage traffic and control indexing. It can also provide instructions to AI crawlers.

What is a web crawler?

A web crawler is an automated bot that visits and indexes webpages for search engines, helping users find content through search results.

What is the Robots Exclusion Protocol?

The Robots Exclusion Protocol is the format for instructions in a robots.txt file. The protocol tells web crawlers which webpages or resources they should not access or crawl on a website.

What does 'User-agent' mean in a robots.txt file?

"User-agent" specifies which bot or group of bots a set of instructions applies to in a robots.txt file. "User-agent: *" means the rule applies to all bots.

What is the Disallow command in robots.txt?

The Disallow command tells bots not to crawl specific pages or directories on a website. For example, "Disallow: /private/" blocks access to the "private" directory.

What is the Sitemaps protocol in robots.txt?

The Sitemaps protocol allows website owners to include links to their sitemap XML files in robots.txt, helping bots discover which pages should be crawled.

What’s the difference between good bots and bad bots?

Good bots follow robots.txt instructions and perform helpful services. Search engine web crawlers, for instance, typically honor robots.txt rules as they index content for search. Bad bots often ignore robots.txt and may scrape content, attack websites, or send excessive requests that drive up costs for the website.

What does the Crawl-delay command do in robots.txt?

The Crawl-delay command tells bots how long to wait between requests to avoid overloading a server. Not all bots respect this command: Googlebot, for instance, does not, although Google allows website administrators to set a similar rule through Google Search Console.

How does robots.txt impact SEO optimization?

A well-constructed robots.txt file can improve SEO by controlling which pages are indexed by search engines, preventing non-essential or duplicate content from being indexed. Additionally, robots.txt can help web crawlers find all the pages they should index via Sitemaps.

什么是 Robots.txt？| Robots.txt 文件的工作原理

robots.txt 文件列出网站针对机器人行为的偏好。它告诉机器人应该和不应该访问哪些网页。Robots.txt 文件与 Web 爬虫最相关。

学习目标

阅读本文后，您将能够：

了解什么是 robots.txt 文件
了解机器人如何与 robots.txt 文件进行交互
探索 robots.txt 文件中使用的协议，包括机器人排除协议和站点地图

想要继续学习吗？

订阅 TheNET，这是 Cloudflare 每月对互联网上最流行见解的总结！

复制文章链接

使用 Cloudflare 解决方案，抵御凭据填充和内容抓取之类的机器人攻击

开始阻止恶意机器人

什么是 robots.txt？

Robots.txt 文件是针对机器人的一组指引规则。该文件包含在大多数网站的源文件中。Robots.txt 文件旨在管理 Web 爬虫等机器人的活动，但并非所有机器人都将遵循这些说明。

可以将 robots.txt 文件视为贴在健身房、酒吧或社区中心墙上的"行为准则"标牌：标牌本身无权执行所列规则，但"有素质"的顾客将遵守规则，而"没有素质的"顾客可能会打破规则并被驱逐。

机器人是与网站和应用交互的自动化计算机程序。其中一种机器人称为 Web 爬虫机器人。这些机器人会“爬取”网页并索引内容，以便其可以显示在搜索引擎结果中。robots.txt 文件可帮助管理这些 Web 爬虫的活动，避免给托管网站的 Web 服务器带来过重负担，或索引那些非供公开查看的页面。robots.txt 文件还可以帮助管理 AI 爬虫机器人的活动，这些机器人有时对 Web 服务器的要求远远超过传统的 Web 爬虫。

Robots.txt 文件如何工作？

Robots.txt 文件只是一个没有 HTML 标记代码的文本文件（因此扩展名为 .txt）。robots.txt 文件与网站上的任何其他文件一样，都托管在网络服务器上。实际上，通常可以通过输入主页的完整 URL，然后添加 /robots.txt 来查看任何给定网站的 robots.txt 文件，例如 https://www.cloudflare.com/robots.txt。该文件未链接到网站上的任何其他位置，因此用户不太可能会偶然发现该文件，但是大多数网页爬网程序机器人都会在抓取该网站的其余部分之前先查找该文件。

虽然 robots.txt 文件为机器人提供了指令，但无法实际强制执行这些指令。一些机器人，例如 Web 爬虫或新闻推送机器人，可能会在查看域上的其他页面之前先访问 robots.txt 文件，并可能会遵循指引。其他机器人要么忽略 robots.txt 文件，要么通过解析该文件以查找被禁止访问的网页。

一个遵守 robots.txt 规范的 Web 爬虫机器人将遵循 robots.txt 文件中最具体的指令。如果文件中有矛盾的命令，则机器人将遵循更细化的命令。

要注意的一件事是，所有子域都需要有自己的 robots.txt 文件。例如，尽管 www.cloudflare.com 拥有自己的文件，但所有 Cloudflare 子域（blog.cloudflare.com，community.cloudflare.com等）也需要它们自己的文件。

Robots.txt 文件中使用了哪些协议？

在网络中，协议是用于提供指令或命令的格式。Robots.txt 文件使用几种不同的协议。主要协议称为“机器人排除协议 (Robots Exclusion Protocol)”。这是一种告诉机器人避免使用哪些网页和资源的方法。robots.txt 文件中包含为此协议格式化的说明。

用于 robots.txt 文件的其他协议是 Sitemaps 协议。这可以视为机器人包含协议。站点地图会向网络抓取工具显示其可以抓取的页面。这有助于确保网页爬网程序机器人不会遗漏任何重要页面。

Robots.txt 文件示例

这是 www.cloudflare.com 的 robots.txt 文件的一个旧版本：

下面我们将分解所有这些含义。

什么是用户代理？“User-agent: *”（用户代理）是什么意思？

在互联网上活动的任何人或程序都将有"用户代理"或分配的名称。对于人类用户，这包括诸如浏览器类型和操作系统版本之类的信息，但不包括个人信息。它可以帮助网站显示与用户系统兼容的内容。对于机器人，用户代理（理论上）可帮助网站管理员了解哪种机器人正在爬网该网站。

在 robots.txt 文件中，网站管理员可以通过为机器人用户代理编写不同的说明来为特定机器人提供特定说明。例如，如果管理员希望某个页面显示在 Google 搜索结果中而不显示在 Bing 搜索结果中，则它们可以在 robots.txt 文件中包含两组命令：一组命令前面带有"User-agent: Bingbot "和另一组前面带有"User-agent: Googlebot" 。

在上面的示例中，Cloudflare 在 robots.txt 文件中包含了 "User-agent: *"。星号表示“通配符”用户代理，这意味着该指令适用于每个机器人，而不是任何特定机器人。

通用搜索引擎机器人用户代理名称包括：

Google：

Googlebot
Googlebot-Image（用于图像）
Googlebot-News（用于新闻）
Googlebot-Video（用于视频）

Bing

Bingbot
MSNBot-Media（用于图像和视频）

Baidu

Baiduspider

“Disallow”（禁止）命令在 robots.txt 文件中的工作原理是什么？

Disallow （禁止）指令是 robots exclusion 协议中最常见的指令。它指示机器人不要访问指令后出现的网页或网页集合。禁止的页面不一定是“隐藏”的——它们只是对普通的 Google 或 Bing 用户没有用处，因此不会向其显示。大多数情况下，如果网站用户知道在哪里找到这些页面，他们仍然可以访问这些页面。

Disallow命令可以通过多种方式使用，上面的示例中显示了几种方式。

阻止一个文件（换言之，也就是一个特定的网页）

例如，如果 Cloudflare 希望阻止机器人爬取我们的文章“什么是机器人？”，那就会编写如下的命令：

Disallow: /learning/bots/what-is-a-bot/

在"disallow"指令之后，网页 URL 中主页之后的部分——在本例中为"www.cloudflare.com"——包含在内。通过这一指令，遵循 robots.txt 指示的机器人将无法访问 https://www.cloudflare.com/learning/bots/what-is-a-bot/，因此，该页面可能不会出现在传统搜索引擎的结果中。

阻止一个目录

有时一次阻止多个页面比单独列出所有页面更有效。如果它们都在网站的同一部分中，robots.txt 文件可以直接阻止包含它们的目录。

还是利用上面的示例：

Disallow: /__mesa/

这意味着__ mesa目录中包含的所有页面都不应被爬网。

允许完全访问

这样的命令以如下方式呈现：

Disallow:

这是告诉机器人可以浏览整个网站，因为没有任何页面是不允许访问的。

向机器人隐藏整个网站

Disallow: /

这里的“/”表示一个网站层次结构的“根”，或者其它分支页面的上层来源，所以它包括主页和所有从它链接出去的网页。使用此命令，搜索引擎机器人可能完全无法抓取该网站。

机器人排除协议还包含哪些其他命令？

Allow（允许）：如同大家所想，“Allow”指令告诉机器人它们被允许访问特定的网页或目录。此命令表示网站允许机器人访问一个特定网页，同时禁止文件中的其余网页。并非所有搜索引擎都识别此指令。

Crawl-delay（抓取延迟）：抓取延迟命令用于防止搜索引擎蜘蛛机器人使服务器负担过重。它允许管理员指定机器人在每个请求之间应等待的时间（以毫秒为单位）。一个等待8毫秒的抓取延迟命令的示例如下：

Crawl-delay: 8

Google 不识别此指令，尽管其他搜索引擎通常可以识别。对于 Google，管理员可以在 Google Search Console 中更改其网站的抓取频率。

什么是 Sitemaps（站点地图）协议？为什么将它包含在 robots.txt 中？

Sitemaps 协议可帮助机器人知道它要爬网的网站内容。

站点地图是一个如下所示的XML文件：

这是一份机器可读的网站上所有页面列表。通过 Sitemaps 协议，指向这些 Sitemaps 的链接可以包含在 robots.txt 文件中。格式为："Sitemaps:"后面加上 XML 文件的网址。您可以在上面的 Cloudflare robots.txt 文件中看到几个示例。

虽然 Sitemaps 协议有助于确保网络蜘蛛机器人在爬网网站时不会错过任何内容，但它们仍将遵循其典型的爬网过程。站点地图不会强制网页爬网程序机器人以不同的方式对网页进行优先级排序。

Robots.txt与机器人管理有何关系？

管理机器人对于保持网站或应用的正常运行至关重要，因为即使善意的机器人活动也可能会使源站服务器负担过重进而导致 Web 资产减慢或关闭。结构完善的 robots.txt 文件可让网站针对 SEO 进行优化，并控制行为良好的机器人活动。robots.txt 文件对管理恶意机器人流量并没有多大帮助。

尽管 robots.txt 文件很重要，但在 2025 年，Cloudflare 发现其排名前 10000 的网站中，仅 37% 配置了 robots.txt 文件。这意味着，很大一部分网站（甚至可能是绝大多数）都未使用这一工具。为了帮助这些网站，尤其是那些可能不希望其原创内容被用于 AI 训练的网站，Cloudflare 提供了“托管 robots.txt”服务。这项服务可代表您创建或更新网站的 robots.txt 文件，并应用您期望的设置。详细了解托管 robots.txt。

Robots.txt 复活节彩蛋

有时，robots.txt 文件会包含复活节彩蛋——开发人员包含的幽默信息，因为他们知道用户很少看到这些文件。例如，YouTube robots.txt 文件中写道：“Created in the distant future (the year 2000) after the robotic uprising of the mid 90's which wiped out all humans.（在 90 年代中期机器人起义消灭了所有人类之后，在遥远的未来（2000 年）创建。）”Cloudflare robots.txt 文件则写到，“Dear robot, be nice.（亲爱的机器人，友好一点。）”


#    .__________________________.
#    | .___________________. |==|
#    | | ................. | |  |
#    | | ::[ Dear robot ]: | |  |
#    | | ::::[ be nice ]:: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | | ,|
#    | !___________________! |(c|
#    !_______________________!__!
#   /                            \
#  /  [][][][][][][][][][][][][]  \
# /  [][][][][][][][][][][][][][]  \
#(  [][][][][____________][][][][]  )
# \ ------------------------------ /
#  \______________________________/

Google 在以下位置还有“humans.txt”文件：https://www.google.com/humans.txt

常见问题解答

什么是robots.txt文件？

robots.txt 文件是网站源文件中的一份清单，包含网站针对机器人行为的偏好设置。它为善意机器人（例如搜索引擎 Web 爬虫）提供指引，表明网站中允许访问的部分和应避免的部分，从而帮助管理流量和控制索引。它还可以列出针对 AI 爬虫的规则。

什么是 Web 爬虫？

Web 爬虫是一种自动化机器人程序，用于访问网页并为搜索引擎编制索引，帮助用户通过搜索结果找到内容。

什么是“机器人排除协议”？

机器人排除协议是 robots.txt 文件中指令的格式。该协议告诉 Web 爬虫它们不应访问爬取网站上的哪些网页或资源。

robots.txt 文件中“User-agent”是什么意思？

“User-agent” 指定在 robots.txt 文件中一组指令适用于哪个机器人或机器人群组。"User-agent: *" 表示该规则适用于所有机器人。

robots.txt 中的 Disallow 指令是什么？

Disallow 指令告诉机器人不要爬取网站上的特定页面或目录。例如，“Disallow: /private/” 告诉机器人不要访问“private”目录。

robots.txt 中的 Sitemaps（站点地图）协议是什么？

Sitemaps 协议允许网站所有者在 robots.txt 文件中包含指向其站点地图 XML 文件的链接，帮助机器人发现哪些页面应该被爬取。

善意机器人和恶意机器人有什么区别？

善意机器人更可能遵循 robots.txt 说明；它们也执行有用的服务。例如，搜索引擎的 Web 爬虫在为搜索索引内容时，通常会遵循 robots.txt 规则。恶意机器人通常会忽略 robots.txt，并可能抓取内容、攻击网站或发送过多请求，从而增加网站的成本。

Crawl-delay 命令在 robots.txt 中的作用是什么？

Crawl-delay 命令告知机器人在请求之间应等待多长时间，以避免服务器过载。并非所有机器人都会遵守此命令：例如，Googlebot 就不会，不过 Google 允许网站管理员通过 Google Search Console 设置类似的规则。

robots.txt 如何影响 SEO 优化？

结构良好的 robots.txt 文件可以通过告知搜索引擎机器人哪些页面需要索引，从而改善 SEO，这有助于防止不必要或重复内容被编入索引。此外，robots.txt 可以帮助 Web 爬虫通过站点地图找到所有应索引的页面。

入门指南

关于机器人

机器人攻击

Bot Management

词汇

学习中心导航