In a blog post on Friday, Google said that while advances in AI have benefited web users, creators, and the ecosystem, some publishers want increased control over how their content is utilized for emerging generative AI applications like automated text and image generation.
The new Google-Extended directive gives publishers a simple way to manage AI training permissions. Website administrators can allow or block Google's web crawlers from accessing their pages and data for the purpose of training AI models such as the Bard conversational system and the Vertex AI platform.
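In practice, the control takes the form of a new user agent token in a site's robots.txt file. A minimal opt-out looks like this (full setup steps appear later in this article):

User-agent: Google-Extended
Disallow: /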
"Making simple and scalable controls, like Google-Extended, available through robots.txt is an important step in providing transparency and control that we believe all providers of AI models should make available," Google stated.
The company acknowledged that as AI expands, publishers will face increasing complexity in managing different uses at scale. Google said it is committed to working with the web and AI communities to explore additional machine-readable approaches to enable publisher choice and control.
While Google says it aims to develop AI responsibly, guided by privacy and ethical principles, critics argue that generative AI tools such as text and image generators can reproduce copyrighted content without permission. The Google-Extended opt-out control lets publishers decide whether their content contributes to AI model training.
Google asserted that transparency and user agency are essential in AI development. The company plans to share more details soon about options for web publishers to control how their content is used for AI training purposes.
The Google-Extended robots.txt token is available immediately for web administrators to implement on their sites. Google said it believes all providers of AI models should make similar opt-out controls accessible to publishers.
How to Add the Google-Extended Crawler Token to a Website's robots.txt File
Here are the steps for web publishers to implement the Google-Extended control in their robots.txt file:
1. Open your robots.txt file for editing, located in the root directory of your website.
2. Add a new line with the following:
User-agent: Google-Extended
3. On the next line, add either:
Disallow: /
- To tell Google crawlers not to access any pages on your site for AI training.
Or:
Allow: /
- To allow Google access to all pages for AI training purposes (this is the default if no Google-Extended directive is present).
4. You can also specify partial access by using Disallow and Allow directives for specific paths (a complete sketch follows this list), e.g.:
- Disallow: /private
- Disallow: /drafts
- Allow: /public
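Putting those directives together, a hypothetical partial-access configuration might look like this (the /private, /drafts, and /public paths are placeholders for your own site structure):

User-agent: Google-Extended
Disallow: /private
Disallow: /drafts
Allow: /public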
If you want to block Google's crawlers from accessing any pages for AI training purposes, an example robots.txt file looks like this:
User-agent: *
Disallow: /private
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
The key parts:
- The Google-Extended user-agent is defined on its own line.
- The Disallow directive on the next line blocks all URLs (/) from access by that user-agent.
- This prevents any pages on the site from being used for AI training by Google.
- Other sections allow regular Googlebot crawling of public pages.
- The Google-Extended control only affects AI training, not general search indexing.
With this example robots.txt configuration, the website administrator has selectively opted out of Google using any content on the site to improve its AI systems.
5. Save the updated robots.txt file and upload it to your site root to take effect.
6. Google's crawlers will now honor the Google-Extended directives when crawling your site (a quick way to sanity-check the rules locally is sketched after these steps).
7. Revisit your directives periodically to update access as needed for your AI training preferences.
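Before Google next recrawls the site, you can check that the rules behave as intended by parsing them locally. The short Python sketch below uses the standard library's urllib.robotparser to test the example rules from this article; the example.com URLs are placeholders, and Google's production matching can differ from Python's parser in edge cases, so treat this only as a rough sanity check.

# Local sanity check of the example robots.txt rules above, using
# Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Regular search crawling is still allowed for Googlebot.
print(parser.can_fetch("Googlebot", "https://example.com/some/page.html"))        # True
# The Google-Extended (AI training) agent is blocked from every path.
print(parser.can_fetch("Google-Extended", "https://example.com/some/page.html"))  # False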
The Google-Extended control gives granular options to manage AI training on a site-wide or page-specific basis. Implementing it in robots.txt provides a simple yet powerful way for web publishers to dictate their content permissions.