
How to Tell Google Which Pages Not to Crawl

Last Updated : 01 Dec, 2023

Google crawls the pages of your website to index them in search results. However, there are some pages you may not want indexed by Google, such as login pages, admin pages, and pages with duplicate content (duplicates are normally handled with canonical tags). Below are several ways to tell Google which pages you don’t want crawled or indexed.

1. “Noindex” meta tag:

The “noindex” tag is a meta tag placed in the <head> section of a webpage’s HTML code. It tells search engine bots not to include the page in their index: Googlebot must still crawl the page to see the tag, but it will then keep the page out of search results.

<meta name="robots" content="noindex">

The above code prevents all web crawlers that honor the directive from indexing the page.

<meta name="googlebot" content="noindex">

Note: The above code applies only to Google’s crawler; Bing, Yahoo, and other bots can still index the page.
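If you want to target another crawler specifically, you can add a separate meta tag naming that bot. As an illustration (bot names vary by search engine, so check each engine’s documentation), Bing’s crawler recognizes the bingbot name:

<meta name="bingbot" content="noindex">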

2. “Nofollow” meta tag:

The “nofollow” tag is another meta tag used to control how search engines follow links on a page. It tells search engines not to pass PageRank to the linked pages. It’s typically used to prevent the flow of authority to unimportant or untrusted pages.
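For example, a page-level nofollow directive looks like this:

<meta name="robots" content="nofollow">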

Can “Noindex, Nofollow” Work Together?

Combine “noindex” and “nofollow” when you want to exclude a page from both search results and prevent search engines from following any links on that page. This is common for pages that you don’t want to be discovered or crawled by search engines.
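Both directives can be combined in a single meta tag:

<meta name="robots" content="noindex, nofollow">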

3. HTTP response header

Search engine crawlers like Googlebot take important guidance from HTTP response headers on how to handle and index web pages. Certain headers can influence whether a page is indexed; the relevant header is described below:

“X-Robots-Tag” HTTP header:

The X-Robots-Tag HTTP header is returned as part of the response for a URL, with a value such as noindex or none. You can also use this response header for non-HTML resources, such as PDFs, video files, and image files. For some businesses, adding the robots meta tag to every page, manually or even programmatically, may be more difficult than having the server (or a script) add the X-Robots-Tag header. The outcome is the same for the HTTP header and the meta tag; which of these strategies your website employs is a matter of preference.
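If your pages are generated by an application rather than served as static files, the header can also be set in application code. Below is a minimal sketch, assuming a Python Flask app (the framework choice and the route name are illustrative, not part of the original article):

from flask import Flask

app = Flask(__name__)

# Hypothetical route standing in for any page you want kept out of search results.
@app.route("/internal-report")
def internal_report():
    return "This page should not appear in search results."

# Attach the X-Robots-Tag header to every response served by this app;
# crawlers that honor the directive will not index these URLs.
@app.after_request
def add_noindex_header(response):
    response.headers["X-Robots-Tag"] = "noindex"
    return response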

On Apache-based web servers, the header is configured in the .htaccess or httpd.conf file. Regular expressions can be used to create complex crawling rules that are not possible with other methods, and rules can be applied globally across a site, which is more efficient and easier to maintain than adding a meta tag to every page.

HTTP/1.1 200 OK
X-Robots-Tag: noindex

For example, to add a noindex X-Robots-Tag to the HTTP response for all .pdf files across an entire site using Apache:

<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>

The above snippet can be added to the site’s root .htaccess file or to httpd.conf.

For example, to add a noindex X-Robots-Tag to the HTTP response for all .pdf files across an entire site using NGINX:

location ~* \.(pdf)$ {
    add_header X-Robots-Tag "noindex";
}

The above snippet can be added to the site’s NGINX .conf file. Once you have added the X-Robots-Tag to your configuration, reload or restart your web server for the change to take effect.
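For example, on a typical Linux setup (exact commands depend on your distribution and how the server was installed), you can test the configuration and then reload the server:

# NGINX: check the configuration, then reload it
sudo nginx -t
sudo nginx -s reload

# Apache (Debian/Ubuntu): check the configuration, then reload the service
sudo apachectl configtest
sudo systemctl reload apache2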

4. Using robots.txt rules

The robots.txt file acts as a gatekeeper: before crawling your website, well-behaved bots first fetch the robots.txt file and read which pages they are allowed to crawl and which they are not. A robots.txt file tells the Google crawler bot which URLs it can access on your website. You can view our robots.txt file at this URL: https://www.geeksforgeeks.org/robots.txt

User-agent: *
Disallow: /wp-admin/
Disallow: /community/
Disallow: /wp-content/plugins/
Disallow: /content-override.php
User-agent: ChatGPT-User
Disallow: /

Now let’s explain the above code:

  • User-agent specifies which bots the rules apply to.
  • * means all bots.
  • Disallow means: do not crawl any URL whose path begins with the given value.
  • For example:

https://www.geeksforgeeks.org/wp-admin/image.jpg is not allowed to be crawled, because its path begins with /wp-admin/.
A URL such as https://www.geeksforgeeks.org/news/wp-admin/image.jpg is not matched by the /wp-admin/ rule, because its path begins with /news/, so it can still be crawled unless another rule blocks it.

  • User-agent: ChatGPT-User together with Disallow: / blocks the ChatGPT bot from crawling the whole website.

User-agent: *
Disallow: /

The above code blocks all web crawlers from visiting any page of the website.
Note: If you want a URL deindexed from Google Search quickly, you can submit a removal request in Google Search Console (GSC).

Conclusion

Using these methods, you can effectively communicate to Google which pages or directories you don’t want to be crawled or indexed. Keep in mind that it may take some time for Google to re-crawl your site and update its index based on these directives. Additionally, for sensitive or confidential information, it’s important to use proper access controls and authentication in addition to these methods.

How to tell Google Which Pages Not to Crawl

Telling Google not to crawl specific pages is essential to avoid wasting crawl budget on low-value or irrelevant content. This is especially important for large websites with extensive content. You can use the robots.txt file to tell Google which pages not to crawl:

User-agent: *
Disallow: /private/
Disallow: /admin/

In this example, we are telling all user agents (search engine bots) not to crawl pages within the “/private/” and “/admin/” directories.

Which Pages to Keep Out of Search Engine Indexing

Not all pages on your website should be indexed by search engines. For instance, duplicate content, privacy policy pages, or certain landing pages may not need to appear in search results. To prevent indexing, you can use the “noindex” meta tag within the HTML of the page:

<meta name="robots" content="noindex">

How to Identify Pages That Should Be Removed

If you have pages that you want to completely remove from Google’s index, consider using the Google Search Console’s “Remove URL” tool. This is particularly useful for outdated or sensitive content that you no longer want to be associated with your website.

Essential Tools for URL Removal

The Google Search Console is the primary tool for managing the removal of URLs from Google’s index. It provides options to temporarily hide URLs or request the removal of specific pages.

Preventing Indexing for Enhanced Privacy

To prevent indexing, you can use the “noindex” directive, which instructs search engines not to include a particular page in their index.

<meta name="robots" content="noindex">

Methods for De-Indexing from Google

  1. Robots.txt: You can use a robots.txt file to disallow specific pages or directories from being crawled.
  2. Meta Robots Tag: Placing a “noindex” meta tag in the HTML of a page instructs search engines not to index that page.
  3. HTTP Header Response: You can use the “X-Robots-Tag” HTTP header to send a “noindex” directive to search engines (see the example below).
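For example, the header form also supports combining directives, so a single response can block both indexing and link following:

HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow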

