
Another common unintended consequence I've seen is conflating crawling and indexing with regard to robots.txt.

If you make a new page and never want it to enter the Google search index, adding it to robots.txt is fine: Google will never crawl it and it will never enter the index.

If you have an existing page that is currently indexed and want to remove it, adding that page to robots.txt is a BAD idea though. In the short term Google will continue to show the page in search results, but show it with no metadata (because it can't crawl it anymore). Even worse, Google won't pick up any noindex tags on the page, because robots.txt is blocking the page from being crawled!

Eventually Google will get the hint and remove the page from the index, but it can be a very frustrating time waiting for that to happen.
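
To make the crawl side of this concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `can_crawl` helper, and the example.com URLs are made up for illustration, and real Googlebot matching rules may differ in detail from the standard-library parser.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks the page you want removed.
ROBOTS_BLOCKING = """\
User-agent: *
Disallow: /old-page.html
"""

# Hypothetical robots.txt that blocks nothing (an empty Disallow allows everything).
ROBOTS_OPEN = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    url = "https://example.com/old-page.html"
    for label, robots in (("blocking", ROBOTS_BLOCKING), ("open", ROBOTS_OPEN)):
        allowed = can_crawl(robots, "Googlebot", url)
        # If the crawler is not allowed to fetch the page, it never sees
        # any noindex tag that the page carries.
        print(f"{label}: crawl allowed = {allowed}")
```

With the blocking file, the crawler is never allowed to fetch the page, so any noindex tag on it goes unread. That is why the page has to stay crawlable while you wait for it to drop out of the index.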



There are cases where Google might find a URL blocked in robots.txt (through external or internal links), and the page can still be indexed and show up in the search results, even if Google can't crawl it [1].

The only way to be sure that it will stay out of the results is to use a noindex tag, which, as you mentioned, search engine bots need to "read" in the code. If the URL is blocked, the "noindex" cannot be read.

[1] https://developers.google.com/search/docs/crawling-indexing/... (refer to the red "Warning" section)
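
A rough sketch of the "bots need to read it" point, using only Python's standard `html.parser`. The class and function names here are hypothetical, and this only looks at the robots meta tag; it ignores the `X-Robots-Tag` HTTP header, which can also carry noindex.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page's robots meta tag contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_can_take_effect(crawl_allowed: bool, html: str) -> bool:
    """noindex only matters if the crawler is allowed to fetch the page and read it."""
    return crawl_allowed and has_noindex(html)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(noindex_can_take_effect(True, page))   # page is crawlable, tag can be read
    print(noindex_can_take_effect(False, page))  # blocked in robots.txt, tag is never read
```

The `crawl_allowed` flag stands in for whatever robots.txt check you already have. The tag itself is identical in both calls, and the only thing that changes is whether a crawler ever gets to read it.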


It is an interesting tidbit. I personally don't need Google to remove it from the index. It is more of an "I don't care if they index it". I mostly care about the scraping and not the indexing. I do understand that these terms could be used interchangeably. In the past I might have conflated them.



