Crawler Hints supports Microsoft’s IndexNow in helping users find new content (cloudflare.com)
77 points by jgrahamc on Aug 27, 2022 | 42 comments


Here's the IndexNow standard that Cloudflare Crawler Hints is using:

https://www.indexnow.org/

The idea is that you can push a notification to a search engine when content changes instead of waiting for the crawler to notice.

https://<searchengine>/indexnow?url=url-changed&key=your-key

You can also submit more than one URL with a POST.

You can notify Bing at https://www.bing.com/indexnow?url=url-changed&key=your-key

If you notify the IndexNow API endpoint it notifies Bing plus other search engines on your behalf:

https://api.indexnow.org/indexnow?url=url-changed&key=your-k...
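
A minimal sketch of the two request shapes described above, in Python. It only builds the GET URL and the JSON body for a multi-URL POST and sends nothing. The JSON field names (`host`, `key`, `urlList`) are as I recall them from the IndexNow documentation, so check there before relying on them; the helper names, the example.com URLs and the key value are placeholders.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode


def indexnow_get_url(endpoint: str, changed_url: str, key: str) -> str:
    """Build the single-URL GET notification: <endpoint>?url=...&key=..."""
    # urlencode percent-encodes the changed URL so it survives as a query value.
    return f"{endpoint}?{urlencode({'url': changed_url, 'key': key})}"


def indexnow_post_body(host: str, key: str, urls: list[str]) -> str:
    """Build a JSON body for submitting several URLs in one POST.

    Field names are an assumption based on the IndexNow documentation.
    """
    return json.dumps({"host": host, "key": key, "urlList": urls})


EXAMPLE_KEY = "af4c4e043c7d42afad6bdeeda948527d"  # placeholder, not a real key

get_url = indexnow_get_url(
    "https://api.indexnow.org/indexnow",
    "https://www.example.com/product.html",
    EXAMPLE_KEY,
)
post_body = indexnow_post_body(
    "www.example.com",
    EXAMPLE_KEY,
    ["https://www.example.com/a.html", "https://www.example.com/b.html"],
)
```

The POST body would be sent to the same /indexnow path with a JSON content type.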

This announcement is about how Cloudflare can now do this automatically for sites it hosts.

Some other hosts and CDNs support IndexNow, eg Akamai. See: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...


Why would they not use the /.well-known/ prefix for the default IndexNow key?

The default being at the root seems… stupid.


Not defending the standard, but I guess since this is a shared secret you don't want to put it at a well known location. There's a (slight) attack vector from having an attacker know the secret, since they can "launch" a crawl against a site. Maybe they could get a crawler to access private URLs or something?

Another interesting feature I saw in the standard is that you can host keys in subdirectories too.

"the location of a key file determines the set of URLs that can be included with this key. A key file located at http://example.com/catalog/key12457EDd.txt can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/help/."

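The scoping rule in that quote can be sketched as a prefix check. This is only my reading of the quoted sentence, not how any search engine actually implements it; `key_scope` and `url_in_scope` are hypothetical names.

```python
from urllib.parse import urlsplit


def key_scope(key_location: str) -> str:
    """Return the URL prefix a key file authorises: the directory it lives in."""
    parts = urlsplit(key_location)
    # Drop the file name, keep the directory with a trailing slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def url_in_scope(url: str, key_location: str) -> bool:
    """True when `url` starts with the directory of the key file."""
    return url.startswith(key_scope(key_location))


KEY_LOCATION = "http://example.com/catalog/key12457EDd.txt"
catalog_ok = url_in_scope("http://example.com/catalog/item.html", KEY_LOCATION)
help_ok = url_in_scope("http://example.com/help/faq.html", KEY_LOCATION)
```

With the key file under /catalog/, the first check passes and the second does not, matching the example in the quote.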

This wouldn’t be in a place like “.well-known/secret-key”, the key would still be part of the path. It’s just a well known prefix to put exactly this kind of thing.


I was going to say that well-known is only for stable paths, where you want to avoid collisions, not for paths with random keys... but you're right:

https://www.rfc-editor.org/rfc/rfc8615#section-3

   Registrations MAY also contain additional information, such as the
   syntax of additional path components, query strings, and/or fragment
   identifiers to be appended to the well-known URI, or protocol-
   specific details (e.g., HTTP [RFC7231] method handling).

So it could be: /.well-known/index-now/<key>

IndexNow would need to change the semantics of how they handle directories, as a key authorises only subdirectories.

I also notice there is an option for changing the filename of the IndexNow key file, but there is less flexibility about the directory it's hosted in:

    https://<searchengine>/indexnow?url=http://www.example.com/product.html&key=af4c4e043c7d42afad6bdeeda948527d&keyLocation=http://www.example.com/myIndexNowKey63638.txt

This seems like a potential vulnerability: if an attacker knows the path of a text file that contains a known (hex?) string, it looks like they could use it as a key?
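
To make the concern concrete, here is a sketch of the check a search engine might run after fetching the file at keyLocation. Everything here is an assumption: the key format (8 to 128 characters from a-z, A-Z, 0-9 and dashes) is what I remember from the IndexNow documentation, and I don't know how real engines compare the file body. `key_file_matches` is a hypothetical name.

```python
import re

# Assumed key format; verify against the IndexNow documentation.
KEY_PATTERN = re.compile(r"[A-Za-z0-9-]{8,128}")


def key_file_matches(key: str, key_file_body: str) -> bool:
    """True when the fetched file consists of exactly the submitted key."""
    if not KEY_PATTERN.fullmatch(key):
        return False
    # Ignore surrounding whitespace such as a trailing newline.
    return key_file_body.strip() == key


KEY = "af4c4e043c7d42afad6bdeeda948527d"
exact_file = key_file_matches(KEY, KEY + "\n")
file_with_extra_text = key_file_matches(KEY, "notes: " + KEY)
too_short = key_file_matches("abc", "abc")
```

Under this check, any pre-existing text file whose entire content happens to be a valid key-like string would pass, which is the scenario described above. A file that merely contains the string among other text would not.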


So it’s an alternative to sitemaps, documents that can list incrementally all the web pages of a website with their last modification date-time, but with a push model instead of a pull model?

https://www.sitemaps.org/
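
For comparison, the pull side looks roughly like this: a sitemap is an XML file listing each page with its last modification date, which crawlers fetch on their own schedule. A small Python sketch using the standard library; `build_sitemap` and the example pages are made up for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: dict[str, str]) -> str:
    """Render {url: lastmod} pairs as a sitemaps.org <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages.items():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # W3C date, e.g. YYYY-MM-DD
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    {
        "https://www.example.com/": "2022-08-27",
        "https://www.example.com/product.html": "2022-08-20",
    }
)
```

The crawler has to re-fetch this file to notice a change, whereas IndexNow has the site tell the search engine directly.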


Yes, but of course only for the search engines that support it. Eg, Google is absent from the list, although they've already had a push/ping sitemap feature.


I thought so too, but search engines supporting IndexNow are supposed to tell each other about new pages, and there's a bit more control since you can send just the URLs you want instead of an entire sitemap.


If it was widely used it's win-win. More timely updates and less traffic crawling pages which haven't changed. That's if it ever becomes a reliable signal of page changes.


Although I agree heartily with the idea of a push model for search engines, I can't help but notice that it seems to provide more centralisation to the search engines out there.

Here on HN we've been seeing posts of alternate search engines. How will those small bespoke engines make use of IndexNow unless the website participates?

The way I see IndexNow, I'll still get crawled relentlessly by the bots I don't want crawling my site (because robots.txt never seems to apply to them unless there's a special listing explicitly for them)

So, unless you're a participating search engine, a website will still be getting crawled by low hanging fruit, not alleviating the problem.

A good compromise would be something like an RSS feed, which a site can publish, and crawlers can hit for updated changes. It would also allow easier management for those domains that have many moving parts: individual search engines can be pinged, but the search engine just grabs the changes.xml file... Or something.


It looks like a search engine could get listed here: https://www.indexnow.org/searchengines.json and any website which implements IndexNow could utilize that list to know where to publish?

There already is such an "RSS" feed: it's called a sitemap, available at /sitemap.xml, or you can alternatively list the sitemap's URL in the robots.txt file


The issue with that approach is the same one that destroyed the trust in meta keywords!

The lack of trust means a search engine needs to know if what it's being presented in metadata is actually what's being served to the browser!

That's why we can't have nice things! :-)


“We’re also hopeful that Google, the world’s largest search engine and most wasteful user of Internet resources, will adopt IndexNow or a similar standard and lower the burden of search crawling on the planet.” Pretty blunt language from Cloudflare!


From the point of view of a little web site owner, Google certainly behaves weirdly. It comes several times per day (sometimes 20 times per day) even though robots.txt tells it to come only every 7 days (it's a blog with news every four days on average).

PetalSearch (Huawei) crawls at half Google's rate, Apple and Bing at half of PetalSearch's, and Yandex at half of Bing's. The other search engines respect robots.txt.

I wonder how much energy and Internet capacity is wasted because of Google and PetalSearch indexing.


Sharing your key seems like the most 90s approach to system design.

"Only you and the search engines should know the key.. so obviously, we want you to host it in plain text, in the root directory."


The key appears to not be a fixed value so unless your server allows directory scans it seems reasonably secure?


I agree, in a 90s system design meeting - security through obscurity is reasonably secure.


without referring to this particular case, how is `/.well-known/LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` less secure than `/.well-known/key` requiring an `Authorization: LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` header?


URLs are logged in all kinds of places, they end up everywhere. Authorization headers are not.
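
A toy illustration of that difference in Python, assuming an access log that records only the request line and status, which is what many default server log formats do. Some setups do log headers, so treat this as a sketch of the common case; `access_log_line` is a made-up helper.

```python
SECRET = "LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD"


def access_log_line(method: str, path: str, headers: dict) -> str:
    """Format a log entry from the request line only; headers are never written."""
    return f'"{method} {path} HTTP/1.1" 200'


# Secret carried in the URL path: it ends up in the log.
path_variant = access_log_line("GET", f"/.well-known/{SECRET}", {})

# Secret carried in a header: the log shows only the fixed path.
header_variant = access_log_line("GET", "/.well-known/key", {"Authorization": SECRET})
```

The same applies to proxies, browser history and analytics tools that record URLs.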


that's a good point, although it wouldn't make a difference from the outside perspective.

it is however much easier to serve static content than evaluating headers. the benefit of significantly increased compatibility in how you can serve the content probably outweighs the risk of logging the secret in many cases, as static content serving is compatible with virtually anything, adding additional logic to be evaluated at runtime through other means than URL contents is not as widely supported.


This is a novel concept. I wonder if / when Marginalia will get onboard and implement support for it, too.

It would be cool to be able to push the update signal to a bunch of search engines when I publish a new page (even if all of my websites get virtually no traffic and don't even come up with the appropriate, highly unique keyword combos in Google, Bing, or Marginalia - they don't even have any ads or anything terrible, perhaps terribly boring or SEO unoptimized, haha).

I wonder if there could be a market for something which collects website change information and offers an API to query for new pages / updated pages over the past X time interval across Y set of [interesting] websites. I could see this being useful, sort of like RSS but 100% general purpose.

I'd call it something like Invertdex.


> I wonder if / when Marginalia will get onboard and implement support for it, too.

I created an issue for it as a reminder. Probably not gonna implement support for this type of active crawling hints in the short term, because a lot of my model is based on full-site crawls at 8 week intervals. While something like this would help with identifying new content, it doesn't do much to identify when links go dead, so you still need to crawl passively.

Although, on some level I'm a bit uneasy, I have a hunch this may be a bit of a vulnerability. In general giving websites the tools to control the crawling process beyond like robots.txt and so on seems a bit sketch. Maybe it's possible to build checks and balances to prevent that though.

Would take some work to support real-time updates. Not impossible, but I've got a lot of work to do with regard to search result accuracy that I feel is more important.

I've been looking at using RSS feeds as a signal for when to re-index sites before, not doing that now but the general idea works I think


I’m concerned about the centralisation aspect of this (quick reading, could be wrong) which makes it harder for innovation to happen in the search industry.


IndexNow claims to distribute the pings to other participating search engines[1], and to participate they need to have "noticeable presence in at least one market", so yea still seems to be somewhat exclusive?

Eg, would Ahrefs or Semrush qualify to join the party?

1: https://www.indexnow.org/searchengines


Most search engine people don't call ahrefs or semrush search engines -- they're SEO tools.

Having to have existing market share as a search engine is going to really limit the number of participants.


There is a noticeable lack of information about how to enroll. I suspect it's more of a "if you need to ask you aren't big enough" type deal.


Or maybe just ask?

I've had staggering success with just sending emails to people, including businesses.


I agree, you can get surprisingly far by just asking.

That said, I can't find any contact information on the IndexNow website.


From the announcement blog (https://blog.cloudflare.com/crawler-hints-how-cloudflare-is-...):

This is an exciting problem to solve, and we look forward to working with others that want to help the Internet be more efficient and performant while reducing needless energy consumption. We plan on having more news to share on this front soon. If you operate a bot that relies on content freshness and are interested in working with us on this project, please email crawlerhints@cloudflare.com.


Perhaps I'm missing something but Crawler Hints seems to be completely separate from IndexNow; it's just a Cloudflare product that implements it.


for Google there is the workaround to update the sitemap.xml and then ping that sitemap to google.

i usually have a whole-sites-inventory-sitemap.xml which gets updated once per day/week/x and a limited update.rss (rss is a valid sitemap.xml format) that gets pinged either real time with every update or if-changed-5-minute interval.

wrote a book about it and other distribution concepts


Google has its own API that you can ping for individual URLs too. For WordPress sites with Yoast this happens automatically when publishing or updating posts


reference please

to my knowledge google indexing api is still whitelisting only (if not jobs search) and yoast does sitemap.xml update & ping.


Turns out I was mistaken, I really thought they had this.

Yoast updates the last modified field in the sitemap and pings Google that the sitemap changed


There's a really great opportunity to create monumental goodwill here at negligible financial cost. If 30% of gross traffic is crawlers, and half that is wasted, then that's a cornucopia.

Slice off a few percent of the savings to make the crawled/pushed/IndexedNow pile free for the public to supplement other archival efforts, train ML models on, use for small-change operator recovery, seed torrents of public-domain content, etc. and I'll call it fair. All the (relevant) actors involved are so heavily peered that they could store it on 8TB M2 sticks and it would vanish into the noise of 1/6th of gross Internet throughput.

My initial instinct was to be all cynical like "great, a new model for SEO competitiveness that locks the heavy-hung dark fiber incumbents deeper into the fabric".

But I'm trying hard to be more positive and there is a really positive outcome here.


Would be cool if archive.org could be updated using this.


Whoops, the article already quotes the archive.org folks saying they are looking forward to doing it.


Is CommonCrawl one of the IndexNow recipients for this? Seems like a big win for the open web if so to make an open index more efficient to hydrate.


It seems that only 3 search engines are currently registered: Bing, Yandex, and Seznam.

https://www.indexnow.org/searchengines.json


[flagged]


> Eschew flamebait. Avoid unrelated controversies, generic tangents, and internet tropes.

https://news.ycombinator.com/newsguidelines.html


Never heard of kiwifarms before… what a cesspool.

Cloudflare also hosts/protects a number of Austrian and German right wing and covid hoax „news sites“.

Contacted them, but they don’t care.

It’s a fine line between freedom of speech and censorship, but these websites are just absolute garbage! Spreading hate, racism, hoaxes and helping fascists take power.


Those websites also use servers, electricity, plumbing, and real estate.

You may petition your government to ban them.



