Crawler Hints supports Microsoft’s IndexNow in helping users find new content

richdougherty · on Aug 27, 2022

Here's the IndexNow standard that Cloudflare Crawler Hints is using:

https://www.indexnow.org/

The idea is that you can push a notification to a search engine when content changes instead of waiting for the crawler to notice.

https://<searchengine>/indexnow?url=url-changed&key=your-key

You can also submit more than one URL with a POST.

You can notify Bing at https://www.bing.com/indexnow?url=url-changed&key=your-key

If you notify the IndexNow API endpoint it notifies Bing plus other search engines on your behalf:

https://api.indexnow.org/indexnow?url=url-changed&key=your-k...
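
A minimal Python sketch of the two notification styles mentioned above. It only builds the request URL and body and does not send anything. The GET parameters (`url`, `key`) come from the examples in this comment; the POST field names (`host`, `key`, `urlList`) are my reading of the IndexNow docs and should be treated as an assumption, as should the helper names.

```python
import json
from urllib.parse import urlencode


def indexnow_get_url(endpoint, changed_url, key):
    """Build the single-URL GET notification (url=...&key=...)."""
    # urlencode percent-encodes the changed URL so it is safe as a query value.
    return endpoint + "?" + urlencode({"url": changed_url, "key": key})


def indexnow_post_body(host, key, urls):
    """Build a JSON body for submitting several URLs in one POST.

    The field names here are assumed, not confirmed by this thread.
    """
    return json.dumps({"host": host, "key": key, "urlList": list(urls)})


if __name__ == "__main__":
    print(indexnow_get_url(
        "https://api.indexnow.org/indexnow",
        "https://www.example.com/product.html",
        "your-key",
    ))
```

The body from `indexnow_post_body` would be sent with a JSON content type to the same `/indexnow` path, again assuming the docs are as I remember them.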

This announcement is about how Cloudflare can now do this automatically for sites it hosts.

Some other hosts and CDNs support IndexNow, e.g. Akamai. See: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...

orf · on Aug 27, 2022

Why would they not use the /.well-known/ prefix for the default IndexNow key?

The default being at the root seems… stupid.

richdougherty · on Aug 27, 2022

Not defending the standard, but I guess since this is a shared secret you don't want to put it at a well-known location. There's a (slight) attack vector from having an attacker know the secret, since they can "launch" a crawl against a site. Maybe they could get a crawler to access private URLs or something?

Another interesting feature I saw in the standard is that you can host keys in subdirectories too.

"the location of a key file determines the set of URLs that can be included with this key. A key file located at http://example.com/catalog/key12457EDd.txt can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/help/."

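The scoping rule quoted above can be sketched in Python. This applies the quoted text literally (same scheme and host, and the URL path must start with the directory holding the key file). The function name is hypothetical, and a real search engine may normalise URLs differently.

```python
from urllib.parse import urlsplit


def key_covers_url(key_location, url):
    """Return True if a key file at key_location may be used to submit url."""
    key, target = urlsplit(key_location), urlsplit(url)
    # Assumption: scheme and host must match exactly, as in the quoted example.
    if (key.scheme, key.netloc) != (target.scheme, target.netloc):
        return False
    # Directory containing the key file, with a trailing slash.
    directory = key.path.rsplit("/", 1)[0] + "/"
    return target.path.startswith(directory)


if __name__ == "__main__":
    key_file = "http://example.com/catalog/key12457EDd.txt"
    print(key_covers_url(key_file, "http://example.com/catalog/item.html"))
    print(key_covers_url(key_file, "http://example.com/help/faq.html"))
```

A key file at the site root yields the directory `/`, so under this sketch it covers every path on that host.
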
orf · on Aug 27, 2022

This wouldn’t be in a place like “.well-known/secret-key”, the key would still be part of the path. It’s just a well known prefix to put exactly this kind of thing.

richdougherty · on Aug 27, 2022

I was going to say that well-known is only for stable paths, where you want to avoid collisions, not for paths with random keys... but you're right:

https://www.rfc-editor.org/rfc/rfc8615#section-3

   Registrations MAY also contain additional information, such as the
   syntax of additional path components, query strings, and/or fragment
   identifiers to be appended to the well-known URI, or protocol-
   specific details (e.g., HTTP [RFC7231] method handling).

So it could be: /.well-known/index-now/<key>

IndexNow would need to change the semantics of how they handle directories, as a key authorises only subdirectories.

I also notice there is an option for changing the filename of the IndexNow key file, but there is less flexibility about the directory it's hosted in:

    https://<searchengine>/indexnow?url=http://www.example.com/product.html&key=af4c4e043c7d42afad6bdeeda948527d&keyLocation=http://www.example.com/myIndexNowKey63638.txt

This seems like a potential vulnerability: if an attacker knows the path of a text file that contains a known (hex?) string, it looks like they could use it as a key?
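
To make that concern concrete, here is a rough Python sketch of the kind of check a search engine might run after fetching `keyLocation`. Both rules are assumptions on my part rather than something confirmed in this thread: that a key is 8 to 128 characters of letters, digits and dashes, and that the file body must equal the key exactly. The function name is hypothetical.

```python
import re

# Assumed key format: 8 to 128 characters from a-z, A-Z, 0-9 and '-'.
KEY_PATTERN = re.compile(r"[A-Za-z0-9-]{8,128}")


def key_file_accepts(key, file_body):
    """Return True if the fetched file would validate key under the assumed rules."""
    # Surrounding whitespace (e.g. a trailing newline) is ignored.
    return bool(KEY_PATTERN.fullmatch(key)) and file_body.strip() == key


if __name__ == "__main__":
    key = "af4c4e043c7d42afad6bdeeda948527d"
    print(key_file_accepts(key, key + "\n"))
    print(key_file_accepts(key, "release notes: " + key))
```

Under these assumptions, any publicly readable text file whose entire content is a key-shaped string would pass, while a file that merely contains such a string among other text would not. If the real check is looser than exact equality, the exposure would be larger.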

speedgoose · on Aug 27, 2022

So it’s an alternative to sitemaps, documents that can list incrementally all the web pages of a website with their last modification date-time, but with a push model instead of a pull model?

https://www.sitemaps.org/

rob-olmos · on Aug 27, 2022

Yes, but of course only for the search engines that support it. E.g., Google is absent from the list, although they've already had a push/ping sitemap feature.

inerte · on Aug 28, 2022

I thought so too, but search engines supporting IndexNow are supposed to tell each other about new pages, and there's a bit more control since you can send just the URLs you want instead of an entire sitemap.

richdougherty · on Aug 28, 2022

If it was widely used it's win-win. More timely updates and less traffic crawling pages which haven't changed. That's if it ever becomes a reliable signal of page changes.

cmroanirgo · on Aug 27, 2022

Although I agree heartily with the idea of a push model for search engines, I can't help but notice that it seems to provide more centralisation to the search engines out there.

Here on HN we've been seeing posts of alternate search engines. How will those small bespoke engines make use of IndexNow unless the website participates?

The way I see IndexNow, I'll still get crawled relentlessly by the bots I don't want crawling my site (because robots.txt never seems to apply to them unless there's a special listing explicitly for them)

So, unless you're a participating search engine, a website will still be getting crawled by low hanging fruit, not alleviating the problem.

A good compromise would be something like an RSS feed, which a site can publish, and crawlers can hit for updated changes. It would also allow easier management for those domains that have many moving parts: individual search engines can be pinged, but the search engine just grabs the changes.xml file... Or something.

rstupek · on Aug 27, 2022

It looks like a search engine could get listed here: https://www.indexnow.org/searchengines.json and any website which implements IndexNow could utilize that list to know where to publish?

There already is such an "RSS" feed: it's called a sitemap, available at /sitemap.xml, or you can alternatively list its URL in your robots.txt file.
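
For comparison with the push model, here is a small Python sketch of the pull side: a crawler reading a sitemap and keeping only entries whose `<lastmod>` is on or after a cutoff date. The sample sitemap and the function name are made up for illustration; the XML namespace is the one published on sitemaps.org.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, as a crawler might fetch from /sitemap.xml.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2022-08-01</lastmod></url>
  <url><loc>https://www.example.com/new.html</loc><lastmod>2022-08-27</lastmod></url>
</urlset>"""


def changed_since(sitemap_xml, cutoff):
    """Return the <loc> values whose <lastmod> date is on or after cutoff."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        # lastmod may carry a time part; only the leading YYYY-MM-DD is compared.
        if loc and lastmod and date.fromisoformat(lastmod.strip()[:10]) >= cutoff:
            changed.append(loc.strip())
    return changed


if __name__ == "__main__":
    print(changed_since(SITEMAP, date(2022, 8, 20)))
```

The crawler still has to poll the sitemap and trust the `lastmod` values, which is the trade-off against a push notification.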

firecall · on Aug 28, 2022

The issue with that approach is the same one that destroyed the trust in meta keywords!

The lack of trust means a search engine needs to know if what it's being presented in metadata is actually what's being served to the browser!

That's why we can't have nice things! :-)

taylorfinley · on Aug 27, 2022

“We’re also hopeful that Google, the world’s largest search engine and most wasteful user of Internet resources, will adopt IndexNow or a similar standard and lower the burden of search crawling on the planet.” Pretty blunt language from Cloudflare!

JPLeRouzic · on Aug 28, 2022

From the point of view of a small website owner, Google certainly behaves oddly. It comes several times per day (sometimes 20 times per day) even if robots.txt tells it to come only every 7 days (it's a blog with news every four days on average).

PetalSearch (Huawei) crawls at half the rate of Google, Apple and Bing at half the rate of PetalSearch, Yandex at half the rate of Bing. The other search engines respect robots.txt.

I wonder how much energy and Internet capacity is wasted because of Google and PetalSearch indexing.

dgivney · on Aug 27, 2022

Sharing your key seems like the most 90s approach to system design.

"Only you and the search engines should know the key.. so obviously, we want you to host it in plain text, in the root directory."

rstupek · on Aug 27, 2022

The key appears to not be a fixed value so unless your server allows directory scans it seems reasonably secure?

dgivney · on Aug 27, 2022

I agree, in a 90s system design meeting - security through obscurity is reasonably secure.

zhfliz · on Aug 27, 2022

without referring to this particular case, how is `/.well-known/LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` less secure than `/.well-known/key` requiring an `Authorization: LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` header?

JimDabell · on Aug 27, 2022

URLs are logged in all kinds of places, they end up everywhere. Authorization headers are not.

zhfliz · on Aug 27, 2022

that's a good point, although it wouldn't make a difference from the outside perspective.

it is however much easier to serve static content than to evaluate headers. the benefit of significantly increased compatibility in how you can serve the content probably outweighs the risk of logging the secret in many cases: static content serving is compatible with virtually anything, whereas adding logic that is evaluated at runtime on something other than the URL is not as widely supported.

metadat · on Aug 27, 2022

This is a novel concept. I wonder if / when Marginalia will get onboard and implement support for it, too.

It would be cool to be able to push the update signal to a bunch of search engines when I publish a new page (even if all of my websites get virtually no traffic and don't even come up with the appropriate, highly unique keyword combos in Google, Bing, or Marginalia - they don't even have any ads or anything terrible, perhaps terribly boring or SEO unoptimized, haha).

I wonder if there could be a market for something which collects website change information and offers an API to query for new pages / updated pages over the past X time interval across Y set of [interesting] websites. I could see this being useful, sort of like RSS but 100% general purpose.

I'd call it something like Invertdex.

marginalia_nu · on Aug 27, 2022

> I wonder if / when Marginalia will get onboard and implement support for it, too.

I created an issue for it as a reminder. Probably not gonna implement support for this type of active crawling hints in the short term, because a lot of my model is based on full-site crawls at 8 week intervals. While something like this would help with identifying new content, it doesn't do much to identify when links go dead, so you still need to crawl passively.

Although, on some level I'm a bit uneasy, I have a hunch this may be a bit of a vulnerability. In general giving websites the tools to control the crawling process beyond like robots.txt and so on seems a bit sketch. Maybe it's possible to build checks and balances to prevent that though.

Would take some work to support real-time updates. Not impossible, but I've got a lot of work to do with regard to search result accuracy that I feel is more important.

I've been looking at using RSS feeds as a signal for when to re-index sites before, not doing that now but the general idea works I think

pacifika · on Aug 27, 2022

I’m concerned about the centralisation aspect of this (quick reading, could be wrong) which makes it harder for innovation to happen in the search industry.

rob-olmos · on Aug 27, 2022

IndexNow claims to distribute the pings to other participating search engines[1], and to participate they need to have "noticeable presence in at least one market", so yea still seems to be somewhat exclusive?

Eg, would Ahrefs or Semrush qualify to join the party?

1: https://www.indexnow.org/searchengines

wumpus · on Aug 27, 2022

Most search engine people don't call ahrefs or semrush search engines -- they're SEO tools.

Having to have existing market share as a search engine is going to really limit the number of participants.

richbell · on Aug 27, 2022

There is a noticeable lack of information about how to enroll. I suspect it's more of a "if you need to ask you aren't big enough" type deal.

marginalia_nu · on Aug 27, 2022

Or maybe just ask?

I've had staggering success with just sending emails to people, including businesses.

richbell · on Aug 27, 2022

I agree, you can get surprisingly far by just asking.

That said, I can't find any contact information on the IndexNow website.

jgrahamc · on Aug 28, 2022

From the announcement blog (https://blog.cloudflare.com/crawler-hints-how-cloudflare-is-...):

This is an exciting problem to solve, and we look forward to working with others that want to help the Internet be more efficient and performant while reducing needless energy consumption. We plan on having more news to share on this front soon. If you operate a bot that relies on content freshness and are interested in working with us on this project, please email crawlerhints@cloudflare.com.

richbell · on Aug 28, 2022

Perhaps I'm missing something, but Crawler Hints seems to be completely separate from IndexNow; it's just a Cloudflare product that implements it.

franze · on Aug 27, 2022

for Google there is the workaround to update the sitemap.xml and then ping that sitemap to google.

i usually have a whole-sites-inventory-sitemap.xml which gets updated once per day/week/x and a limited update.rss (rss is a valid sitemap.xml format) that gets pinged either real time with every update or if-changed-5-minute interval.

wrote a book about it and other distribution concepts

weird-eye-issue · on Aug 28, 2022

Google has its own API that you can ping for individual URLs too. For WordPress sites with Yoast this happens automatically when publishing or updating posts

franze · on Aug 28, 2022

reference please

to my knowledge google indexing api is still whitelisting only (if not jobs search) and yoast does sitemap.xml update & ping.

weird-eye-issue · on Aug 28, 2022

Turns out I was mistaken, I really thought they had this.

Yoast updates the last modified field in the sitemap and pings Google that the sitemap changed

benreesman · on Aug 28, 2022

There's a really great opportunity to create monumental goodwill here at negligible financial cost. If 30% of gross traffic is crawlers, and half that is wasted, then that's a cornucopia.

Slice off a few percent of the savings to make the crawled/pushed/IndexedNow pile free for the public to supplement other archival efforts, train ML models on, use for small-change operator recovery, seed torrents of public-domain content, etc. and I'll call it fair. All the (relevant) actors involved are so heavily peered that they could store it on 8TB M2 sticks and it would vanish into the noise of 1/6th of gross Internet throughput.

My initial instinct was to be all cynical like "great, a new model for SEO competitiveness that locks the heavy-hung dark fiber incumbents deeper into the fabric".

But I'm trying hard to be more positive and there is a really positive outcome here.

pabs3 · on Aug 28, 2022

Would be cool if archive.org could be updated using this.

pabs3 · on Aug 28, 2022

Whoops, the article already quotes the archive.org folks saying they are looking forward to doing it.

kwerk · on Aug 27, 2022

Is CommonCrawl one of the IndexNow recipients for this? Seems like a big win for the open web if so to make an open index more efficient to hydrate.

richbell · on Aug 27, 2022

It seems that only 3 search engines are currently registered: Bing, Yandex, and Seznam.

https://www.indexnow.org/searchengines.json

pingiun · on Aug 27, 2022

[flagged]

richbell · on Aug 27, 2022

> Eschew flamebait. Avoid unrelated controversies, generic tangents, and internet tropes.

https://news.ycombinator.com/newsguidelines.html

BonoboIO · on Aug 27, 2022

Never heard of kiwifarms before… what a cesspool.

Cloudflare also hosts/protects a number of Austrian and German right wing and covid hoax „news sites“.

Contacted them, but they don’t care.

It’s a fine line between freedom of speech and censorship, but these websites are just absolute garbage! They spread hate, racism and hoaxes, and help fascists take power.

lupire · on Aug 28, 2022

Those websites also use servers, electricity, plumbing, and real estate.

You may petition your government to ban them.