Not defending the standard, but I guess since this is a shared secret you don't want to put it at a well known location. There's a (slight) attack vector from having an attacker know the secret, since they can "launch" a crawl against a site. Maybe could get a crawler to access private URLs or something?
Another interesting feature I saw in the standard is that you can host keys in subdirectories too.
This wouldn’t be in a place like “.well-known/secret-key”, the key would still be part of the path. It’s just a well known prefix to put exactly this kind of thing.
Registrations MAY also contain additional information, such as the
syntax of additional path components, query strings, and/or fragment
identifiers to be appended to the well-known URI, or protocol-
specific details (e.g., HTTP [RFC7231] method handling).
So it could be: /.well-known/index-now/<key>
IndexNow would need to change the semantics of how they handle directories, as a key authorises only subdirectories.
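To make the directory-scoping rule concrete, here is a rough sketch of the check a receiving search engine might run, assuming a key file authorises only URLs on the same host that sit at or below the directory hosting it. The function name `key_authorises` and the example URLs are made up for illustration; this is not taken from any search engine's actual implementation.

```python
import posixpath
from urllib.parse import urlsplit


def key_authorises(key_location: str, url: str) -> bool:
    """Return True if `url` sits under the directory that hosts the key file.

    Hypothetical helper: assumes same scheme + host and a simple path-prefix
    rule. It does not normalise "../" segments or percent-encoding.
    """
    key, target = urlsplit(key_location), urlsplit(url)
    if (key.scheme, key.netloc.lower()) != (target.scheme, target.netloc.lower()):
        return False
    # Directory of the key file, always ending in "/" ("/" for a root-level key).
    key_dir = posixpath.dirname(key.path).rstrip("/") + "/"
    return (target.path or "/").startswith(key_dir)
```

With this rule a key at https://example.com/catalog/abc123.txt would cover https://example.com/catalog/item/1 but not https://example.com/help/faq, while a key at the site root covers everything on that host. A /.well-known/ location would cover nothing useful, which is why the semantics would have to change.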
I also notice there is an option for changing the filename of the IndexNow key file, but there is less flexibility about the directory it's hosted in:
This seems like a potential vulnerability: if an attacker knows the path of a text file that contains a known (hex?) string, it looks like they could use it as a key?
So it’s an alternative to sitemaps, documents that can list incrementally all the web pages of a website with their last modification date-time, but with a push model instead of a pull model?
Yes, but of course only for the search engines that support it. E.g., Google is absent from the list, although they already have a push/ping sitemap feature.
I thought so too, but search engines supporting IndexNow are supposed to tell each other about new pages, and there's a bit more control since you can send just the URLs you want instead of an entire sitemap.
If it was widely used it's win-win. More timely updates and less traffic crawling pages which haven't changed. That's if it ever becomes a reliable signal of page changes.
Although I agree heartily with the idea of a push model for search engines, I can't help but notice that it seems to provide more centralisation to the search engines out there.
Here on HN we've been seeing posts of alternate search engines. How will those small bespoke engines make use of IndexNow unless the website participates?
The way I see IndexNow, I'll still get crawled relentlessly by the bots I don't want crawling my site (because robots.txt never seems to apply to them unless there's a special listing explicitly for them)
So, unless you're a participating search engine, a website will still be getting crawled by low hanging fruit, not alleviating the problem.
A good compromise would be something like an RSS feed, which a site can publish, and crawlers can hit for updated changes. It would also allow easier management for those domains that have many moving parts: individual search engines can be pinged, but the search engine just grabs the changes.xml file... Or something.
It looks like a search engine could get listed here: https://www.indexnow.org/searchengines.json and any website which implements IndexNow could utilize that list to know where to publish?
There already is such an "RSS" feed: it's called a sitemap, available at /sitemap.xml, or you can alternatively list your sitemap's URL in the robots.txt file.
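As a rough illustration of the pull model, a crawler could read the sitemap and keep only the entries whose lastmod is newer than its previous visit. The sketch below uses only the Python standard library; `SAMPLE`, `parse_lastmod` and `changed_since` are hypothetical names, and the sample URLs and dates are invented.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample data, standing in for a fetched /sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2021-01-05</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2021-10-20T08:30:00+00:00</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>"""


def parse_lastmod(text: str) -> datetime:
    # Handles full dates and date-times; year-only or year-month values
    # (also allowed in sitemaps) would raise ValueError here.
    dt = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    # Assumption: treat values without a timezone as UTC.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def changed_since(sitemap_xml: str, since: datetime) -> list[str]:
    """Return the <loc> of every entry whose <lastmod> is later than `since`."""
    changed = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Entries without lastmod are skipped: the sitemap says nothing about them.
        if loc and lastmod and parse_lastmod(lastmod) > since:
            changed.append(loc.strip())
    return changed
```

The catch, of course, is that this only works if the site keeps lastmod accurate and the crawler still has to fetch the whole file to find out whether anything changed.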
“We’re also hopeful that Google, the world’s largest search engine and most wasteful user of Internet resources, will adopt IndexNow or a similar standard and lower the burden of search crawling on the planet.” Pretty blunt language from CloudFlare!
From the point of view of a little web site owner, Google certainly has a weird behavior. It comes several times per day (sometimes 20 times per day) even if robots.txt tells it to come only every 7 days (it's a blog with news every four days on average).
Petalsearch (Huawei) indexing rate is half of Google, Apple and Bing half of Petalsearch, Yandex half of Bing. The other search engines respect the robots.txt.
I wonder how much energy and Internet capacity is wasted because of Google and PetalSearch indexing.
without referring to this particular case, how is `/.well-known/LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` less secure than `/.well-known/key` requiring an `Authorization: LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` header?
that's a good point, although it wouldn't make a difference from the outside perspective.
it is however much easier to serve static content than evaluating headers.
the benefit of significantly increased compatibility in how you can serve the content probably outweighs the risk of logging the secret in many cases, as static content serving is compatible with virtually anything, adding additional logic to be evaluated at runtime through other means than URL contents is not as widely supported.
This is a novel concept. I wonder if / when Marginalia will get onboard and implement support for it, too.
It would be cool to be able to push the update signal to a bunch of search engines when I publish a new page (even if all of my websites get virtually no traffic and don't even come up with the appropriate, highly unique keyword combos in Google, Bing, or Marginalia - they don't even have any ads or anything terrible, perhaps terribly boring or SEO unoptimized, haha).
I wonder if there could be a market for something which collects website change information and offers an API to query for new pages / updated pages over the past X time interval across Y set of [interesting] websites. I could see this being useful, sort of like RSS but 100% general purpose.
> I wonder if / when Marginalia will get onboard and implement support for it, too.
I created an issue for it as a reminder. Probably not gonna implement support for this type of active crawling hints in the short term, because a lot of my model is based on full-site crawls at 8 week intervals. While something like this would help with identifying new content, it doesn't do much to identify when links go dead, so you still need to crawl passively.
Although, on some level I'm a bit uneasy, I have a hunch this may be a bit of a vulnerability. In general giving websites the tools to control the crawling process beyond like robots.txt and so on seems a bit sketch. Maybe it's possible to build checks and balances to prevent that though.
Would take some work to support real-time updates. Not impossible, but I've got a lot of work to do with regard to search result accuracy that I feel is more important.
I've been looking at using RSS feeds as a signal for when to re-index sites before, not doing that now but the general idea works I think
I’m concerned about the centralisation aspect of this (quick reading, could be wrong) which makes it harder for innovation to happen in the search industry.
IndexNow claims to distribute the pings to other participating search engines[1], and to participate they need to have "noticeable presence in at least one market", so yea still seems to be somewhat exclusive?
Eg, would Ahrefs or Semrush qualify to join the party?
This is an exciting problem to solve, and we look forward to working with others that want to help the Internet be more efficient and performant while reducing needless energy consumption. We plan on having more news to share on this front soon. If you operate a bot that relies on content freshness and are interested in working with us on this project, please email crawlerhints@cloudflare.com.
for Google there is the workaround to update the sitemap.xml and then ping that sitemap to google.
i usually have a whole-sites-inventory-sitemap.xml which gets updated once per day/week/x and a limited update.rss (rss is a valid sitemap.xml format) that gets pinged either real time with every update or if-changed-5-minute interval.
wrote a book about it and other distribution concepts
Google has its own API that you can ping for individual URLs too. For WordPress sites with Yoast this happens automatically when publishing or updating posts
There's a really great opportunity to create monumental goodwill here at negligible financial cost. If 30% of gross traffic is crawlers, and half that is wasted, then that's a cornucopia.
Slice off a few percent of the savings to make the crawled/pushed/IndexedNow pile free for the public to supplement other archival efforts, train ML models on, use for small-change operator recovery, seed torrents of public-domain content, etc. and I'll call it fair. All the (relevant) actors involved are so heavily peered that they could store it on 8TB M2 sticks and it would vanish into the noise of 1/6th of gross Internet throughput.
My initial instinct was to be all cynical like "great, a new model for SEO competitiveness that locks the heavy-hung dark fiber incumbents deeper into the fabric".
But I'm trying hard to be more positive and there is a really positive outcome here.
Cloudflare also hosts/protects a number of Austrian and German right wing and covid hoax „news sites“.
Contacted them, but they don’t care.
It’s a fine line between freedom of speech and censorship, but these websites are just absolute garbage! Spreading hate, racism, hoaxes and helping fascists to take over power.
https://www.indexnow.org/
The idea is that you can push a notification to a search engine when content changes instead of waiting for the crawler to notice.
https://<searchengine>/indexnow?url=url-changed&key=your-key
You can also submit more than one URL with a POST.
You can notify Bing at https://www.bing.com/indexnow?url=url-changed&key=your-key
If you notify the IndexNow API endpoint it notifies Bing plus other search engines on your behalf:
https://api.indexnow.org/indexnow?url=url-changed&key=your-k...
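For anyone who wants to script this, here is a minimal sketch of both request shapes using only the Python standard library. The helper names are made up, and the JSON field names in the POST body (host, key, keyLocation, urlList) are my reading of the published IndexNow protocol, so check them against the docs before relying on this. Nothing is sent; the code only builds the requests.

```python
import json
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def single_url_ping(endpoint: str, url: str, key: str) -> str:
    """Build the GET URL for notifying one changed page."""
    return f"{endpoint}?{urlencode({'url': url, 'key': key})}"


def bulk_request(endpoint, urls, key, key_location=None) -> Request:
    """Build (but do not send) a POST carrying several changed URLs.

    Assumes all URLs share one host; field names are taken from the
    IndexNow docs as I understand them.
    """
    body = {"host": urlsplit(urls[0]).netloc, "key": key, "urlList": list(urls)}
    if key_location:
        body["keyLocation"] = key_location
    return Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Passing the result of `bulk_request(...)` to `urllib.request.urlopen` would actually submit it; the endpoint can be a specific search engine or the shared IndexNow one.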
This announcement is about how CloudFlare can now do this automatically for sites it hosts.
Some other hosts and CDNs support IndexNow, eg Akamai. See: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...