dostick's comments

Actually, four emails, not ten. The author writes as if it’s some conspiracy between sellers and shipping companies to maximise the number of emails. Each one sends with any excuse it has. Email is treated as a drop box of transactional notes that a business sends to the customer’s inbox so the customer can always find that info if they ever need it. It’s not the frivolous sending we need to fix, but the lack of a standard “Receipts” folder, something Gmail’s auto folders do in a half-assed way. These emails should bypass the inbox and go straight to that special folder, and it should have a standard name so customer service can say “look in your Receipts folder”.
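As a sketch of the routing idea: a client could file transactional mail into a standard folder before it hits the inbox. Everything here is hypothetical; the subject keywords are a stand-in for whatever real standard marker (most likely a dedicated header) would be agreed on:

```python
from email import message_from_string

# Hypothetical hints; a real standard would use a dedicated header,
# not subject-line guessing.
RECEIPT_HINTS = ("receipt", "order confirmation", "shipping confirmation")

def target_folder(raw_message: str) -> str:
    """Return the folder a message should land in, bypassing the
    inbox for transactional mail."""
    msg = message_from_string(raw_message)
    subject = (msg["Subject"] or "").lower()
    return "Receipts" if any(h in subject for h in RECEIPT_HINTS) else "Inbox"

print(target_folder("Subject: Your receipt from ExampleShop\n\nThanks!"))
# Receipts
```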

That said, two “We received your order” emails are unnecessary, as is “create an account”. But if they keep sending those, they must be working? Or do they send them even if only a handful of people click on them?


The real score should be around 50% or less. The scoring system was done as a joke without much thought and compares a lot of apples to oranges. “Aw my balls” is counted as equal to Jackass, and even while describing what’s different about them, it counts them as equal. A Costco degree is not equal to a Microsoft degree, etc.

The tests should have negative weights based on how often each issue is encountered and its impact. Item 2, SPA, should carry something like 8 negative points out of 10, as it’s the most common blocker. And the whole test should use an inverse score.
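A minimal sketch of that kind of inverse, penalty-weighted scoring; the issue names and weights here are made up for illustration:

```python
# Start from 100 and subtract a penalty per failed check, weighted by
# how often the issue blocks agents and how badly. Hypothetical weights.
PENALTIES = {
    "spa_shell": 80,     # most common blocker, weighted heaviest
    "robots_block": 30,
    "js_only_nav": 20,
}

def score(failed_checks):
    """100 minus the summed penalties of failed checks, floored at 0."""
    return max(0, 100 - sum(PENALTIES[c] for c in failed_checks))

print(score(["spa_shell"]))                  # 20
print(score(["spa_shell", "robots_block"]))  # 0
```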

Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how it performs for them. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/

My weighting system there scores the number of pages affected by SPA and caps the possible score at a "D" or "F" depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-i...
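The capping rule could be as simple as taking the worse of the raw grade and a cap derived from the affected-page fraction. The thresholds below are placeholders, not the actual afdocs.dev rules:

```python
GRADES = ["A", "B", "C", "D", "F"]  # best to worst

def capped_grade(raw_grade: str, spa_fraction: float) -> str:
    """Cap the grade at D or F depending on how many pages are SPA
    shells. Thresholds here are hypothetical."""
    if spa_fraction >= 0.5:
        cap = "F"
    elif spa_fraction > 0.0:
        cap = "D"
    else:
        return raw_grade
    # Whichever is worse: the raw grade or the cap.
    return max(raw_grade, cap, key=GRADES.index)

print(capped_grade("A", 0.6))  # F
print(capped_grade("C", 0.1))  # D
print(capped_grade("B", 0.0))  # B
```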

I've tried to weight things appropriately when assessing actual sites, but for the test here, I mainly wanted to let people see for themselves what types of failures can occur.


Is mentioning ULTRATHINK in a prompt the equivalent of /effort max?

Yes, but only for the message that includes it. Whereas /effort max keeps it at max effort for the entire convo, to my knowledge.

No, ultrathink puts it in /effort high mode. There's no keyword for one turn of effort max.

How is your offering different from local Ollama?

It's batteries included. No config.

We also fine-tuned and did RL on our model, developed a custom context engine, trained an embedding model, and modified MLX to improve inference.

Everything is built to work together, so it’s more like an Apple product than Linux: less config, but better optimized for the task.


I only understood half of the tech jargon in your answer. If I understood it all, I’d probably run it myself. If someone less knowledgeable than me is your customer, you need to explain it in simpler terms!

Fair enough! The simple answer is: we did a lot of work to make the model better at coding without requiring complicated installation or configuration. One command to install and run.

All the benefits of Claude Code, without any of the limitations or rug pulls.


I’m not nitpicking, but are you saying it’s better than Claude or Codex? Is it also focused and tested mainly on web/JS technologies? Building native apps is still very much an uphill battle. I think there’s an untapped market for Swift/Android coding models.

Actually, we're more broadly trained than most models. We did long-tail training across languages, so we improved execution with languages like Java, Swift, and even COBOL.

It's definitely a David vs. Goliath fight. But we know there's a subset of devs who need the privacy or unlimited nature of local models.


Since Tor has become increasingly susceptible to state monitoring of exit nodes, making your app rely on Tor potentially compromises your future users. Look into I2P or another protocol that’s genuinely anonymous.

Yes, I agree Tor is not the best anonymity service these days. I2P was my first choice, but the performance was just awful. I haven't fully given up on the I2P idea (maybe it was just a bad day), so I will give it a second chance and maybe add a third mode or fully replace Tor. Not sure, since a lot of people are familiar with Tor and not I2P.

Whom do you want to please with Tor support? You have the advantage of not being a commercial product driven by recognition; you're free to base it on the next, better thing.
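For what it's worth, a transport switch can be tiny when both networks are exposed as local proxies. This sketch assumes the typical default ports (Tor's SOCKS proxy on 9050, I2P's HTTP proxy on 4444); an actual setup may differ:

```python
# Map each anonymity mode to the proxy URL an HTTP client should use.
# Ports assume a default local Tor / I2P install.
PROXIES = {
    "tor": "socks5h://127.0.0.1:9050",  # socks5h: DNS resolved via Tor too
    "i2p": "http://127.0.0.1:4444",
}

def proxy_for(mode: str):
    """Return the proxy URL for the chosen mode, or None for clearnet."""
    return PROXIES.get(mode)

print(proxy_for("i2p"))  # http://127.0.0.1:4444
```

Adding a third mode then amounts to adding one entry and documenting its default port.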

“Go on” works fine too

The post reminded me of how I investigated a similar issue with no idea where to start. Using Claude or GPT to investigate this kind of hardware issue is fast and easy: it gives you the next command to try, then the next one, and you end up with a similar summary. I wouldn’t be surprised if the author didn’t know anything about displays before this.

So that’s what it is! I was wondering why reducing context and summarising still makes it make mistakes and forget the steering, and I couldn’t find an explanation for why it starts ignoring instructions when the context isn’t full at all. How did you find out that tool calls are what degrade it? Isn’t this the biggest problem there is, and not just a “design tension”?


That’s quite weak confidence in their own platform security if finding a root-level vulnerability is not a one-off event, but rather something the program expects multiple people to be routinely finding.


Well, it's selection bias.

If an athlete breaks a world record, they're likely to do it again, even though breaking a world record is incredibly hard.
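The effect is easy to simulate: give each athlete a latent ability, and the ones who clear the bar once are disproportionately the high-ability ones, so they clear it again far more often than the base rate (all numbers here are made up):

```python
import random

random.seed(0)
BAR = 2.0  # stand-in for "world record" difficulty

# Latent ability per athlete; each performance is ability plus noise.
athletes = [random.gauss(0, 1) for _ in range(100_000)]

def performs_above_bar(ability):
    return ability + random.gauss(0, 1) > BAR

breakers = [a for a in athletes if performs_above_bar(a)]
base_rate = sum(performs_above_bar(a) for a in athletes) / len(athletes)
repeat_rate = sum(performs_above_bar(a) for a in breakers) / len(breakers)

print(f"base rate: {base_rate:.1%}, repeat rate among breakers: {repeat_rate:.1%}")
```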

