Why, though? Just because some people would find it odd? Who cares?
Trying to limit or disallow something seems to hurt the overall accuracy of models. And it makes sense if you think about it. Most of our long-horizon content comes in the form of novels and longer works. If you try to clamp the machine to machine-speak, you'll lose all those learnings. Hero starts with a problem, hero works the problem, hero reaches an impasse, hero makes a choice, hero gets the princess. That structure can be (and probably is) useful.
Is it? I don't think most of the content LLMs are trained on is written in the first person. Wikipedia, news articles, and other informational articles aren't written in the first person. Most novels, or at least a substantial portion of them, aren't either.
LLMs write in the first person because they have been specifically finetuned for a chat task; it's not a fundamental feature of language models that would have to be specifically disallowed.
While I'm skeptical of any "beats Opus" claims (many have been made, none turned out to be true), I still think it's insane that we can now run close-to-SotA models locally on ~100k worth of hardware for a small team and be 100% sure that the data stays local. Should be a no-brainer for teams that work in areas where privacy matters.
Even the smaller quantized models which can run on consumer hardware pack in an almost unfathomable amount of knowledge. I don't think I expected to be able to run a 'local Google' in my lifetime before the LLM boom.
I'm extremely curious how these models learn to pack a lossily-compressed representation of the entire Internet (more or less) into a few hundred billion parameters. like, what's the ontology?
I think this one is only about 600GB of VRAM usage, so it could fit on two Mac Studios with 512GB of unified memory each. That would have cost (albeit no longer available) something like less than 20k.
Yeah, but that's personal use at best, not much agentic anything happening on that hardware. Macs are great for small models at small-to-medium context lengths, but at > 64k context (very common with agentic usage) they struggle and slow down a lot.
The ~100k hardware is suitable for multi-user, small-team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure, Macs could work.
You could run it with SSD offload; earlier experiments with Kimi 2.5 on M5 hardware had it running at 2 tok/s. K2.6 has a similar number of total and active parameters.
Yeah... I would definitely call 2t/s unusable. For simple chats, I'd want at least 15 t/s. For agentic coding (which this model is advertised for), I'd want good prefill performance as well.
That's just throwing money away. The performance with large context would be unusable, especially if you need to serve more than a single person.
> couldn’t be different than how Claude Code was received by software devs.

It’s simply useless for designers; their workflow is very different from software devs'. You can’t “oh, let Claude Design come up with a quick logo for this” in the same way that Claude Code was able to quickly solve small annoyances for devs.
Haha, that's exactly how cc was received initially. It's just autocomplete. It's useless. It can't even x. I tried to y and it gave me z. Over and over all over the internet this was the reaction. Then the bargaining began. Oh, it will maybe speed up some simple things. Like autocomplete on steroids. Maaaybe do some junior tasks once in a while. And so on...
Agreed - for the last 20 years or so, designers at basecamp.com have done all of their frontend design directly in Rails/HTML/CSS and then had the developers "re-implement it". The upside of this approach is designs that really work in the browser, and they found it to be faster. The downside is that it's harder to find designers who have both of those skills, but that was an acceptable tradeoff for them because they're a smaller company.
To me, it seems obvious that AI will attack this from both directions - upskilling developers to make more design changes AND upskilling designers to make more design iterations and more changes to the codebase - so the design artifact becomes "new react components" (which can be re-implemented or not) instead of a figma design.
Most web design is already crap to begin with, so AI web-design will fit right in.
Plus, compared to totally open-ended video generation, web design is mostly samey (it follows a few trends and conventions), way more constrained, and doesn't include humans, who are difficult to recreate due to the uncanny valley effect.
> Haha, that's exactly how cc was received initially.
Haha, maybe by you. By many on HN, but HN is a bubble of its own. By plenty of others it was received very differently. Many of us had been doing agentic coding for more than a year already when Claude Code was released, because we found it valuable.
We will see if such groups of professional designers also form for Claude Design or other such tools.
It's still an autocomplete on steroids (that's what LLMs are).
It still produces subpar code, with horrendous data access patterns, endless duplication of functionality, etc. You still need a human in the loop to fix all the mistakes (unless you're Garry Tan or Steve Yegge, who assume that quality is when you push hundreds of thousands of LoC per day).
Same here.
Oh, and Claude Code is significantly worse at generating design code than almost any other type of code.
Just because you don't look at the code, doesn't mean it doesn't produce subpar code constantly.
Opus 4.7, high effort. Literally 30 minutes ago. There's a `const UNMATCHED_MARKER = "<hardcoded value>"` that we want to remove from a file. Behold (the first version was a full-on AST walk for absolutely no reason whatsoever, by the way):
Don't get me started on all the code duplication, utility functions written from scratch in every module that needs them, reading the full database just to count the number of records...
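That last one is worth spelling out; here's a minimal sqlite sketch of the anti-pattern (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)",
                 [(i,) for i in range(1000)])

# The anti-pattern: pull every row into the application just to count them.
n_slow = len(conn.execute("SELECT * FROM records").fetchall())

# The fix: let the database do the counting; only one integer comes back.
n_fast = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]

print(n_slow, n_fast)  # 1000 1000
```

Same answer either way, but the first version transfers and materializes the entire table.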
Unless I'm parsing your reply very badly, I see no world in which anything dealing with HTTP would be more expensive than dealing with the KV cache (loading it from "cold" storage, deciding which compute unit to load it into, doing the actual computation for the next call, etc.).
No, that’s not the issue. What people fail to understand is that every request - e.g. every message you send, but also every tool call response - requires the entire conversation history to be sent, and the LLM provider needs to reprocess it.
The attention part of LLMs (that is, for every token, how strongly it attends to all the other tokens) is what gets saved: the keys and values computed for earlier tokens are stored in a KV cache so they don't have to be recomputed.
You can imagine that with large context windows the overhead becomes enormous (attention has quadratic complexity in the context length).
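To put rough numbers on why reprocessing hurts, here's a toy cost model (made-up op counts, not any provider's real figures):

```python
# Toy model: without a KV cache, replaying a conversation of n tokens
# means token i re-attends to tokens 0..i, so total work grows ~n^2.
def attention_ops_no_cache(n_tokens: int) -> int:
    return sum(i + 1 for i in range(n_tokens))

# With the cache, only the new tokens are processed; the stored
# keys/values of the old tokens are reused.
def attention_ops_with_cache(old_tokens: int, new_tokens: int) -> int:
    return sum(old_tokens + i + 1 for i in range(new_tokens))

# A 64k-token history plus a 100-token tool response:
full = attention_ops_no_cache(64_000 + 100)
incremental = attention_ops_with_cache(64_000, 100)
print(full // incremental)  # prints 320: ~320x less work with a warm cache
```

Which is exactly why providers go to such lengths to keep that cache warm rather than recomputing from scratch.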
I remember seeing a YouTube video about this tech already being trialed (with regular lasers) for geothermal. They use lasers to "vaporise" rock, in the hopes of digging much more efficiently.
I copy-pasted a section of Asimov's "The Last Question", since it was readily on the front page. It detected 14 patterns (2 reds, one yellow, a bunch of greens and blues) in 583 words. Welp, I guess it's back to school for Mr. Asimov...
Update: 13 patterns in 800 words for Samuel Clemens. Apparently he's an em-dash abuser, but also likes "filler adverbs", "triple constructions" and "anaphora abuse". Damn!
And for Mr. Hemingway we have 43 patterns in 1600 words. 16 filler adverbs, 5 triple constructions, 5 staccato bursts, and 14 question then answer. My my...