Gemini Flash 2.5 lite does 400 tokens/sec. Is there benefit to going faster than...

atls · 2026-02-20T14:53:00 1771599180

There is also the use case of delegating tasks programmatically to an LLM, for example, transforming unstructured data into structured data. This task often can’t be done reliably without either (1) lots of manual work or (2) intelligence, especially when the structure of the individual data pieces is unknown. Problems like these can be solved much more efficiently by LLMs, and if you imagine these programs processing very large datasets, then sub-millisecond inference is crucial.

xnx · 2026-02-20T15:17:37 1771600657

Aren't such tasks inherently parallelizable?
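
To make the point concrete, here is a minimal sketch of fanning independent records out to a thread pool. `fake_llm_extract` is a hypothetical stand-in for a real model call (it just sleeps and parses a string), and the record format and worker count are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_llm_extract(record: str) -> dict:
    """Hypothetical stand-in for an LLM call; the sleep simulates latency."""
    time.sleep(0.01)
    name, _, age = record.partition(",")
    return {"name": name.strip(), "age": int(age)}


def transform_all(records, workers=8):
    # Each record is independent, so the calls can overlap.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_llm_extract, records))


records = ["Ada, 36", "Linus, 54", "Grace, 85"]
structured = transform_all(records)
```

With enough workers, total wall time is governed by throughput rather than by the latency of any single call, which is why per-request speed matters less for this kind of batch job.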

booli · 2026-02-20T13:45:59 1771595159

Agents also "read", so yes, there is. Think about spinning up 10, 20, or 100 sub-agents for a small task and having them all return near-instantly. That's the use case, not the chatbot.

xi_studio · 2026-02-20T15:55:15 1771602915

Agents already bypass human inference time. If a model can turn input into output instantly, it can also run that in a loop and generate long cached tasks near-instantly.

cheema33 · 2026-02-20T14:09:05 1771596545

Yes. You can allow multiple people to use a single chip. A slower solution will be able to service far fewer users.

xnx · 2026-02-20T14:27:55 1771597675

Right, but it is also possible it's cheaper to use 42 Google TPUs for a second than one of these.
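
As a back-of-the-envelope sketch of that comparison: cost is just chips × seconds × price per chip-second. The prices below are made-up placeholders, not real pricing for TPUs or any other chip; only the 42-chips-for-one-second figure comes from the comment above.

```python
def job_cost(chips: int, seconds: float, price_per_chip_second: float) -> float:
    """Total cost of occupying `chips` chips for `seconds` seconds."""
    return chips * seconds * price_per_chip_second


# Placeholder prices (assumptions for illustration, not real quotes).
TPU_PRICE = 0.001        # per TPU-second, hypothetical
FAST_CHIP_PRICE = 0.05   # per fast-chip-second, hypothetical

tpu_cost = job_cost(chips=42, seconds=1.0, price_per_chip_second=TPU_PRICE)
fast_cost = job_cost(chips=1, seconds=1.0, price_per_chip_second=FAST_CHIP_PRICE)

# Under these assumed prices, the 42 slower chips come out cheaper.
cheaper = "TPUs" if tpu_cost < fast_cost else "fast chip"
```

The break-even is simple: for the same one-second job, the single fast chip only wins on cost if its price per second is below 42 times the TPU price.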