satvikpendem 5 months ago | on: Gemini 3

I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker-based transcript, while I'm not sure that Google's models do so; maybe they just hallucinate the speakers instead.

trvz 5 months ago

Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.
