Hacker News

Llama-3.2-11B-Vision-Instruct does an excellent job extracting/answering questions from screenshots. It is even able to answer questions based on information buried inside a flowchart. How is this even possible??


Because they trained the text model, then froze its weights. Then they trained a vision encoder on text-image pairs of progressively higher quality, and trained an adapter to align the two latent spaces. So the model became smart on text, then gained a new input sense without its text weights ever changing.
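If the description above holds, the recipe can be caricatured in a few lines: freeze the text and vision sides, then fit only a small adapter that projects vision embeddings into the text model's embedding space using paired examples. This is a toy linear sketch, not Meta's actual training code; the dimensions, data, and the linear map are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT, N = 8, 4, 64   # toy embedding sizes / dataset size (assumptions)

# Outputs of the (frozen) vision encoder for N images.
vision_embs = rng.normal(size=(N, D_VIS))

# Target vectors in the (frozen) text model's embedding space, e.g.
# caption embeddings; here generated by a hidden ground-truth map.
W_true = rng.normal(size=(D_TXT, D_VIS))
text_embs = vision_embs @ W_true.T

# The only trainable parameters: the adapter's projection matrix.
W_adapter = np.zeros((D_TXT, D_VIS))

# Plain gradient descent on the mean-squared alignment error.
lr = 0.01
for _ in range(500):
    preds = vision_embs @ W_adapter.T
    grad = 2.0 * (preds - text_embs).T @ vision_embs / N
    W_adapter -= lr * grad

mse = float(np.mean((vision_embs @ W_adapter.T - text_embs) ** 2))
print(f"alignment MSE after adapter training: {mse:.6f}")
```

The point of the toy: nothing on the text side moves, yet after training the adapter, vision inputs land close enough to the right spots in the text embedding space for the frozen language model to reason about them.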


Is this - at a reasonable guess - what most believe OpenAI did with 4o?


Oh, this is promising. It's not surprising to me: image models have been very oriented towards photography and scene understanding rather than understanding symbolic information in images (like text or diagrams), but I always thought that it should be possible to make the model better at the latter, for instance by training it more on historical handwritten documents.


How good is it at comic reading?


magic



