Llama-3.2-11B-Vision-Instruct does an excellent job extracting/answering questio...

Ey7NFZ3P0nzAe · on Sept 26, 2024

Because they trained the text model first, then froze its weights, then trained a vision model on text-image pairs of progressively higher quality, then trained an adapter to align the two latent spaces. So it became smart on text and then gained a new input sense, as if by magic, without its own weights changing.
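
To make the "frozen weights plus adapter" idea concrete, here is a toy sketch in plain Python. Everything in it is a made-up stand-in, not the actual Llama 3.2 recipe: `TEXT_WEIGHTS` plays the frozen text model, `TRUE_MAP` only exists to synthesize paired data, and the adapter is a plain linear projection trained with SGD, which is far simpler than whatever adapter a real model uses.

```python
import random

random.seed(0)
DIM_VISION, DIM_TEXT = 3, 2

# Hypothetical frozen "text model": a fixed linear readout over its latent space.
# Nothing below ever writes to these weights.
TEXT_WEIGHTS = [[0.5, -1.0], [2.0, 0.25]]
frozen_snapshot = [row[:] for row in TEXT_WEIGHTS]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def text_model(latent):
    return matvec(TEXT_WEIGHTS, latent)


# Hypothetical ground-truth alignment, used only to build paired training data:
# each vision feature comes with the text-space latent it should map to.
TRUE_MAP = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
pairs = []
for _ in range(50):
    v = [random.uniform(-1, 1) for _ in range(DIM_VISION)]
    pairs.append((v, matvec(TRUE_MAP, v)))

# The adapter holds the only trainable parameters in this sketch.
adapter = [[0.0] * DIM_VISION for _ in range(DIM_TEXT)]


def mse(adapter_weights):
    total = 0.0
    for v, target in pairs:
        pred = matvec(adapter_weights, v)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)


loss_before = mse(adapter)
LEARNING_RATE = 0.1
for _ in range(200):
    for v, target in pairs:
        pred = matvec(adapter, v)
        for i in range(DIM_TEXT):
            err = pred[i] - target[i]
            for j in range(DIM_VISION):
                # Gradient of the squared error with respect to adapter[i][j].
                adapter[i][j] -= LEARNING_RATE * 2 * err * v[j]
loss_after = mse(adapter)

# An "image" now reaches the untouched text model through the adapter.
image_features = [0.2, -0.4, 0.9]
output = text_model(matvec(adapter, image_features))
print(loss_before, loss_after, output)
```

The point of the toy is only that the loss drops while `TEXT_WEIGHTS` stays identical to `frozen_snapshot`: the new input path is learned entirely outside the text model.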

ComputerGuru · on Sept 26, 2024

Is this - at a reasonable guess - what most believe OpenAI did with 4o?

vintermann · on Sept 26, 2024

Oh, this is promising. It's not surprising to me: image models have been very oriented towards photography and scene understanding rather than understanding symbolic information in images (like text or diagrams), but I always thought that it should be possible to make the model better at the latter, for instance by training it more on historical handwritten documents.

faangguyindia · on Sept 26, 2024

How good is it at reading comics?

bboygravity · on Sept 26, 2024

magic