Llama-3.2-11B-Vision-Instruct does an excellent job extracting information and answering questions from screenshots. It can even answer questions based on information buried inside a flowchart. How is this even possible??
Because they trained the text model first, then froze its weights. Then they trained a vision model on text-image pairs of progressively higher quality, and finally trained an adapter to align the two latent spaces. So the model got smart on text, then gained a new input sense, seemingly by magic, without its text weights changing.
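To make the "freeze the text model, train only an adapter" idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the tiny dimensions, the linear adapter, the mean-squared-error alignment objective, and names like `frozen_text_weights` and `train_adapter` are made up, and the real model's adapter and training objective are far more involved. It only shows the principle that gradient updates touch the adapter while the text weights stay exactly as they were.

```python
import random

random.seed(0)

DIM_IMG, DIM_TXT = 4, 3  # toy sizes, purely illustrative

# Frozen "text model": a fixed linear readout over the text latent space.
# Stand-in for the pretrained language model; never updated below.
frozen_text_weights = [random.uniform(-1, 1) for _ in range(DIM_TXT)]

# Trainable adapter: maps vision features into the text latent space.
adapter = [[0.0] * DIM_IMG for _ in range(DIM_TXT)]


def adapt(image_features):
    """Project image features into the text latent space."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in adapter]


def text_model(latent):
    """Frozen text model consuming a latent vector (text or adapted image)."""
    return sum(w * z for w, z in zip(frozen_text_weights, latent))


# Toy paired data: image features and the text latent they should align with.
# (Hypothetical ground-truth mapping, used only to generate the pairs.)
true_map = [[random.uniform(-1, 1) for _ in range(DIM_IMG)] for _ in range(DIM_TXT)]
pairs = []
for _ in range(50):
    img = [random.uniform(-1, 1) for _ in range(DIM_IMG)]
    txt = [sum(w * x for w, x in zip(row, img)) for row in true_map]
    pairs.append((img, txt))


def alignment_loss():
    """Mean squared distance between adapted image latents and text latents."""
    total = 0.0
    for img, txt in pairs:
        pred = adapt(img)
        total += sum((p - t) ** 2 for p, t in zip(pred, txt))
    return total / len(pairs)


def train_adapter(steps=200, lr=0.1):
    """Plain SGD on the adapter only; the text weights are never touched."""
    for _ in range(steps):
        for img, txt in pairs:
            pred = adapt(img)
            for i in range(DIM_TXT):
                err = pred[i] - txt[i]
                for j in range(DIM_IMG):
                    adapter[i][j] -= lr * 2 * err * img[j]


weights_before = list(frozen_text_weights)
loss_before = alignment_loss()
train_adapter()
loss_after = alignment_loss()

print("alignment loss before:", loss_before)
print("alignment loss after: ", loss_after)
print("text weights unchanged:", frozen_text_weights == weights_before)
```

After training, `text_model(adapt(img))` gives nearly the same answer as `text_model(txt)` for a paired example, which is the toy version of the text model "seeing" through the adapter without having been retrained.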
Oh, this is promising. It's not surprising to me: image models have been very oriented towards photography and scene understanding rather than understanding symbolic information in images (like text or diagrams), but I always thought that it should be possible to make the model better at the latter, for instance by training it more on historical handwritten documents.