That's an interesting problem. Tarsier probably isn't the best fit here, since it's focused on webpage perception rather than general-purpose OCR. That said, you could try adapting the `format_text` function in `tarsier/text_format.py` to convert an arbitrary set of OCR annotations into a whitespace-structured string. Curious to see if that works.
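
To make the idea concrete, here's a rough standalone sketch of what "OCR annotations to a whitespace-structured string" could look like. It does not call Tarsier's `format_text` or mirror its signature; `Box`, `layout_text`, `char_width`, and `line_tolerance` are all hypothetical names I made up for illustration, and the annotation shape (text plus top-left pixel coordinates and height) is an assumption about what your OCR engine returns.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical OCR annotation: one word and its position in pixels."""
    text: str
    x: float       # left edge
    y: float       # top edge
    height: float  # box height, used to decide whether two words share a line


def layout_text(boxes, char_width=10.0, line_tolerance=0.5):
    """Approximate the page layout with spaces and newlines.

    char_width is an assumed average glyph width in pixels; line_tolerance is
    the fraction of a box's height within which two words count as one line.
    """
    # Group words into rows by comparing each top edge to the row's first word.
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(box.y - rows[-1][0].y) <= line_tolerance * rows[-1][0].height:
            rows[-1].append(box)
        else:
            rows.append([box])

    lines = []
    for row in rows:
        line = ""
        for box in sorted(row, key=lambda b: b.x):
            # Map the pixel offset to a character column.
            col = int(box.x / char_width)
            # If the previous word already reaches this column, keep one space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + box.text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Box("Name", 0, 0, 10), Box("Price", 200, 1, 10),
        Box("Apple", 0, 20, 10), Box("1.50", 200, 21, 10),
    ]
    print(layout_text(sample))
```

The row grouping here is deliberately naive (it only compares against the first word in the current row), so skewed scans or multi-column pages would likely need something smarter.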