VimGPT couples perception to a specific LLM/agent, whereas Tarsier is solely a perception system that you can use with any unimodal or multimodal web agent. So it's hard to compare them directly, but you could say that VimGPT's performance probably lies somewhere in the middle of Tarsier's performance distribution (which varies as a function of your specific agent/prompt system).