Hacker News

This paper examines the limitations of current vision language models, such as GPT-4 and Sonnet 3.5, on low-level vision tasks. Despite scoring highly on many multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?

