My guess is that these systems are running image recognition models, and maybe OCR on the images, and then just piping that output as tokens into an LLM. So you are only ever going to get results as good as existing image models, filtered through an LLM.
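A minimal sketch of the guessed pipeline, purely as an illustration: the model functions below are stubs standing in for real vision/OCR models and an LLM call, since nothing is known about the actual implementation.

```python
# Hypothetical sketch of the described pipeline: run vision models over
# the image, then hand their plain-text output to an LLM as context.
# All model calls here are stubs; a real system would invoke actual
# image-recognition / OCR models and an LLM API.

def run_image_recognition(image_bytes: bytes) -> str:
    # Stub: a real model would return labels or a caption.
    return "caption: a storefront with a sign in the window"

def run_ocr(image_bytes: bytes) -> str:
    # Stub: a real OCR model would return text found in the image.
    return "ocr: 'FRESH MILK $2.99'"

def build_llm_prompt(question: str, image_bytes: bytes) -> str:
    # The key point: the LLM never sees pixels, only the text
    # produced by the upstream image models.
    context = "\n".join([
        run_image_recognition(image_bytes),
        run_ocr(image_bytes),
    ])
    return f"Image analysis:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_llm_prompt("What does the sign say?", b"<image bytes>")
print(prompt)
```

This makes the limitation in the comment concrete: any detail the upstream models miss is simply absent from the prompt the LLM sees.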
To me, this is only interesting when compared with the results of image recognition models that can already answer these types of questions (if such models even exist; I haven't looked).
Maybe the service is smart enough to look at the question and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.
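If the service does route questions to models, one naive way to do it would look like the sketch below. The keyword rules are entirely made up for illustration; a real service might use a classifier or even the LLM itself to decide.

```python
# Hypothetical model-routing sketch: inspect the user's question and
# decide which vision models to run on the image. The keywords and
# model names are invented for this example.

def choose_models(question: str) -> list[str]:
    q = question.lower()
    chosen = []
    # Questions about text in the image -> run OCR.
    if any(w in q for w in ("say", "read", "text", "written")):
        chosen.append("ocr")
    # Questions about objects or counting -> run an object detector.
    if any(w in q for w in ("how many", "count", "object")):
        chosen.append("object_detector")
    # Fallback: a general captioning model describes the scene.
    if not chosen:
        chosen.append("captioner")
    return chosen

print(choose_models("What does the sign say?"))   # -> ['ocr']
print(choose_models("How many dogs are there?"))  # -> ['object_detector']
print(choose_models("Describe this picture"))     # -> ['captioner']
```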