
My guess is that these systems run image recognition models, and maybe OCR, on the images and then just pipe that output as tokens into an LLM. If so, the results can only ever be as good as existing image models, filtered through an LLM.

To me, this is only interesting when compared against image recognition models that can already answer these kinds of questions (if such models even exist; I haven't looked).

Maybe the service is smart enough to look at the question and then choose one or more models to process the image, but I'm not sure, since I can't find anything on their sites about how it works.



That's not how they work. The original GPT-4 paper has some detail: https://cdn.openai.com/papers/gpt-4.pdf

Or read up on PaliGemma: https://github.com/google-research/big_vision/blob/main/big_...
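Roughly, the recipe in PaliGemma-style models is: a vision transformer encodes the image into patch embeddings, a learned projection maps those into the LLM's embedding space, and the projected vectors are concatenated with the text embeddings before the transformer runs. A minimal PyTorch sketch of that idea, with made-up dimensions (VISION_DIM, LLM_DIM, and NUM_PATCHES are illustrative, not PaliGemma's actual values):

    import torch
    import torch.nn as nn

    # Made-up dimensions, for illustration only.
    VISION_DIM = 768    # width of the vision encoder (e.g. a ViT)
    LLM_DIM = 2048      # embedding width of the language model
    NUM_PATCHES = 256   # image patches produced by the encoder

    # A learned projection maps patch embeddings into the LLM's
    # embedding space; in practice it is trained end to end.
    projector = nn.Linear(VISION_DIM, LLM_DIM)

    patch_embeddings = torch.randn(1, NUM_PATCHES, VISION_DIM)  # stand-in ViT output
    image_tokens = projector(patch_embeddings)                  # (1, 256, LLM_DIM)

    text_embeddings = torch.randn(1, 12, LLM_DIM)               # stand-in prompt embeddings
    llm_input = torch.cat([image_tokens, text_embeddings], dim=1)
    # llm_input goes straight into the transformer: the image enters as
    # continuous "soft tokens", not as text produced by a separate model.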


Thanks, I'll read up on this.


> My guess is that the systems are running image recognition models

Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an idea of how they work.
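For a concrete feel of the contrastive approach CLIP takes, here is a short sketch using the Hugging Face transformers library (the model name and captions are just examples): the image and text encoders map into a shared embedding space, and image-text similarity gives a zero-shot "classification" over arbitrary captions, no task-specific recognition model involved.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Jointly trained image and text encoders sharing one embedding space.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # any local image
    captions = ["a photo of a cat", "a photo of a dog"]

    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Similarity logits between the image and each caption, softmaxed
    # into a probability over the candidate captions.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(captions, probs[0].tolist())))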


Will do, thank you.



