My guess is that these systems are running image recognition models, and maybe OCR on the images, and then just piping that output as tokens into an LLM. So you are only ever going to get results as good as existing image models, filtered through an LLM.
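A minimal sketch of the guessed pipeline, purely as an illustration: the model functions below are stubs standing in for real vision/OCR models and an LLM call, since nothing is known about the actual implementation.

```python
# Hypothetical sketch of the described pipeline: run vision models over
# the image, then hand their plain-text output to an LLM as context.
# All model calls here are stubs; a real system would invoke actual
# image-recognition / OCR models and an LLM API.

def run_image_recognition(image_bytes: bytes) -> str:
    # Stub: a real model would return labels or a caption.
    return "caption: a storefront with a sign in the window"

def run_ocr(image_bytes: bytes) -> str:
    # Stub: a real OCR model would return text found in the image.
    return "ocr: 'FRESH MILK $2.99'"

def build_llm_prompt(question: str, image_bytes: bytes) -> str:
    # The key point: the LLM never sees pixels, only the text
    # produced by the upstream image models.
    context = "\n".join([
        run_image_recognition(image_bytes),
        run_ocr(image_bytes),
    ])
    return f"Image analysis:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_llm_prompt("What does the sign say?", b"<image bytes>")
print(prompt)
```

This makes the limitation in the comment concrete: any detail the upstream models miss is simply absent from the prompt the LLM sees.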
To me, this is only interesting when compared with the results of image recognition models that can already answer these types of questions (if such models even exist; I haven't looked).
Maybe the service is smart enough to look at the question and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.
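If the service does route questions to models, one naive way to do it would look like the sketch below. The keyword rules are entirely made up for illustration; a real service might use a classifier or even the LLM itself to decide.

```python
# Hypothetical model-routing sketch: inspect the user's question and
# decide which vision models to run on the image. The keywords and
# model names are invented for this example.

def choose_models(question: str) -> list[str]:
    q = question.lower()
    chosen = []
    # Questions about text in the image -> run OCR.
    if any(w in q for w in ("say", "read", "text", "written")):
        chosen.append("ocr")
    # Questions about objects or counting -> run an object detector.
    if any(w in q for w in ("how many", "count", "object")):
        chosen.append("object_detector")
    # Fallback: a general captioning model describes the scene.
    if not chosen:
        chosen.append("captioner")
    return chosen

print(choose_models("What does the sign say?"))   # -> ['ocr']
print(choose_models("How many dogs are there?"))  # -> ['object_detector']
print(choose_models("Describe this picture"))     # -> ['captioner']
```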