We could have done a better job of explaining CLIP in this post, but we've written about it quite extensively before[1][2][3][4][5] if you're interested. I'm convinced CLIP is the most important advancement in machine learning in the past year; it's incredibly versatile.
In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an illustration of Deadpool pretending to be a bunny rabbit"[6] and "an underwater scene in the style of Vincent Van Gogh"[7] and any other concept you can come up with[8] (even though it has definitely never seen those things in its training data[9]).
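To make that concrete, here's roughly what zero-shot classification with CLIP looks like. This is a minimal sketch using OpenAI's clip package (installed from the CLIP GitHub repo); the image path and prompts are just placeholders:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Any labels you can phrase in English work as classes; no retraining needed.
    prompts = [
        "a photo of a cat",
        "a photo of a dog",
        "an illustration of Deadpool pretending to be a bunny rabbit",
    ]

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1)[0]

    for prompt, p in zip(prompts, probs.tolist()):
        print(f"{p:.3f}  {prompt}")

The "classifier" is just whatever list of strings you pass in, which is what makes it so versatile.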
This is how these CLIP+VQGAN notebooks can create such a symphony of artistic renderings[10] (CLIP steers the GANs towards its interpretation of any English string that the artist can imagine).
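If you're curious what that steering loop looks like, here's a heavily simplified sketch. To flag the assumptions: it optimizes raw pixels directly as a stand-in for a VQGAN's latent code (so the output will be noisy rather than coherent art), whereas the actual notebooks backpropagate through the GAN and add image augmentations. The objective is the same either way: push the generated image's CLIP embedding toward the prompt's CLIP embedding.

    import torch
    import clip

    device = "cpu"  # kept on CPU so the model loads in fp32 and gradients stay simple
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Embed the prompt once; this is the target we steer toward.
    prompt = "an underwater scene in the style of Vincent Van Gogh"
    with torch.no_grad():
        target = model.encode_text(clip.tokenize([prompt]).to(device))
        target = target / target.norm(dim=-1, keepdim=True)

    # Stand-in "generator": a raw 224x224 image we optimize directly.
    # VQGAN+CLIP optimizes the GAN's latent code here instead.
    pixels = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([pixels], lr=0.05)

    # The same normalization constants CLIP's preprocess applies.
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

    for step in range(200):
        optimizer.zero_grad()
        features = model.encode_image((pixels.clamp(0, 1) - mean) / std)
        features = features / features.norm(dim=-1, keepdim=True)
        loss = -(features * target).sum()  # maximize cosine similarity to the prompt
        loss.backward()
        optimizer.step()

Swap the pixel tensor for a generator's latent vector and you have the core loop of those notebooks.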
Thanks, but I think that lucb1e's confusion was probably the same as mine -- given pretrained CLIP features, how is this translated to zero-shot tracking?
Are initial bounding boxes given as usual, or are objects of interest created automagically?
It uses an object detection model (in our example code[1], we used one from Roboflow Universe[2], but you should be able to use any object detection model) to get the bounding boxes, then sends a crop of each detected box to CLIP to get the feature vector that Deep SORT uses to differentiate between instances and track them across frames.
This is in comparison to the original Deep SORT[3], which requires you to train a second custom "deep appearance descriptor" model for the tracker to use.
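To make that concrete, here's a rough sketch of the per-frame step. To be clear about what's assumed: the bounding boxes come from whatever detector you choose (passed in as plain (x1, y1, x2, y2) tuples), and the greedy cosine matching at the end is only there to illustrate the appearance cue; the real Deep SORT combines it with a Kalman filter and Hungarian matching.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed_detections(frame: Image.Image, boxes):
        """Crop each detected box out of the frame and embed the crop with CLIP.
        Returns one L2-normalized feature vector per detection."""
        crops = [preprocess(frame.crop(box)) for box in boxes]  # box = (x1, y1, x2, y2)
        batch = torch.stack(crops).to(device)
        with torch.no_grad():
            features = model.encode_image(batch)
        return features / features.norm(dim=-1, keepdim=True)

    def match(prev_features, curr_features, threshold=0.8):
        """Toy association: pair each current detection with the most similar
        track from the previous frame by cosine similarity of CLIP features."""
        similarity = curr_features @ prev_features.T  # (num_curr, num_prev)
        best = similarity.argmax(dim=1)
        return [(i, int(j)) for i, j in enumerate(best) if similarity[i, j] > threshold]

The point is that the appearance descriptor comes for free from CLIP, so there's no second model to train.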
Can someone explain to me how saying “we utilize CLIP” and “we are zero-shot” aren’t contradictory? You’d be hard-pressed to come up with a class of things from reality or popular fiction that CLIP did not see at least a few shots of.
Zero-shot means it can generalize to things it's never seen before.
You don't need to train CLIP on any examples of "The stacks from Ready Player One drawn by MC Escher" to have it do a pretty good job of being able to judge which outputs of a GAN more closely resemble that concept[1].
Saying this isn't "zero-shot" because CLIP has been trained on a corpus (the Internet) that had some info about the setting of the movie Ready Player One and the artist MC Escher isn't a particularly useful interpretation. By the same reasoning, humans aren't capable of "zero-shot" classification either because they have prior knowledge of real-world concepts.
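And the "judging" itself is nothing exotic: embed the prompt once, embed each candidate image, and rank by cosine similarity. A small sketch (the image paths are placeholders for, say, a batch of GAN outputs):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompt = "The stacks from Ready Player One drawn by MC Escher"
    candidates = ["output_1.png", "output_2.png", "output_3.png"]  # placeholder paths

    with torch.no_grad():
        text = model.encode_text(clip.tokenize([prompt]).to(device))
        text = text / text.norm(dim=-1, keepdim=True)
        images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = (feats @ text.T).squeeze(1)

    for path, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {path}")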
The model does object tracking well whether you feed it fish, or playing cards, or sperm, because the generalized feature vector it extracts represents a broad range of visual properties. It "knows" (and uses) properties like "facing right", "profile view", "blue sheen", "foggy", etc. to determine which instances are which.
CLIP has seen zero segmentation masks or bounding-box-labeled examples of any object class. It was trained with pure image-level contrastive learning against (extremely) noisy natural-language captions.
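For reference, that training objective is the symmetric contrastive loss from the CLIP paper: embed a batch of (image, caption) pairs, compute every pairwise similarity, and apply cross-entropy in both directions so matching pairs score highest. A sketch with random tensors standing in for the encoder outputs:

    import torch
    import torch.nn.functional as F

    batch_size, dim = 8, 512

    # Stand-ins for CLIP's image and text encoder outputs on a batch of
    # (image, caption) pairs scraped from the web; row i of each is a pair.
    image_features = F.normalize(torch.randn(batch_size, dim), dim=-1)
    text_features = F.normalize(torch.randn(batch_size, dim), dim=-1)

    # Learned temperature (CLIP parameterizes it as a log-scaled scalar).
    logit_scale = torch.tensor(100.0)

    # Entry (i, j) is the similarity between image i and caption j.
    logits = logit_scale * image_features @ text_features.T

    # The "correct" caption for image i is caption i, and vice versa.
    # Note there are no boxes, masks, or class labels anywhere in this loss.
    labels = torch.arange(batch_size)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2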
[1] ELI5 CLIP: https://blog.roboflow.com/clip-model-eli5-beginner-guide/
[2] How to Try CLIP: https://blog.roboflow.com/how-to-use-openai-clip/
[3] Content Moderation with CLIP: https://blog.roboflow.com/zero-shot-content-moderation-opena...
[4] CLIP Prompt Engineering: https://blog.roboflow.com/openai-clip-prompt-engineering/
[5] CLIP for Semantic Image Similarity: https://blog.roboflow.com/apples-csam-neuralhash-collision/
[6] Deadpool Pretending to be a Bunny Rabbit: https://paint.wtf/ranking/erFA7/uCzEnPBtKkgoBAhBesQY
[7] Underwater Van Gogh: https://paint.wtf/ranking/xboNk/0fjxp0kEi0WeDgkP3LZX
[8] Other human drawings as judged by CLIP: https://paint.wtf/leaderboard
[9] CLIP-judged Pictionary game: https://blog.roboflow.com/how-we-built-paint-wtf-an-ai-that-...
[10] AI Generated Art with CLIP+VQGAN: https://blog.roboflow.com/ai-generated-art/