We could have done a better job of explaining CLIP in this post, but we've written about it quite extensively before[1][2][3][4][5] if you're interested. I'm convinced CLIP is the most important advancement in machine learning in the past year; it's incredibly versatile.
In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an illustration of Deadpool pretending to be a bunny rabbit"[6] and "an underwater scene in the style of Vincent Van Gogh"[7] and any other concept you can come up with[8] (even though it has definitely never seen those things in its training data[9]).
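To make that concrete, here's roughly what zero-shot classification with CLIP looks like. This is a minimal sketch using OpenAI's clip package (installed from the CLIP GitHub repo); the image path and prompts are just placeholders:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Any labels you can phrase in English work as classes; no retraining needed.
    prompts = [
        "a photo of a cat",
        "a photo of a dog",
        "an illustration of Deadpool pretending to be a bunny rabbit",
    ]

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1)[0]

    for prompt, p in zip(prompts, probs.tolist()):
        print(f"{p:.3f}  {prompt}")

The "classifier" is just whatever list of strings you pass in, which is what makes it so versatile.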
This is how these CLIP+VQGAN notebooks can create such a symphony of artistic renderings[10] (CLIP steers the GANs towards its interpretation of any English string that the artist can imagine).
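If you're curious what that steering loop looks like, here's a heavily simplified sketch. To flag the assumptions: it optimizes raw pixels directly as a stand-in for a VQGAN's latent code (so the output will be noisy rather than coherent art), whereas the actual notebooks backpropagate through the GAN and add image augmentations. The objective is the same either way: push the generated image's CLIP embedding toward the prompt's CLIP embedding.

    import torch
    import clip

    device = "cpu"  # kept on CPU so the model loads in fp32 and gradients stay simple
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Embed the prompt once; this is the target we steer toward.
    prompt = "an underwater scene in the style of Vincent Van Gogh"
    with torch.no_grad():
        target = model.encode_text(clip.tokenize([prompt]).to(device))
        target = target / target.norm(dim=-1, keepdim=True)

    # Stand-in "generator": a raw 224x224 image we optimize directly.
    # VQGAN+CLIP optimizes the GAN's latent code here instead.
    pixels = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([pixels], lr=0.05)

    # The same normalization constants CLIP's preprocess applies.
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

    for step in range(200):
        optimizer.zero_grad()
        features = model.encode_image((pixels.clamp(0, 1) - mean) / std)
        features = features / features.norm(dim=-1, keepdim=True)
        loss = -(features * target).sum()  # maximize cosine similarity to the prompt
        loss.backward()
        optimizer.step()

Swap the pixel tensor for a generator's latent vector and you have the core loop of those notebooks.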
Thanks, but I think that lucb1e's confusion was probably the same as mine -- given pretrained CLIP features, how is this translated to zero-shot tracking?
Are initial bounding boxes given as usual, or are objects of interest created automagically?
It uses an object detection model (in our example code[1], we used one from Roboflow Universe[2], but you should be able to use any object detection model) to get the bounding boxes, then sends a crop of each detected box to CLIP to get the feature vector that Deep SORT uses to differentiate between instances and track them across frames.
This is in comparison to the original Deep SORT[3], which requires you to train a second custom "deep appearance descriptor" model for the tracker to use.
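To make that concrete, here's a rough sketch of the per-frame step. To be clear about what's assumed: the bounding boxes come from whatever detector you choose (passed in as plain (x1, y1, x2, y2) tuples), and the greedy cosine matching at the end is only there to illustrate the appearance cue; the real Deep SORT combines it with a Kalman filter and Hungarian matching.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed_detections(frame: Image.Image, boxes):
        """Crop each detected box out of the frame and embed the crop with CLIP.
        Returns one L2-normalized feature vector per detection."""
        crops = [preprocess(frame.crop(box)) for box in boxes]  # box = (x1, y1, x2, y2)
        batch = torch.stack(crops).to(device)
        with torch.no_grad():
            features = model.encode_image(batch)
        return features / features.norm(dim=-1, keepdim=True)

    def match(prev_features, curr_features, threshold=0.8):
        """Toy association: pair each current detection with the most similar
        track from the previous frame by cosine similarity of CLIP features."""
        similarity = curr_features @ prev_features.T  # (num_curr, num_prev)
        best = similarity.argmax(dim=1)
        return [(i, int(j)) for i, j in enumerate(best) if similarity[i, j] > threshold]

The point is that the appearance descriptor comes for free from CLIP, so there's no second model to train.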
Can someone explain to me how saying “we utilize CLIP” and “we are zero-shot” aren’t contradictory? You’d be hard-pressed to come up with a class of things from reality or popular fiction that CLIP did not see at least a few shots of.
Zero-shot means it can generalize to things it's never seen before.
You don't need to train CLIP on any examples of "The stacks from Ready Player One drawn by MC Escher" to have it do a pretty good job of being able to judge which outputs of a GAN more closely resemble that concept[1].
Saying this isn't "zero-shot" because CLIP has been trained on a corpus (the Internet) that had some info about the setting of the movie Ready Player One and the artist MC Escher isn't a particularly useful interpretation. By the same reasoning, humans aren't capable of "zero-shot" classification either because they have prior knowledge of real-world concepts.
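And the "judging" itself is nothing exotic: embed the prompt once, embed each candidate image, and rank by cosine similarity. A small sketch (the image paths are placeholders for, say, a batch of GAN outputs):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompt = "The stacks from Ready Player One drawn by MC Escher"
    candidates = ["output_1.png", "output_2.png", "output_3.png"]  # placeholder paths

    with torch.no_grad():
        text = model.encode_text(clip.tokenize([prompt]).to(device))
        text = text / text.norm(dim=-1, keepdim=True)
        images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = (feats @ text.T).squeeze(1)

    for path, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {path}")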
The model does object tracking well whether you feed it fish, or playing cards, or sperm, because the generalized feature vector it extracts represents a broad range of visual properties. It "knows" (and uses) properties like "facing right", "profile view", "blue sheen", "foggy", etc. to determine which instances are which.
CLIP has seen zero segmentation masks or bounding-box-labeled examples of any object class. It was trained with pure image-level contrastive learning against (extremely) noisy natural-language captions.
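For reference, that training objective is the symmetric contrastive loss from the CLIP paper: embed a batch of (image, caption) pairs, compute every pairwise similarity, and apply cross-entropy in both directions so matching pairs score highest. A sketch with random tensors standing in for the encoder outputs:

    import torch
    import torch.nn.functional as F

    batch_size, dim = 8, 512

    # Stand-ins for CLIP's image and text encoder outputs on a batch of
    # (image, caption) pairs scraped from the web; row i of each is a pair.
    image_features = F.normalize(torch.randn(batch_size, dim), dim=-1)
    text_features = F.normalize(torch.randn(batch_size, dim), dim=-1)

    # Learned temperature (CLIP parameterizes it as a log-scaled scalar).
    logit_scale = torch.tensor(100.0)

    # Entry (i, j) is the similarity between image i and caption j.
    logits = logit_scale * image_features @ text_features.T

    # The "correct" caption for image i is caption i, and vice versa.
    # Note there are no boxes, masks, or class labels anywhere in this loss.
    labels = torch.arange(batch_size)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2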
[1] ELI5 CLIP: https://blog.roboflow.com/clip-model-eli5-beginner-guide/
[2] How to Try CLIP: https://blog.roboflow.com/how-to-use-openai-clip/
[3] Content Moderation with CLIP: https://blog.roboflow.com/zero-shot-content-moderation-opena...
[4] CLIP Prompt Engineering: https://blog.roboflow.com/openai-clip-prompt-engineering/
[5] CLIP for Semantic Image Similarity: https://blog.roboflow.com/apples-csam-neuralhash-collision/
[6] Deadpool Pretending to be a Bunny Rabbit: https://paint.wtf/ranking/erFA7/uCzEnPBtKkgoBAhBesQY
[7] Underwater Van Gogh: https://paint.wtf/ranking/xboNk/0fjxp0kEi0WeDgkP3LZX
[8] Other human drawings as judged by CLIP: https://paint.wtf/leaderboard
[9] CLIP-judged Pictionary game: https://blog.roboflow.com/how-we-built-paint-wtf-an-ai-that-...
[10] AI Generated Art with CLIP+VQGAN: https://blog.roboflow.com/ai-generated-art/