This looks interesting, although even in their sample video you can see a lot of tracking loss and re-identification events.
I do wonder how it would perform in more difficult scenarios with problems like partial obstruction, crowding and changing object scale (e.g. objects moving away from the camera).
I implemented a production object-tracking system (using tracking by detection), and these were the main challenges I encountered.
I know what object tracking is, but how can you do it with zero shots of the object? I was curious what this name refers to, but it's quite buried: from the blog post, click through to the repository, then in the readme click on OpenAI's CLIP zero-shot image classifier to find:
> "Zero-shot learning" is when a model attempts to predict a class it saw zero times in the training data. So, using a model trained on exclusively cats and dogs to then detect raccoons.
Wikipedia on ZSL gives this example:
> For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an AI which has been trained to recognize horses, but has never seen a zebra, can still recognize a zebra if it also knows that zebras look like striped horses.
Coming back to the blog post with this understanding, this part is interesting:
> The breakthrough in our zero shot object tracking repository is to use generalized CLIP object features, eliminating the need for you to make additional object track annotations
Where "CLIP is a neural network trained on a variety of (image, text) pairs." (From the first source again, so CLIP is a pre-trained model, not the algorithm that you need to make the model.)
Looking a bit at the CLIP repository from openai:
> CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples
Here it says they don't use any labeled examples? So it just knows that a fish is called a fish? This field is confusing and such a rabbit hole ^^
We could have done a better job of explaining CLIP in this post, but we've written about it quite extensively before[1][2][3][4][5] if you're interested. I'm convinced CLIP is the most important advancement in machine learning in the past year; it's incredibly versatile.
In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an illustration of Deadpool pretending to be a bunny rabbit"[6] and "an underwater scene in the style of Vincent Van Gogh"[7] and any other concept you can come up with[8] (even though it has definitely never seen those things in its training data[9]).
This is how these CLIP+VQGAN notebooks can create such a symphony of artistic renderings[10] (CLIP steers the GANs towards its interpretation of any English string that the artist can imagine).
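If you want a feel for what "any concept you can come up with" means in practice, zero-shot classification with the openai/clip package is only a few lines. This is just an illustrative sketch; the image path and prompt strings are placeholders, not anything from the post:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Any English phrases can serve as the "classes", including ones CLIP never saw verbatim
    prompts = ["a photo of a cat",
               "a photo of a dog",
               "an illustration of Deadpool pretending to be a bunny rabbit"]

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)   # image-to-text similarity logits
        probs = logits_per_image.softmax(dim=-1)

    print(prompts[probs.argmax().item()])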
Thanks, but I think that lucb1e's confusion was probably the same as mine -- given pretrained CLIP features, how is this translated to zero-shot tracking?
Are initial bounding boxes given as usual, or are objects of interest created automagically?
It uses an object detection model (in our example code[1], we used one from Roboflow Universe[2] but you should be able to use any object detection model) to get the bounding boxes and then sends a crop of each detected box to CLIP to get the feature vector that Deep SORT uses to differentiate between and track instances across frames.
This is in comparison to the original Deep SORT[3] which requires you to train a second custom "deep appearance descriptor" model for the tracker to use.
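Roughly, the glue between the detector and the tracker looks something like the sketch below. This is not the repo's actual code, just the idea under assumed interfaces (a PIL frame and (x1, y1, x2, y2) boxes from whatever detector you use):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def appearance_features(frame, boxes):
        """Crop each detection from the frame and embed the crop with CLIP.
        `frame` is a PIL image; `boxes` is a list of (x1, y1, x2, y2) tuples."""
        crops = [preprocess(frame.crop(box)) for box in boxes]
        batch = torch.stack(crops).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        # L2-normalize so the tracker's cosine distance behaves sensibly
        return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

Deep SORT then consumes those vectors in the place where it would otherwise use the output of its custom appearance model.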
Can someone explain to me how saying “we utilize CLIP” and “we are zero-shot” aren’t contradictory? You’d be hard-pressed to come up with a class of things from reality or popular fiction that CLIP did not see a few shots of.
Zero-shot means it can generalize to things it's never seen before.
You don't need to train CLIP on any examples of "The stacks from Ready Player One drawn by MC Escher" to have it do a pretty good job of being able to judge which outputs of a GAN more closely resemble that concept[1].
Saying this isn't "zero-shot" because CLIP has been trained on a corpus (the Internet) that had some info about the setting of the movie Ready Player One and the artist MC Escher isn't a particularly useful interpretation. By the same reasoning, humans aren't capable of "zero-shot" classification either because they have prior knowledge of real-world concepts.
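"Judging" here is just embedding comparison: encode the prompt once, encode each candidate image, and rank by cosine similarity. It's the same primitive as the classification snippet above, just ranking images against one prompt instead of prompts against one image (file names below are placeholders):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompt = clip.tokenize(["The stacks from Ready Player One drawn by MC Escher"]).to(device)
    candidates = ["candidate_0.png", "candidate_1.png", "candidate_2.png"]  # placeholder images
    images = torch.stack([preprocess(Image.open(f)) for f in candidates]).to(device)

    with torch.no_grad():
        text_feat = model.encode_text(prompt)
        img_feats = model.encode_image(images)

    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ text_feat.T).squeeze(1)   # cosine similarity per candidate

    print(candidates[scores.argmax().item()])       # the candidate CLIP thinks best matches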
The model does object tracking well whether you feed it fish, playing cards, or sperm because the generalized feature vector it extracts represents all the possible properties. It "knows" (and uses) properties like "facing right", "profile view", "blue sheen", "foggy", etc. to make its determination of which instances are which.
CLIP has seen zero segmentation masks or bounding-box-labeled examples of any object classes. Pure image-level contrastive learning against (extremely) noisy natural-language text captions.
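A toy version of the matching step those feature vectors feed into (real Deep SORT also gates matches with a Kalman-filter motion model; this only shows the cosine-distance part, and the 0.2 threshold is an arbitrary assumption):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_by_appearance(track_feats, det_feats, max_cosine_dist=0.2):
        """Assign existing tracks to new detections by appearance alone.
        Both inputs are arrays of L2-normalized feature vectors, one per row."""
        cost = 1.0 - track_feats @ det_feats.T      # cosine distance matrix
        rows, cols = linear_sum_assignment(cost)    # Hungarian assignment
        return [(t, d) for t, d in zip(rows, cols) if cost[t, d] <= max_cosine_dist]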
Problem areas are usually when objects cross over each other, and when the tracked object starts moving faster (e.g. a sudden speed-up in runners, and so on).
I believe DLib's tracker is based on tracking hand-crafted features - probably CAMShift or similar (akin to the HOG features in their non-CNN face detector). I'd also note that it hasn't been updated since 2018.
That means the accuracy will be a lot lower - although it will be fast.
But the Roboflow tracker is using OpenAI CLIP, which seems like complete overkill and will be pretty slow!
There are also Siamese networks; they're pretty fast and simple, and people keep updating their backbone CNNs to more modern (computationally expensive) ones.
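For reference, DLib's correlation_tracker is detector-free and very simple to drive; a minimal sketch with placeholder frame files and a hand-picked initial box:

    import dlib
    import numpy as np
    from PIL import Image

    tracker = dlib.correlation_tracker()

    # Placeholder frame files; any sequence of RGB numpy arrays works
    frames = [np.array(Image.open(f"frame_{i:04d}.jpg")) for i in range(60)]

    # Initialize on the first frame with a box given as (left, top, right, bottom)
    tracker.start_track(frames[0], dlib.rectangle(100, 80, 220, 240))

    for frame in frames[1:]:
        tracker.update(frame)           # pure correlation-filter update per frame
        pos = tracker.get_position()    # dlib.drectangle with the new box
        print(pos.left(), pos.top(), pos.right(), pos.bottom())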
Still a very interesting-looking result though :)