This looks interesting, although even in their sample video you can see a lot of tracking loss and re-identification events.
I do wonder how it would perform in more difficult scenarios with problems like partial obstruction, crowding and changing object scale (e.g. objects moving away from the camera).
I implemented a production object-tracking system (using tracking by detection), and these were the main challenges I encountered.
I know what object tracking is, but how can you do it with zero shots of the object? I was curious what this name refers to, but it's quite buried: from the blog post, click through to the repository, then in the readme click on OpenAI's CLIP zero-shot image classifier to find:
> "Zero-shot learning" is when a model attempts to predict a class it saw zero times in the training data. So, using a model trained on exclusively cats and dogs to then detect raccoons.
Wikipedia on ZSL gives this example:
> For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an AI which has been trained to recognize horses, but has never seen a zebra, can still recognize a zebra if it also knows that zebras look like striped horses.
Coming back to the blog post with this understanding, this part is interesting:
> The breakthrough in our zero shot object tracking repository is to use generalized CLIP object features, eliminating the need for you to make additional object track annotations
Where "CLIP is a neural network trained on a variety of (image, text) pairs." (From the first source again, so CLIP is a pre-trained model, not the algorithm that you need to make the model.)
Looking a bit at the CLIP repository from openai:
> CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples
Here it says they don't use any labeled examples? So it just knows that a fish is called a fish? This field is confusing and such a rabbit hole ^^
We could have done a better job of explaining CLIP in this post, but we've written about it quite extensively before[1][2][3][4][5] if you're interested. I'm convinced CLIP is the most important advancement in machine learning in the past year; it's incredibly versatile.
In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an illustration of Deadpool pretending to be a bunny rabbit"[6] and "an underwater scene in the style of Vincent Van Gogh"[7] and any other concept you can come up with[8] (even though it has definitely never seen those things in its training data[9]).
This is how these CLIP+VQGAN notebooks can create such a symphony of artistic renderings[10] (CLIP steers the GANs towards its interpretation of any English string that the artist can imagine).
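If you want a feel for what "any concept you can come up with" means in practice, zero-shot classification with the openai/clip package is only a few lines. This is just an illustrative sketch; the image path and prompt strings are placeholders, not anything from the post:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Any English phrases can serve as the "classes", including ones CLIP never saw verbatim
    prompts = ["a photo of a cat",
               "a photo of a dog",
               "an illustration of Deadpool pretending to be a bunny rabbit"]

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)   # image-to-text similarity logits
        probs = logits_per_image.softmax(dim=-1)

    print(prompts[probs.argmax().item()])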
Thanks, but I think that lucb1e's confusion was probably the same as mine -- given pretrained CLIP features, how is this translated to zero-shot tracking?
Are initial bounding boxes given as usual, or are objects of interest created automagically?
It uses an object detection model (in our example code[1], we used one from Roboflow Universe[2] but you should be able to use any object detection model) to get the bounding boxes and then sends a crop of each detected box to CLIP to get the feature vector that Deep SORT uses to differentiate between and track instances across frames.
This is in comparison to the original Deep SORT[3] which requires you to train a second custom "deep appearance descriptor" model for the tracker to use.
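Roughly, the glue between the detector and the tracker looks something like the sketch below. This is not the repo's actual code, just the idea under assumed interfaces (a PIL frame and (x1, y1, x2, y2) boxes from whatever detector you use):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def appearance_features(frame, boxes):
        """Crop each detection from the frame and embed the crop with CLIP.
        `frame` is a PIL image; `boxes` is a list of (x1, y1, x2, y2) tuples."""
        crops = [preprocess(frame.crop(box)) for box in boxes]
        batch = torch.stack(crops).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        # L2-normalize so the tracker's cosine distance behaves sensibly
        return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

Deep SORT then consumes those vectors in the place where it would otherwise use the output of its custom appearance model.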
Can someone explain to me how saying “we utilize CLIP” and “we are zero-shot” aren’t contradictory? You’d be hard-pressed to come up with a class of things from reality or popular fiction that CLIP did not see a few shots of.
Zero-shot means it can generalize to things it's never seen before.
You don't need to train CLIP on any examples of "The stacks from Ready Player One drawn by MC Escher" to have it do a pretty good job of being able to judge which outputs of a GAN more closely resemble that concept[1].
Saying this isn't "zero-shot" because CLIP has been trained on a corpus (the Internet) that had some info about the setting of the movie Ready Player One and the artist MC Escher isn't a particularly useful interpretation. By the same reasoning, humans aren't capable of "zero-shot" classification either because they have prior knowledge of real-world concepts.
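"Judging" here is just embedding comparison: encode the prompt once, encode each candidate image, and rank by cosine similarity. It's the same primitive as the classification snippet above, just ranking images against one prompt instead of prompts against one image (file names below are placeholders):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompt = clip.tokenize(["The stacks from Ready Player One drawn by MC Escher"]).to(device)
    candidates = ["candidate_0.png", "candidate_1.png", "candidate_2.png"]  # placeholder images
    images = torch.stack([preprocess(Image.open(f)) for f in candidates]).to(device)

    with torch.no_grad():
        text_feat = model.encode_text(prompt)
        img_feats = model.encode_image(images)

    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ text_feat.T).squeeze(1)   # cosine similarity per candidate

    print(candidates[scores.argmax().item()])       # the candidate CLIP thinks best matches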
The model does object tracking well whether you feed it fish, playing cards, or sperm because the generalized feature vector it extracts represents all the possible properties. It "knows" (and uses) properties like "facing right", "profile view", "blue sheen", "foggy", etc. to make its determination of which instances are which.
CLIP has seen zero segmentation masks or bounding-box-labeled examples of any object classes. Pure image-level contrastive learning against (extremely) noisy natural-language text captions.
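A toy version of the matching step those feature vectors feed into (real Deep SORT also gates matches with a Kalman-filter motion model; this only shows the cosine-distance part, and the 0.2 threshold is an arbitrary assumption):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_by_appearance(track_feats, det_feats, max_cosine_dist=0.2):
        """Assign existing tracks to new detections by appearance alone.
        Both inputs are arrays of L2-normalized feature vectors, one per row."""
        cost = 1.0 - track_feats @ det_feats.T      # cosine distance matrix
        rows, cols = linear_sum_assignment(cost)    # Hungarian assignment
        return [(t, d) for t, d in zip(rows, cols) if cost[t, d] <= max_cosine_dist]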
Problem areas are usually when objects cross over each other, and when the tracked object starts moving faster (e.g. a sudden speed-up in runners, and so on).
I believe DLib's tracker is based on tracking hand-crafted features - probably CAMShift or similar (akin to the HOG features in their non-CNN face detector). I'd also note that it hasn't been updated since 2018.
That means the accuracy will be a lot lower - although it will be fast.
But the Roboflow tracker is using OpenAI CLIP, which seems like complete overkill and will be pretty slow!
There are also Siamese networks; they're pretty fast and simple, and people keep updating their backbone CNNs to more modern (computationally expensive) ones.
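For reference, DLib's correlation_tracker is detector-free and very simple to drive; a minimal sketch with placeholder frame files and a hand-picked initial box:

    import dlib
    import numpy as np
    from PIL import Image

    tracker = dlib.correlation_tracker()

    # Placeholder frame files; any sequence of RGB numpy arrays works
    frames = [np.array(Image.open(f"frame_{i:04d}.jpg")) for i in range(60)]

    # Initialize on the first frame with a box given as (left, top, right, bottom)
    tracker.start_track(frames[0], dlib.rectangle(100, 80, 220, 240))

    for frame in frames[1:]:
        tracker.update(frame)           # pure correlation-filter update per frame
        pos = tracker.get_position()    # dlib.drectangle with the new box
        print(pos.left(), pos.top(), pos.right(), pos.bottom())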
Still a very interesting-looking result though :)