Has anyone (beyond maybe self-driving software) tried using object tagging as a ...

craftinator · on June 10, 2020

I've been wondering about the same thing for years. I don't do much work in the neural network subfield, but I've done a lot with computer vision and have always found myself wanting more robust physical estimation techniques that don't require external data.

joshvm · on June 10, 2020

RGB-D based semantic segmentation is certainly a thing. I'm sure it's been done with video sequences as well.

jcims · on June 10, 2020

Yeah, I wish the flagship phone manufacturers would put the hardware back into their phones to take 3D photos... even better if you could get point cloud data to go with it. The applications right now are kind of cheesy, but they will get better, and if the majority of photos taken pivot to including depth information, I think it could really drive better capabilities from our phones.

Eyes are very hard to make and coordinate, yet there are almost no cyclops in nature.

joshvm · on June 11, 2020

In theory you could also do this with visual-inertial odometry, e.g. monocular SLAM. But this is definitely something we're looking at in my group (I do CV for ecology), especially for object detection, where geometry (absolute size) is a good way to distinguish between two easily confused classes. A good candidate here is aerial imagery: if you've calibrated the camera and you know your altitude, then you know your ground sample distance (m/px).
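A minimal sketch of the GSD arithmetic, assuming a nadir-pointing pinhole camera over flat ground: GSD = altitude × (sensor width / image width in px) / focal length. The function names, the example camera numbers and the size threshold separating the two classes are hypothetical, just to show how an absolute size could break a tie between look-alike classes.

```python
def ground_sample_distance(altitude_m: float, focal_length_mm: float,
                           sensor_width_mm: float, image_width_px: int) -> float:
    """Metres per pixel for a nadir-pointing pinhole camera over flat ground."""
    # Pixel pitch on the sensor (mm/px), projected to the ground by similar triangles.
    pixel_pitch_mm = sensor_width_mm / image_width_px
    return altitude_m * pixel_pitch_mm / focal_length_mm


def object_length_m(length_px: float, gsd_m_per_px: float) -> float:
    """Convert a length measured in pixels (e.g. a bounding-box side) to metres."""
    return length_px * gsd_m_per_px


def disambiguate(length_m: float, threshold_m: float = 1.0) -> str:
    # Hypothetical classes and threshold: two animals that look alike from
    # above but differ in typical body length.
    return "large_species" if length_m >= threshold_m else "small_species"
```

With made-up values of 100 m altitude, an 8.8 mm lens, a 13.2 mm wide sensor and 5472 px across, this works out to roughly 2.7 cm/px, so a 40 px box would be about 1.1 m long. Terrain relief and camera tilt both break the flat-ground assumption, so in practice you'd want a DEM or the full camera pose.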

Most flagships can do this, though; any multi-camera phone can get some kind of stereo. Google do it with the PDAF pixels for smart bokeh (they have some nice blog posts about it). I don't know if there is a way to do that through an API, though (or to obtain the depth map).
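For the multi-camera stereo case, the core relation for a rectified pair is depth = focal length (px) × baseline / disparity. Here is a minimal sketch, assuming an already-rectified pair with square pixels and known intrinsics (the numbers in the note below are made up, not from any particular phone), that converts disparity to depth and back-projects a pixel to a 3D point, i.e. the raw material for a point cloud.

```python
from typing import Optional, Tuple


def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> Optional[float]:
    """Depth (m) for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        return None  # zero/negative disparity: no match, or point at infinity
    return focal_px * baseline_m / disparity_px


def backproject(u: float, v: float, depth_m: float, focal_px: float,
                cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) at depth Z to camera coordinates.

    Assumes square pixels (fx == fy == focal_px) and principal point (cx, cy).
    """
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)
```

With a hypothetical 1000 px focal length and 12 mm baseline, a 6 px disparity gives 2 m. Since depth error grows roughly as Z²/(f·B) per pixel of disparity error, the short baselines between phone cameras limit this to fairly close range.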

https://ai.googleblog.com/2018/11/learning-to-predict-depth-...

jcims · on June 11, 2020

High-resolution light field cameras would really help here as well. That seems a ways off, though.

Are you folks able to do any multi-spectral stuff? That seems interesting.

joshvm · on June 11, 2020

I work mostly with RGB/Thermal, if that counts. My PhD was in stereo/lidar fusion, so I've always been into mixing sensors :)

I've also done some work on satellite imaging, which is 13-band (Sentinel-2). Lots of people in ecology use the Parrot Sequoia, which is a four-band multispectral camera. There really isn't much published ML work beyond RGB, which I find interesting; yes, there's RGB-D and LIDAR, but it's mostly for driving applications.

Part of the reason I'm so familiar with the YOLO codebases is that I've had to modify them a lot to work with non-standard data. There's nothing that stops you from using n-channel images, but you will almost certainly have to hack every off-the-shelf solution to make it work. RGB and 8-bit are almost always hard-coded, and augmentation also often fails with non-RGB data (albumentations is good, though). A bigger issue is that there's a massive lack of good labelled datasets for non-RGB imagery.
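As a library-agnostic illustration of the "RGB and 8-bit are hard-coded" problem, here's a minimal NumPy sketch that normalizes each band of an (H, W, C) image independently (a 2nd to 98th percentile stretch, which is just one common choice) and applies flips/rotations to all bands together, so nothing assumes three channels or a 0-255 range. The function names and the channels-last layout are my assumptions, not taken from any YOLO codebase.

```python
import numpy as np


def normalize_bands(img: np.ndarray, low_pct: float = 2.0,
                    high_pct: float = 98.0) -> np.ndarray:
    """Scale each band of an (H, W, C) image to [0, 1] independently.

    Works for any channel count and dtype (e.g. 16-bit thermal or
    multispectral), instead of assuming 8-bit RGB divided by 255.
    """
    img = img.astype(np.float32)
    lo = np.percentile(img, low_pct, axis=(0, 1))   # per-band low, shape (C,)
    hi = np.percentile(img, high_pct, axis=(0, 1))  # per-band high, shape (C,)
    scale = np.where(hi > lo, hi - lo, 1.0)         # flat bands: avoid divide-by-zero
    return np.clip((img - lo) / scale, 0.0, 1.0).astype(np.float32)


def random_flip_rot90(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Geometric augmentation applied identically to every band."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]               # horizontal flip along the width axis
    k = int(rng.integers(0, 4))
    img = np.rot90(img, k, axes=(0, 1))     # rotate in the spatial plane only
    return np.ascontiguousarray(img)
```

An odd k swaps height and width on non-square tiles, and any boxes or masks need the same transform. Per-image stretching is the simplest option; computing the percentiles once over the training set keeps intensities comparable between images, which can matter for thermal data.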

On the plus side, in a landscape where everyone is fighting over COCO, there is still a lot of low-hanging fruit to pick, I think.

I've not done any hyperspectral work: (a) it's very hard to get labelled data (there's AVIRIS and maybe EO-1/Hyperion), (b) it's very hard to label because the images are enormous, and (c) the cameras are stupid expensive.

By the way, even satellite imaging ML applications tend to overwhelmingly use just the RGB channels and not the full extent of the data.
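To make that concrete, here's a minimal NumPy sketch of keeping some of the extra information: computing NDVI from the red (B4) and near-infrared (B8) Sentinel-2 bands and appending it to RGB as a fourth input channel. It assumes a channels-last stack that has already been resampled to a common grid (the native bands come at 10, 20 and 60 m) and stored in the band order listed in the code; the function names are my own, not from any library.

```python
import numpy as np

# Assumed band order for a 13-band Sentinel-2 stack (already co-registered).
S2_BANDS = ("B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8",
            "B8A", "B9", "B10", "B11", "B12")


def ndvi(stack: np.ndarray, bands=S2_BANDS, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), using B8 (NIR) and B4 (red)."""
    red = stack[..., bands.index("B4")].astype(np.float32)
    nir = stack[..., bands.index("B8")].astype(np.float32)
    return (nir - red) / (nir + red + eps)  # eps guards against 0/0 on empty pixels


def rgb_plus_ndvi(stack: np.ndarray, bands=S2_BANDS) -> np.ndarray:
    """Build a 4-channel (R, G, B, NDVI) input instead of dropping non-RGB bands."""
    rgb_idx = [bands.index(b) for b in ("B4", "B3", "B2")]  # R, G, B
    rgb = stack[..., rgb_idx].astype(np.float32)
    return np.concatenate([rgb, ndvi(stack, bands)[..., None]], axis=-1)
```

Feeding all 13 bands directly is the other option; a derived index like this is just a cheap way to keep a model's input close to 3 channels while still using the NIR information.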

jcims · on June 11, 2020

Whoa, that's awesome! Love hearing about contemporary technology being used to detect, diagnose and monitor the environment and our ecological impact. Boots on the ground will always be important, but I would imagine the horizontal scaling you can get out of imaging really helps prioritize where you turn your attention. Thanks for the info and best of luck!