
Has anyone (beyond maybe self-driving software) tried using object tagging as a way to start introducing physics into a scene? E.g. human and bicycle have same motion vector, increases likelihood that human is riding bicycle. Bicycle and human have size and weight ranges that could be used to plot trajectory. Bicycles riding in a straight line and trees both provide some cues as to the gravity vector in the scene. Etc. etc.
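To make the shared-motion-vector cue concrete, here is a minimal sketch of how two tracked objects could be scored for moving together. The function name `motion_agreement`, the 2D velocity tuples, and the scoring rule (direction agreement times speed ratio) are all assumptions for illustration, not something taken from an existing tracker.

```python
import math

def motion_agreement(v_a, v_b, eps=1e-9):
    """Score in [0, 1] for how well two 2D velocity vectors agree.

    Hypothetical heuristic: cosine similarity (clamped at 0) multiplied by
    the ratio of the slower speed to the faster one.
    """
    speed_a = math.hypot(v_a[0], v_a[1])
    speed_b = math.hypot(v_b[0], v_b[1])
    # A (near-)stationary object gives no motion evidence either way.
    if speed_a < eps or speed_b < eps:
        return 0.0
    cos = (v_a[0] * v_b[0] + v_a[1] * v_b[1]) / (speed_a * speed_b)
    direction = max(0.0, cos)  # opposite directions count as no agreement
    speed_ratio = min(speed_a, speed_b) / max(speed_a, speed_b)
    return direction * speed_ratio

# Person and bicycle tracks with identical velocity (units are arbitrary).
person_v = (3.0, 0.0)
bike_v = (3.0, 0.0)
score = motion_agreement(person_v, bike_v)
```

A score near 1 would only raise the likelihood of a "riding" relation; you would probably still want the bounding boxes to overlap before believing it.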

Seems like the camera motion is probably already solved with optical flow/photogrammetry stuff, but you might be able to use that to help scale the scene and start filtering your tagging based on geometric likelihood.

The idea of hierarchical reference frames (outlined a bit by Jeff Hawkins here https://www.youtube.com/watch?v=-EVqrDlAqYo&t=3025 ) seems pretty compelling to me for contextualizing scenes to gain comprehension. Particularly if you build a graph from those reference frames and situate models tuned to the type of object at the root of each frame (vertex). You could use that to help each model learn, too. So if a bike model projects a 'riding' edge towards the 'person' model, there wouldn't likely be much learning, e.g. [Person]-(rides)->[Bike] would likely have been encountered already.

However if the [Bike] projects the (rides) edge towards the [Capuchin] sitting in the seat, the [Capuchin] model might learn that capuchins can (ride) and furthermore they can (ride) a [Bike].
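A rough sketch of that novelty idea, assuming relations are stored as plain (subject, relation, object) triples. The `RelationGraph` class and its `observe` method are hypothetical names, and "learning" is reduced here to flagging a triple that has not been seen before.

```python
from collections import defaultdict

class RelationGraph:
    """Counts (subject, relation, object) triples seen so far (illustrative only)."""

    def __init__(self):
        self._seen = defaultdict(int)

    def observe(self, subject, relation, obj):
        """Record a triple; return True if it is novel, i.e. worth learning from."""
        key = (subject, relation, obj)
        novel = self._seen[key] == 0
        self._seen[key] += 1
        return novel

    def known_subjects(self, relation, obj):
        """All subjects already observed with this relation and object."""
        return sorted(s for (s, r, o) in self._seen if r == relation and o == obj)

graph = RelationGraph()
graph.observe("Person", "rides", "Bike")                     # first sighting: novel
repeat = graph.observe("Person", "rides", "Bike")            # already known
surprise = graph.observe("Capuchin", "rides", "Bike")        # novel again
```

In a real system the novel triple would presumably trigger an update of the [Capuchin] model rather than just return a flag.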



I've been wondering about the same things for years. I don't do much work in the neural network subfield, but have done a lot with computer vision, and always found myself wanting more robust physical estimation techniques that didn't require external data.


RGB-D based semantic segmentation is certainly a thing. I'm sure it's also been done with video sequences as well.


Yeah, I wish the flagship phone manufacturers would put the hardware back into phones to take 3D photos... even better if you can get point cloud data to go with it. The applications right now are kind of cheesy, but they will get better, and if the majority of photos taken pivot to including depth information, I think it could really drive better capabilities from our phones.

Eyes are very hard to make and coordinate, yet there are almost no cyclops in nature.


In theory you could also do this with visual-inertial odometry, e.g. monocular SLAM. But this is definitely something we're looking at in my group (I do CV for ecology), especially for object detection where geometry (absolute size) is a good way to distinguish between two confusing classes. A good candidate here is aerial imagery. If you've calibrated the camera and you know your altitude, then you know your ground sample distance (m/px).
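For reference, the ground sample distance calculation is just similar triangles. A minimal sketch, assuming a nadir-pointing pinhole camera over flat ground; the function names and the example numbers are illustrative, not from any particular drone or sensor.

```python
def ground_sample_distance(altitude_m, focal_length_mm, pixel_pitch_um):
    """Metres on the ground covered by one pixel.

    Assumes the camera looks straight down at flat terrain:
    GSD = altitude * pixel_pitch / focal_length.
    """
    pixel_pitch_m = pixel_pitch_um * 1e-6
    focal_length_m = focal_length_mm * 1e-3
    return altitude_m * pixel_pitch_m / focal_length_m

def object_size_m(size_px, gsd_m_per_px):
    """Approximate real-world extent of a detection that is size_px pixels long."""
    return size_px * gsd_m_per_px

# Made-up example: 100 m altitude, 10 mm lens, 5 micron pixels.
gsd = ground_sample_distance(altitude_m=100.0, focal_length_mm=10.0, pixel_pitch_um=5.0)
length = object_size_m(size_px=40, gsd_m_per_px=gsd)
```

With an absolute size per detection you can reject classes whose plausible size range doesn't fit, which is the disambiguation cue described above.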

Most flagships can do this though; any multi-camera phone can get some kind of stereo. Google do it with the PDAF pixels for smart bokeh (they have some nice blog posts about it). I don't know if there is a way to do that in an API though (or to obtain the depth map).

https://ai.googleblog.com/2018/11/learning-to-predict-depth-...


High resolution light field cameras would really help here as well. That seems a ways off though.

Are you folks able to do any multi-spectral stuff? That seems interesting.


I work mostly with RGB/Thermal, if that counts. My PhD was in stereo/lidar fusion, so I've always been into mixing sensors :)

I've also done some work on satellite imaging which is 13-band (Sentinel 2). Lots of people in ecology use the Parrot Sequoia, which is four-band multispectral. There really isn't much published work in ML beyond RGB, which I find interesting - yes, there's RGB-D and LIDAR, but it's mostly for driving applications. Part of the reason I'm so familiar with the YOLO codebases is that I've had to modify them a lot to work with non-standard data. There's nothing that stops you from using n-channel images, but you will almost certainly have to hack every off-the-shelf solution to make it work. RGB and 8-bit are almost always hard-coded, and augmentation also often fails with non-RGB data (albumentations is good though). A bigger issue is that there's a massive lack of good labelled datasets for non-RGB imagery.
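To illustrate the "RGB and 8-bit hard-coded" problem, here is a small pure-Python sketch of preprocessing that takes the channel count and bit depth from the data instead of assuming 3 and 8. The nested-list H x W x C layout and both function names are assumptions for illustration; real pipelines would use arrays.

```python
def normalise_bands(image, bit_depth):
    """Scale an H x W x C nested-list image to floats in [0, 1].

    Works for any number of channels; the divisor comes from bit_depth
    rather than a hard-coded 255.
    """
    max_value = float((1 << bit_depth) - 1)
    return [[[value / max_value for value in pixel] for pixel in row] for row in image]

def band_means(image):
    """Per-channel mean, with the channel count read from the first pixel."""
    n_channels = len(image[0][0])
    sums = [0.0] * n_channels
    count = 0
    for row in image:
        for pixel in row:
            for c in range(n_channels):
                sums[c] += pixel[c]
            count += 1
    return [s / count for s in sums]

# Tiny made-up 1 x 2 image with four 16-bit bands.
raw = [[[0, 65535, 0, 65535], [65535, 65535, 0, 0]]]
scaled = normalise_bands(raw, bit_depth=16)
means = band_means(scaled)
```

The same idea applies to the first convolution layer and the augmentation code: anything that assumes exactly three channels has to be parameterised.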

On the plus side, in a landscape where everyone is fighting over COCO, there is still a lot of low hanging fruit to pick I think.

I've not done any hyperspectral work: (a) it's very hard to get labelled data (there's AVIRIS and EO-1/Hyperion maybe), (b) it's very hard to label, since the images are enormous, and (c) the cameras are stupid expensive.

By the way, even satellite imaging ML applications tend to overwhelmingly use just the RGB channels and not the full extent of the data.


Whoa, that's awesome! Love hearing about contemporary technology being used to detect/diagnose/monitor the environment and our ecological impact. Boots on the ground will always be important, but I would imagine the horizontal scaling you can get out of imaging really helps prioritize where you turn your attention. Thanks for the info and best of luck!



