Quoting from GP of your original reply:

> If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product?

If you're stuck with a monocular dataset after collection, then sure, use a NN and call it a day. But even if all you have is video, you can do 3D reconstruction just from the baseline that camera movement gives you (structure from motion). You won't know absolute scale, so you can't differentiate between big Coke cans and little Coke cans, but at least you can rule out pictures of Coke cans.
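
To make that concrete, here's a rough synthetic sketch in NumPy. It's not a real SfM pipeline, and every number in it (can radius, distance, baseline, focal length 1, pure sideways translation) is an assumption picked for illustration. It checks two things: a flat picture of a can is explained exactly by a single homography between two views while a real can is not, and scaling the scene together with the baseline leaves the images unchanged, which is why size can't be recovered.

```python
import numpy as np

# Assumed values: can radius, distance to camera, camera baseline (metres).
R, Z0, B = 0.033, 0.5, 0.1

def make_points(flat):
    """Sample the visible half of a can, or a flat picture with the same outline."""
    theta, y = np.meshgrid(np.linspace(-1.4, 1.4, 15), np.linspace(-0.06, 0.06, 10))
    theta, y = theta.ravel(), y.ravel()
    depth = np.zeros_like(theta) if flat else R * np.cos(theta)
    return np.column_stack([R * np.sin(theta), y, Z0 - depth])

def project(points, cam_pos):
    """Pinhole projection (focal length 1, no rotation) from a camera at cam_pos."""
    rel = points - cam_pos
    return rel[:, :2] / rel[:, 2:3]

def homography_rms(x1, x2):
    """Fit a homography x1 -> x2 with the DLT and return the RMS transfer error."""
    rows = []
    for (u, v), (up, vp) in zip(x1, x2):
        rows.append([-u, -v, -1, 0, 0, 0, up * u, up * v, up])
        rows.append([0, 0, 0, -u, -v, -1, vp * u, vp * v, vp])
    # The homography is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    mapped = np.column_stack([x1, np.ones(len(x1))]) @ H.T
    mapped = mapped[:, :2] / mapped[:, 2:3]
    return float(np.sqrt(np.mean(np.sum((mapped - x2) ** 2, axis=1))))

cam1 = np.zeros(3)
cam2 = np.array([B, 0.0, 0.0])  # second frame: camera moved sideways by the baseline

flat_pts, can_pts = make_points(flat=True), make_points(flat=False)
flat_err = homography_rms(project(flat_pts, cam1), project(flat_pts, cam2))
can_err = homography_rms(project(can_pts, cam1), project(can_pts, cam2))
print("flat picture residual:", flat_err)
print("real can residual:    ", can_err)

# Scale ambiguity: a can 3x bigger, 3x farther away, seen with a 3x baseline,
# produces the same image coordinates.
k = 3.0
same_image = np.allclose(project(can_pts, cam2), project(k * can_pts, k * cam2))
print("scaled scene gives identical image:", same_image)
```

With real footage you'd also have to estimate the camera motion and cope with noise, so the planarity test would need a tolerance rather than an exact zero.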