Detecting stacking of products

I am trying to build an object recognition system for a retail store. I have built a network using the YOLO algorithm with a ResNet backbone, and it successfully identifies products in a frame. However, I also want to identify and count products when they are stacked vertically or placed one behind another.

As the images show, the system detects Kettle chips and Too Yumm chips as separate products when they are placed together, but two packets of the same product (Too Yumm chips, buttermilk, or Coke) are identified as a single product. How can I split these into two separate products along the edge between them? Basically, I want the model to output two separate bounding boxes even when the products are placed close to each other.

This depends on the training data. You need more training images in which the chips/buttermilk/Coke are placed close to each other, with each item labeled as its own object.
Add these images and labels to the dataset, then retrain.
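
Before collecting more images, it may help to check how many existing training images already contain two boxes of the same class sitting next to each other. The sketch below is only an illustration: the label layout (a class name plus pixel coordinates `(x1, y1, x2, y2)`), the file names, the class names, and the `max_gap` threshold are all assumptions, so adapt them to whatever label format your training pipeline uses.

```python
from collections import Counter
from itertools import combinations


def box_gap(a, b):
    """Pixel gap between two boxes (x1, y1, x2, y2); 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)  # horizontal separation
    dy = max(a[1] - b[3], b[1] - a[3], 0)  # vertical separation
    return max(dx, dy)


def adjacent_classes(labels, max_gap=10):
    """Classes that have at least two boxes within max_gap pixels of each other."""
    found = set()
    for (cls_a, box_a), (cls_b, box_b) in combinations(labels, 2):
        if cls_a == cls_b and box_gap(box_a, box_b) <= max_gap:
            found.add(cls_a)
    return found


def count_adjacent_images(dataset, max_gap=10):
    """For each class, count the images that contain an adjacent same-class pair."""
    counts = Counter()
    for labels in dataset.values():
        counts.update(adjacent_classes(labels, max_gap))
    return counts


# Hypothetical labels: image name -> list of (class name, box in pixels).
dataset = {
    "shelf_001.jpg": [
        ("too_yumm", (0, 0, 100, 200)),
        ("too_yumm", (104, 0, 204, 200)),  # second packet, 4 px to the right
        ("kettle", (300, 0, 400, 200)),
    ],
    "shelf_002.jpg": [("coke", (0, 0, 60, 180))],  # a single item, no neighbour
}

print(count_adjacent_images(dataset))
```

A class with a low count here has few examples of the side-by-side or stacked case, so it is a good candidate for the extra images and labels.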

Do you think training with Mask R-CNN would help with this problem?

If you want to use Mask R-CNN, please train with Faster R-CNN first to see what results it gives.

Also, try a bigger backbone with YOLO. For example, if you were using resnet10, try resnet18.

Sure, thank you. Will try.