Hi,
I am developing a CUDA-enabled image segmentation system.
My kernel (the Expectation stage of the EM algorithm) crashes when launched with a grid of {x=14, y=11, z=1}, but when I pad out the image appropriately, it works fine with a grid of {x=16, y=16, z=1}.
The thread block is 8x8x1 and the shared memory requirement is 1.5 KB per thread block.
I originally believed the problem was caused by exceeding a resource limit (shared memory), but if it works with MORE thread blocks then surely this can’t be the case? Could it have something to do with how the thread blocks are scheduled onto multiprocessors depending on the grid dimensions?
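One thing that may be worth ruling out is out-of-range indexing in the edge blocks, since a grid sized by rounding up usually contains threads that fall outside the image. The sketch below is plain host-side C++ that only emulates the launch arithmetic; it is not the actual kernel. The 110x85 image size and the helper names blocksNeeded and outOfBoundsThreads are hypothetical, chosen only because that size rounds up to a 14x11 grid with 8x8 blocks.

```cpp
#include <cassert>

// Ceiling division: number of blocks of size `block` needed to cover `n` elements.
inline int blocksNeeded(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Host-side emulation of a 2D launch with square blocks (hypothetical helper).
// Visits every (x, y) coordinate the grid's threads would compute and counts
// how many lie outside a width x height image. In a kernel without a guard
// such as `if (x < width && y < height)`, each of these threads would index
// past the edge of the image.
inline long outOfBoundsThreads(int width, int height, int block) {
    const int threadsX = blocksNeeded(width, block) * block;
    const int threadsY = blocksNeeded(height, block) * block;
    long count = 0;
    for (int y = 0; y < threadsY; ++y) {
        for (int x = 0; x < threadsX; ++x) {
            if (x >= width || y >= height) {
                ++count;  // this thread has no pixel to work on
            }
        }
    }
    return count;
}
```

Under these assumptions, a 110x85 image launches 112x88 threads, so some threads have no pixel to work on, whereas an image padded to 128x128 fills a 16x16 grid exactly and leaves none. If the kernel has no bounds check, that difference alone could explain why only the padded case runs cleanly.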
Any help greatly appreciated!
Damian