Kernel crashes with smaller grid size


I am developing a CUDA-enabled image segmentation system.

My kernel (Expectation stage of the EM algorithm) crashes using a grid {x=14, y=11, z=1} but when I pad out the image appropriately, it works fine with grid {x=16, y=16, z=1}.

The thread block is 8x8x1 and the shared memory requirement is 1.5 KB per thread block.

I originally believed the problem was caused by exceeding a resource limit (shared memory), but if it works with MORE thread blocks then surely this can’t be the case? Could it have something to do with how the thread blocks are scheduled onto multiprocessors depending on the grid dimensions?

Any help greatly appreciated!

You are probably writing into memory you have not allocated. You can do any of the following:

  • run your code under valgrind in emulation mode (Linux only)
  • check how you calculate your indices
  • post your code (kernel & calling part, including mem-allocation)
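For the index check, here is a minimal sketch of what I mean. The original kernel was not posted, so everything here is an assumption for illustration: the struct BlockResult, the kernel writePerBlock, the helper gridFor, and the 112x88 image size (chosen only because it gives a 14x11 grid with 8x8 blocks). The idea is to derive the grid and the allocation from the same numbers, guard every write with a bounds test, and check the error codes after the launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-block result; the real struct from the thread is unknown.
struct BlockResult {
    float sum;
    int   blockId;
};

// Ceiling division so that partially covered tiles still get a block.
static dim3 gridFor(int width, int height, dim3 block)
{
    return dim3((width  + block.x - 1) / block.x,
                (height + block.y - 1) / block.y,
                1);
}

__global__ void writePerBlock(BlockResult* results, int numResults,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;

    // Only one thread per block writes, and only if its slot exists.
    if (threadIdx.x == 0 && threadIdx.y == 0 && blockId < numResults) {
        results[blockId].sum     = (x < width && y < height) ? 1.0f : 0.0f;
        results[blockId].blockId = blockId;
    }
}

int main()
{
    const int width = 112, height = 88;   // assumed image size
    dim3 block(8, 8, 1);
    dim3 grid = gridFor(width, height, block);
    int numResults = grid.x * grid.y;     // one slot per thread block

    BlockResult* dResults = nullptr;
    cudaError_t err = cudaMalloc(&dResults, numResults * sizeof(BlockResult));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    writePerBlock<<<grid, block>>>(dResults, numResults, width, height);

    // Launch errors show up in cudaGetLastError, execution errors on sync.
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        cudaFree(dResults);
        return 1;
    }

    cudaFree(dResults);
    return 0;
}
```

If the allocation count and the grid ever come from different variables, the blockId < numResults guard turns a silent out-of-bounds write into a missing result, which is much easier to spot.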

Thanks DenisR, you were correct, it was causing writes to memory that wasn’t allocated.

I allocated memory of size numThreadBlocks * sizeof(returnStruct) to allow each thread block to write a struct. I guess the array of structs takes up more space than that in device memory due to however they are packed.
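One way to test the packing guess is to compare what the host and the device each report for the struct size. In standard C++, sizeof already includes any padding, so an array of N structs occupies exactly N * sizeof(struct); a mismatch would have to come from the host and device compilers disagreeing about the layout, or from the element count itself being too small. The sketch below is only illustrative: ReturnStruct and its fields are hypothetical stand-ins, since the real definition was not posted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the struct each thread block writes.
struct ReturnStruct {
    double logLikelihood;
    int    label;
};

// The size of an array is always count * sizeof(element), padding included.
static_assert(sizeof(ReturnStruct[4]) == 4 * sizeof(ReturnStruct),
              "array size is element count times element size");

// Ask the device compiler what it thinks the struct size is.
__global__ void reportSize(size_t* out)
{
    *out = sizeof(ReturnStruct);
}

int main()
{
    size_t* dSize = nullptr;
    cudaError_t err = cudaMalloc(&dSize, sizeof(size_t));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    reportSize<<<1, 1>>>(dSize);

    size_t deviceSize = 0;
    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(&deviceSize, dSize, sizeof(size_t),
                     cudaMemcpyDeviceToHost);
    cudaFree(dSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host sizeof = %zu, device sizeof = %zu\n",
           sizeof(ReturnStruct), deviceSize);
    return deviceSize == sizeof(ReturnStruct) ? 0 : 2;
}
```

If the two sizes agree, the next thing to look at would be whether numThreadBlocks really equals gridDim.x * gridDim.y for the grid that was launched.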