Hi! I am new to CUDA and am trying to write a 3D correlation code. The code works fine for small array sizes but returns all zeros when I increase the array size. At first I thought I might be running out of device memory, but I do not get an out-of-memory error or anything similar.
For example, with a grid size of 64x64x64 I get the correct result, but if I increase the grid size to, say, 128x128x128, the output is all zeros.
The code runs fine in emulation mode for any grid size.
In the current version, all threads access the arrays directly from global memory, where they are allocated as 1D arrays. I know this carries a large performance penalty, but this is just the initial version and I want to get it right first.
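To illustrate the kind of flattened layout described above, here is a minimal host-side sketch. It assumes x is the fastest-varying index and that the arrays hold `float` values. Both are assumptions, since the post does not show the actual code. The helper names `flat_index` and `volume_bytes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: maps a 3D coordinate (x, y, z) to an offset in a
// 1D array, assuming x varies fastest and z slowest.
inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t z,
                              std::size_t nx, std::size_t ny, std::size_t nz) {
    // Guard against out-of-range coordinates, which would silently
    // read or write the wrong element in a flattened array.
    assert(x < nx && y < ny && z < nz);
    return (z * ny + y) * nx + x;
}

// Hypothetical helper: bytes needed for one nx*ny*nz array of floats
// (the float element type is an assumption).
inline std::size_t volume_bytes(std::size_t nx, std::size_t ny, std::size_t nz) {
    return nx * ny * nz * sizeof(float);
}
```

Under these assumptions, `volume_bytes(128, 128, 128)` gives the per-array footprint for the failing case, which can be compared with the device's total memory to confirm or rule out the memory hypothesis. `flat_index(127, 127, 127, 128, 128, 128)` gives the largest offset a thread should ever touch.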
Any help would be greatly appreciated.