Program returns zero for large array

Hi! I am new to CUDA and I am trying to write a 3D correlation code. The code works fine for small array sizes but returns zeros if I increase the array size. At first I thought I might be running out of device memory, but I do not get an out-of-memory error or anything similar.

E.g., if the grid size is 64*64*64, I get the correct result, but if I increase the grid size to, say, 128*128*128, it generates all zeros.

The code runs fine in emulation mode for any grid size.

In the current version, all the threads access the arrays directly from global memory (where they are allocated as 1D arrays). I know that carries a huge performance hit, but this is just the initial version and I want to get it right first.

Any help would be greatly appreciated.

Thanks,
B.S.

Check the error code for your kernel launch. You're probably accessing bad memory addresses; you could try running valgrind on your deviceemu version.
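In case it helps, here is a minimal sketch of what checking the launch error can look like. The `CUDA_CHECK` macro and the `fill` kernel are made-up placeholders, not your code, and the launch configuration is only an assumption. A kernel launch does not return an error itself, so you have to ask for it afterwards: `cudaGetLastError()` reports a bad launch configuration, and the synchronize call reports failures that happen while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): print a message
// and exit if a CUDA runtime call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real correlation kernel.
__global__ void fill(float *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;   // bounds check guards the last block
}

int main()
{
    const size_t n = 128 * 128 * 128;   // the size that fails in the post
    float *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_out, n * sizeof(float)));

    // Assumed 1D launch configuration: 256 threads per block.
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    fill<<<blocks, threads>>>(d_out, n);

    // Reports launch failures such as an invalid grid or block size.
    CUDA_CHECK(cudaGetLastError());
    // Waits for the kernel and reports errors raised during execution.
    // Toolkits older than CUDA 4.0 use cudaThreadSynchronize() instead.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_out));
    printf("kernel completed without errors\n");
    return 0;
}
```

If the launch itself is rejected, the kernel never runs and the output buffer keeps whatever it held before, which can look like "all zeros" without any crash.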

Thanks for your reply.

I do not get any error code.

Another thing: I ran the same program again today, and the time the GPU took to complete the task was much lower than it was last night. This is true for both the small and the large arrays.

Any clue?

Thanks,
B.S.