I'm writing a program to calculate forces between particles (molecular simulation). For n < 256 (n is the number of particles and is a multiple of 2), the kernel's results match the same calculation done on the CPU. For n >= 256 the program behaves strangely: I have to run it 2 or 3 times before the GPU and CPU results agree, and when n is large (approx. 2048) I can't get matching results at all. In the end, the forces returned by the GPU are all zeros.
I checked the memory usage, and the device (GeForce GTX 550 Ti) has plenty of global memory available. Does anyone know what the problem could be? Could it be in how I transfer and copy memory between device and host? I have been reading the programming guide, but I haven't found the answer. I use cudaMalloc() and cudaMemcpy() to allocate and copy the data.
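For reference, here is a minimal sketch of the allocate / copy / launch / copy-back pattern with every return code checked. The `CUDA_CHECK` macro and the `scale` kernel are hypothetical placeholders, not part of my program; only the runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaGetLastError`, `cudaDeviceSynchronize`, `cudaFree`) are real API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Placeholder kernel standing in for the force kernel.
__global__ void scale(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

int main()
{
    const int n = 2048;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)std::malloc(bytes);
    if (h == NULL) return EXIT_FAILURE;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    CUDA_CHECK(cudaMalloc(&d, bytes));
    CUDA_CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());        // reports an invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));
    std::printf("h[0] = %f\n", h[0]);

    CUDA_CHECK(cudaFree(d));
    std::free(h);
    return 0;
}
```

A kernel launch returns no error code itself, so a launch that fails (for example because the block dimensions exceed the device limit) goes unnoticed unless `cudaGetLastError()` is called afterwards. In that case the kernel never runs, and the output buffer keeps whatever it held before.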
The kernel uses a grid of blocks, with the indices i = threadIdx.x and j = threadIdx.y running from 0 to n-1. An array of size n*n stores the pair forces (between particles i and j) at index k = i + j*n. The kernel first calculates all the pair forces, synchronizes the threads, and then performs a sum over j to calculate the force on each particle. The sum is done by reduction (which is why n is a multiple of 2).
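The layout described above could be sketched roughly as follows. This is not my actual kernel: the pair "force" is a placeholder (`x[j] - x[i]`), the names `pairForces` and `reduceStep` are made up for illustration, and n is assumed to be a power of two so that the halving reduction works. The sketch also splits the two phases into separate kernel launches, because `__syncthreads()` only synchronizes the threads within one block, whereas launches issued on the same stream run in order.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Phase 1: one thread per pair (i, j); stores the pair term at k = i + j*n.
// The "force" is a placeholder (x[j] - x[i]), not a physical model.
__global__ void pairForces(const float *x, float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n) f[i + j * n] = x[j] - x[i];
}

// Phase 2: one reduction step; adds row j + half onto row j for every j < half.
__global__ void reduceStep(float *f, int n, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < half) f[i + j * n] += f[i + (j + half) * n];
}

int main()
{
    const int n = 256;                       // assumed to be a power of two
    float hx[n], hF[n];
    for (int i = 0; i < n; ++i) hx[i] = (float)i;

    float *dx = NULL, *df = NULL;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&df, (size_t)n * n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                      // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    pairForces<<<grid, block>>>(dx, df, n);

    // Each launch halves the number of active rows; launch order acts as the barrier.
    for (int half = n / 2; half > 0; half /= 2) {
        dim3 rgrid(grid.x, (half + block.y - 1) / block.y);
        reduceStep<<<rgrid, block>>>(df, n, half);
    }

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // After the reduction, row j = 0 holds the summed term for each particle i.
    cudaMemcpy(hF, df, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same sum computed on the CPU.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float cpu = 0.0f;
        for (int j = 0; j < n; ++j) cpu += hx[j] - hx[i];
        if (std::fabs(hF[i] - cpu) > 1e-3f) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dx);
    cudaFree(df);
    return mismatches != 0;
}
```

The indices here are built from `blockIdx` as well as `threadIdx`, so the block size stays fixed at 16 x 16 while the grid grows with n. With `threadIdx` alone, i and j can only span a single block.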