I am trying to port a molecular dynamics package to the GPU.
I am done with the coding. When I run my code with the deviceemu option, it works fine and the results are correct.
But when I run it without that option, the results are not correct. I believe the problem is in how I am using cudaThreadSynchronize().
Are there any rules of thumb for where to use it? How can I find out where I am going wrong?
Or is there any other way to track this down?
For what it's worth, the number of threads is less than the number of bodies (n).
The rule of thumb is: cudaThreadSynchronize() is almost always unnecessary for correct operation. All it does is wait for the kernel to complete before continuing. However, cudaMemcpy() also does this before performing the memory copy, so you don’t need cudaThreadSynchronize() before it.
Calling cudaThreadSynchronize() after a kernel is useful for two things:
1. Correct timing of the run time of the kernel. Start the timer before the kernel, run the kernel, call cudaThreadSynchronize(), then stop the timer.
2. Checking error codes generated by the kernel before moving on. By waiting until the kernel is finished, cudaThreadSynchronize() will return any errors generated during execution.
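Both uses can be sketched together roughly as follows. This is only an illustration: the kernel scale() and the helper timed_scale() are made-up names, and the block size of 256 is an arbitrary choice. It calls cudaDeviceSynchronize(), the current name for the same operation (cudaThreadSynchronize() is deprecated in newer toolkits).

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Launches the kernel, waits for it to finish, and checks for errors.
// Returns the elapsed wall-clock time in milliseconds, or -1.0 on error.
double timed_scale(float *d_data, int n, float factor)
{
    int threads = 256;                          // arbitrary block size
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n

    auto start = std::chrono::steady_clock::now();
    scale<<<blocks, threads>>>(d_data, n, factor);

    // Errors from the launch itself (for example, a bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }

    // Wait for the kernel; errors raised during execution are returned here.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }

    // Stop the timer only after the kernel has really finished.
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the synchronize call, the timer would stop right after the launch returns, which measures little more than the launch overhead.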
If your code works in emulation mode but not on the device, then it usually means one of three things: the kernel is failing to launch, the kernel is crashing during execution (due to host pointers, etc.), or there is a race condition in your kernel code. The first two problems can be checked for by calling cudaThreadSynchronize() after the kernel and looking at the return code. The last problem is much harder to solve. You have to read your code carefully and look at places where two different threads can read and write to the same memory locations.
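To make the race condition case concrete, here is a small sketch of the kind of pattern to look for. It is not taken from the code in question: the kernels sum_racy() and sum_atomic() and the helper total_energy() are hypothetical, the launch configuration is arbitrary, and it assumes a device that supports atomicAdd() on float (compute capability 2.0 or later). Each thread strides over several bodies, since there are fewer threads than bodies. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds per-body energies into one total.
// RACE: many threads read, modify, and write *total at the same time,
// so some updates can be lost on a real device.
__global__ void sum_racy(const float *energy, int n, float *total)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        *total += energy[i];
}

// Same loop, but atomicAdd() makes each update indivisible.
__global__ void sum_atomic(const float *energy, int n, float *total)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(total, energy[i]);
}

// Host helper: copies the energies to the device, sums them with the
// race-free kernel, and returns the total.
float total_energy(const float *h_energy, int n)
{
    float *d_energy = NULL, *d_total = NULL;
    float h_total = 0.0f;

    cudaMalloc((void **)&d_energy, n * sizeof(float));
    cudaMalloc((void **)&d_total, sizeof(float));
    cudaMemcpy(d_energy, h_energy, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_total, &h_total, sizeof(float), cudaMemcpyHostToDevice);

    sum_atomic<<<4, 64>>>(d_energy, n, d_total);  // fewer threads than bodies is fine

    // cudaMemcpy() waits for the kernel before copying the result back.
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_energy);
    cudaFree(d_total);
    return h_total;
}
```

A kernel like sum_racy() can appear to work in emulation, where threads do not truly run at the same time, and then give wrong totals on the device.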
This error is like the CUDA version of a segmentation fault. It is generally associated with some kind of memory access error, possibly due to passing a host pointer to the device.
If you want to pass a structure to the device with a pointer, you need to allocate memory on the device with cudaMalloc(), then copy your host structure to the device (using the pointer from cudaMalloc) with cudaMemcpy(). Then, if your kernel makes any changes to the structure and you want them back on the host, you will need to cudaMemcpy() in the reverse direction (device to host) after the kernel.
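A minimal sketch of that round trip, assuming a small structure with no pointer members. The names Params, fill_result(), and round_trip() are made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical structure with plain members only (no pointers inside).
struct Params {
    float scale;
    float result;
    int   count;
};

// Hypothetical kernel: a single thread writes a result into the structure.
__global__ void fill_result(Params *p)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        p->result = p->scale * p->count;
}

// Host helper: copies the structure to the device, runs the kernel,
// and copies the modified structure back.
float round_trip(float scale, int count)
{
    Params h = {scale, 0.0f, count};   // host copy
    Params *d = NULL;                  // device pointer from cudaMalloc()

    cudaMalloc((void **)&d, sizeof(Params));
    cudaMemcpy(d, &h, sizeof(Params), cudaMemcpyHostToDevice);

    fill_result<<<1, 1>>>(d);          // pass the device pointer, never &h

    // Device to host: brings the kernel's changes back.
    cudaMemcpy(&h, d, sizeof(Params), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h.result;
}
```

For example, round_trip(2.0f, 3) should return 6.0f. Passing &h to the kernel instead of d is the host pointer mistake mentioned above.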
What are the contents of your struct? There are other things to consider depending on what (and how much) you are trying to move to the device.
Basically, I am using the kernel to calculate the potential of each atom due to all the other atoms, and I write the results into the structure that I am passing. After that, I copy the results from this structure back to a result array on the host.
The structure I am passing contains 10 different arrays and 4 variables.
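One thing to watch with a structure like that: if the arrays are pointer members, copying the structure alone copies only the pointer values, so each array needs its own cudaMalloc() and cudaMemcpy(). The following is only a sketch under that assumption, cut down to two arrays and one variable. The names Atoms, compute(), and run() are hypothetical, the kernel body is a placeholder rather than a real potential calculation, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical structure: pointer members refer to separately allocated arrays.
struct Atoms {
    float *x;          // input, one value per atom
    float *potential;  // output, one value per atom
    int    n;          // number of atoms
};

// Placeholder kernel: each thread strides over several atoms, so the
// launch can use fewer threads than atoms.
__global__ void compute(Atoms a)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < a.n;
         i += gridDim.x * blockDim.x)
        a.potential[i] = 2.0f * a.x[i];   // stand-in for the real calculation
}

// Host helper: h_x and h_result are host arrays of length n.
void run(const float *h_x, float *h_result, int n)
{
    Atoms d;   // the struct lives on the host, but its pointers are device pointers
    d.n = n;
    cudaMalloc((void **)&d.x, n * sizeof(float));
    cudaMalloc((void **)&d.potential, n * sizeof(float));
    cudaMemcpy(d.x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    compute<<<2, 64>>>(d);   // struct passed by value, device pointers inside

    // Copy the results array itself back, not the struct.
    cudaMemcpy(h_result, d.potential, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d.x);
    cudaFree(d.potential);
}
```

Passing the structure by value keeps the 4 plain variables in the kernel arguments, and only the arrays need explicit copies in each direction.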