Synchronization: synchronizing an n-body problem


I am trying to port a Molecular Dynamics package to the GPU.
I am done with the coding. When I run my code with the deviceemu option, it works fine and the results are correct.

But when I run without that option, the results are not correct. I believe the problem is with how I am using cudaThreadSynchronize().
Are there any rules of thumb for where to use it? How can I find out where I am going wrong?

Or is there any other way to find out what I can do?

For what it’s worth, the number of threads is less than the number of bodies.

Please help.

How exactly are the results from the device “not correct”?

The rule of thumb is: cudaThreadSynchronize() is almost always unnecessary for correct operation. All it does is wait for the kernel to complete before continuing. However, cudaMemcpy() also does this before performing the memory copy, so you don’t need cudaThreadSynchronize() before it.

Calling cudaThreadSynchronize() after a kernel is useful for two things:

  • Correct timing of the run time of the kernel. Start the timer before the kernel, run kernel, run cudaThreadSynchronize(), then stop the timer.

  • Checking error codes generated by the kernel before moving on. By waiting until the kernel is finished, cudaThreadSynchronize() will return any errors generated during execution.
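Here is a minimal sketch of the timing pattern described above. The kernel scaleKernel and the helper timeScaleKernelMs are hypothetical names made up for illustration, and the sketch calls cudaDeviceSynchronize(), which is the name newer CUDA toolkits use for the now-deprecated cudaThreadSynchronize(). On an older toolkit, substitute cudaThreadSynchronize().

```cuda
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to time.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the wall-clock time of one launch in milliseconds, or -1.0 on error.
double timeScaleKernelMs(int n)
{
    assert(n > 0);
    float *dData = nullptr;
    if (cudaMalloc((void **)&dData, n * sizeof(float)) != cudaSuccess) return -1.0;
    cudaMemset(dData, 0, n * sizeof(float));
    cudaDeviceSynchronize();  // finish the setup before the timer starts

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto start = std::chrono::steady_clock::now();
    scaleKernel<<<blocks, threads>>>(dData, n, 2.0f);
    // The launch returns immediately; wait here so the timer covers the whole run.
    cudaError_t err = cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    cudaFree(dData);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the synchronize call, the timer would stop right after the launch was queued, so the measured time would mostly reflect launch overhead rather than the kernel itself.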

If your code works in emulation mode but not on the device, it usually means one of three things: the kernel is failing to launch, the kernel is crashing during execution (due to host pointers, etc.), or there is a race condition in your kernel code. The first two problems can be checked for by calling cudaThreadSynchronize() after the kernel and looking at the return code. The last problem is much harder to solve. You have to read your code carefully and look for places where two different threads can read and write the same memory locations.
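As a rough sketch of that check: the snippet below looks at two error codes after a launch. It uses cudaGetLastError() in addition to the synchronize call, because launch-time problems such as a bad block size are reported there. touchKernel and launchAndCheck are hypothetical names, and cudaDeviceSynchronize() stands in for cudaThreadSynchronize() as on newer toolkits.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own index into its own slot.
__global__ void touchKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Launches one block of the given size and reports both error stages.
// Returns true only if the launch and the execution both succeeded.
bool launchAndCheck(int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    int *dOut = nullptr;
    if (cudaMalloc((void **)&dOut, threadsPerBlock * sizeof(int)) != cudaSuccess)
        return false;

    touchKernel<<<1, threadsPerBlock>>>(dOut);

    // Stage 1: did the launch itself get rejected (bad configuration, etc.)?
    cudaError_t launchErr = cudaGetLastError();
    // Stage 2: did the kernel crash while running (bad pointer, etc.)?
    cudaError_t execErr = cudaDeviceSynchronize();

    if (launchErr != cudaSuccess)
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
    if (execErr != cudaSuccess)
        std::fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));

    cudaFree(dOut);
    return launchErr == cudaSuccess && execErr == cudaSuccess;
}
```

For example, launchAndCheck(128) should succeed, while launchAndCheck(4096) should be rejected at launch because 4096 threads per block is above the hardware limit (1024 on current GPUs, lower on older ones).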

OK. So we have results from sample tests that are known to be correct for the sequential code. When I run with deviceemu, my results match these, but not without that option.

How can I check whether there is a problem with the kernel launch by using cudaThreadSynchronize()? I am new to all this. Could you please elaborate a bit?

OK. So I have found out that there is some error with the kernel launch. It is giving me: “Unspecified Kernel Launch”.

FWIW, I am passing a pointer to a structure to the kernel and using it thereafter. I also found out that the first 12 threads do not run at all, possibly due to the error in the kernel launch.

How can I solve this?

Please help.

This error is like the CUDA version of a segmentation fault. It is generally associated with some kind of memory access error, possibly due to passing a host pointer to the device.

If you want to pass a structure to the device with a pointer, you need to allocate memory on the device with cudaMalloc(), then copy your host structure to the device (using the pointer from cudaMalloc) with cudaMemcpy(). Then, if your kernel makes any changes to the structure and you want them back on the host, you will need to cudaMemcpy() in the reverse direction (device to host) after the kernel.
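A minimal sketch of that round trip, assuming a struct that holds plain values only (no embedded pointers). Params, bumpKernel and roundTrip are hypothetical names for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical struct with plain values only (no embedded pointers).
struct Params {
    float scale;
    int   count;
};

// Modifies the struct in device memory through the device pointer.
__global__ void bumpKernel(Params *p)
{
    // Launched with a single thread, so there is no race on these writes.
    p->scale *= 2.0f;
    p->count += 1;
}

// Copies *host to the device, runs the kernel, and copies the result back.
bool roundTrip(Params *host)
{
    assert(host != nullptr);
    Params *dev = nullptr;
    cudaError_t err = cudaMalloc((void **)&dev, sizeof(Params));
    if (err != cudaSuccess) return false;

    // Host to device, using the pointer returned by cudaMalloc().
    err = cudaMemcpy(dev, host, sizeof(Params), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        bumpKernel<<<1, 1>>>(dev);  // pass the device pointer, never the host one
        err = cudaGetLastError();
    }
    // Device to host; cudaMemcpy() waits for the kernel before copying.
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, sizeof(Params), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err == cudaSuccess;
}
```

If the struct contains pointers to arrays, each array also needs its own cudaMalloc() and cudaMemcpy(), and the pointers stored in the struct must be device pointers before the struct itself is copied over.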

What are the contents of your struct? There are other things to consider depending on what (and how much) you are trying to move to the device.

I am doing something like this:

typedef struct MyStruct
{
float *pData;
//… any number of arrays that you’d like…
} MyStruct;

MyStruct hostStruct;
MyStruct *deviceStruct;

int iSize = 100 * sizeof( float );
float *pInputData = new float[ 100 ]; // and fill it with data…

// Allocate the array on the device; hostStruct.pData now holds a device pointer.
cudaMalloc( ( void ** )&( hostStruct.pData ), iSize);
cudaMemcpy( hostStruct.pData, pInputData, iSize, cudaMemcpyHostToDevice );

// Now copy the host structure into the device structure…
cudaMalloc( ( void ** )&( deviceStruct ), sizeof( MyStruct ) );
cudaMemcpy( deviceStruct, &hostStruct, sizeof( MyStruct ), cudaMemcpyHostToDevice );

myKernel<<< … >>>( deviceStruct, … );
cudaThreadSynchronize();

// resultData is another device array in the struct, set up the same way as pData.
cudaMemcpy(hostResultArray, hostStruct.resultData, sizeof(float)*no, cudaMemcpyDeviceToHost);

Basically, I am calculating the potential of each atom due to all the other atoms in the kernel, and writing the results into the structure that I pass in. After that, I copy the results from this structure back to a result array on the host.

The structure I am passing contains 10 different arrays and 4 variables.
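For reference, here is a rough sketch of how such a kernel could be laid out so that it still covers every atom when there are fewer threads than atoms, and so that no two threads write to the same location. Everything in it is an assumption for illustration: the Atoms layout, the names potentialKernel and computePotentials, and the placeholder 1/r pair potential are not taken from the package in question. Error checks on the allocations are left out for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical layout: every pointer in the struct is a device pointer.
struct Atoms {
    float *x, *y, *z;   // positions
    float *potential;   // one result per atom
    int    n;
};

// Placeholder 1/r pair potential; a real MD code would use its own force field.
__global__ void potentialKernel(const Atoms *atoms)
{
    int stride = gridDim.x * blockDim.x;
    // Grid-stride loop: each thread handles several atoms if threads < atoms.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < atoms->n; i += stride) {
        float sum = 0.0f;
        for (int j = 0; j < atoms->n; ++j) {
            if (j == i) continue;
            float dx = atoms->x[i] - atoms->x[j];
            float dy = atoms->y[i] - atoms->y[j];
            float dz = atoms->z[i] - atoms->z[j];
            sum += 1.0f / sqrtf(dx * dx + dy * dy + dz * dz);
        }
        atoms->potential[i] = sum;  // each thread writes only its own slot
    }
}

// Host wrapper: takes host coordinate arrays, fills hPot with the potentials.
bool computePotentials(const float *hx, const float *hy, const float *hz,
                       float *hPot, int n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    Atoms h;            // host copy of the struct, holding device pointers
    h.n = n;
    cudaMalloc((void **)&h.x, bytes);
    cudaMalloc((void **)&h.y, bytes);
    cudaMalloc((void **)&h.z, bytes);
    cudaMalloc((void **)&h.potential, bytes);
    cudaMemcpy(h.x, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h.y, hy, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h.z, hz, bytes, cudaMemcpyHostToDevice);

    Atoms *d = nullptr; // the struct itself must also live in device memory
    cudaMalloc((void **)&d, sizeof(Atoms));
    cudaMemcpy(d, &h, sizeof(Atoms), cudaMemcpyHostToDevice);

    potentialKernel<<<2, 32>>>(d);  // deliberately few threads
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hPot, h.potential, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    cudaFree(h.x); cudaFree(h.y); cudaFree(h.z); cudaFree(h.potential);
    return err == cudaSuccess;
}
```

Because each thread only reads the positions and writes its own potential[i], no synchronization is needed inside the kernel. A race would appear if several threads accumulated into the same potential entry, for example when each pair is computed once and added to both atoms.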