Synchronization: synchronizing an n-body problem

mayank · September 19, 2009, 10:32pm

Hey,

I am trying to port a Molecular Dynamics package to the GPU.
I am done with the coding. When I build my code with the -deviceemu option, it works fine and the results are correct.

But when I run without that option, the results are not correct. I believe the problem is in how I am using cudaThreadSynchronize().
Are there any rules of thumb for where to use it? How can I find out where I am going wrong?

Or is there any other way to track this down?

For what it is worth, the number of threads is less than the number of bodies.

Please help.

avidday · September 20, 2009, 7:12am

How exactly are the results from the device “not correct”?

seibert · September 20, 2009, 1:55pm

The rule of thumb is: cudaThreadSynchronize() is almost always unnecessary for correct operation. All it does is wait for the kernel to complete before continuing. However, cudaMemcpy() also does this before performing the memory copy, so you don’t need cudaThreadSynchronize() before it.

Calling cudaThreadSynchronize() after a kernel is useful for two things:

1. Correct timing of the run time of the kernel. Start the timer before the kernel, run kernel, run cudaThreadSynchronize(), then stop the timer.
2. Checking error codes generated by the kernel before moving on. By waiting until the kernel is finished, cudaThreadSynchronize() will return any errors generated during execution.
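
The timing recipe in the first item can be sketched roughly as follows. This is only an illustration: scaleKernel and timeScaleKernel are hypothetical names, the host timer is std::chrono (an assumption, any wall-clock timer would do), and d_data is assumed to already point to device memory.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: scales each element in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Returns the wall-clock time of one kernel run in milliseconds,
// or a negative value if the synchronize call reported an error.
double timeScaleKernel(float *d_data, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Start the timer before the kernel.
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);

    // The launch returns immediately, so wait for the kernel to finish
    // before stopping the timer. (Later toolkits name this call
    // cudaDeviceSynchronize().)
    cudaError_t err = cudaThreadSynchronize();

    std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the synchronize call, the timer would mostly measure the cost of queuing the launch rather than the kernel itself.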

If your code works in emulation mode but not on the device, it usually means one of three things: the kernel is failing to launch, the kernel is crashing during execution (due to host pointers, etc.), or there is a race condition in your kernel code. The first two problems can be checked for by calling cudaThreadSynchronize() after the kernel and looking at the return code. The last problem is much harder to solve. You have to read your code carefully and look for places where two different threads can read and write the same memory locations.
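
A minimal sketch of checking the return code after a kernel might look like this. The kernel and function names (fillKernel, launchAndCheck) are hypothetical. The extra cudaGetLastError() call right after the launch is an addition not mentioned above; it is a common way to catch launches that are rejected outright, for example because of an invalid block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: writes each index into out.
__global__ void fillKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

// Launches the kernel with the given configuration and returns the first
// error that is reported, or cudaSuccess if the kernel ran to completion.
cudaError_t launchAndCheck(int *d_out, int n, int blocks, int threads)
{
    fillKernel<<<blocks, threads>>>(d_out, n);

    // Errors detected at launch time (bad configuration and similar).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel was running (bad pointers and similar).
    // Later toolkits name this call cudaDeviceSynchronize().
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Printing cudaGetErrorString(err) turns the numeric code into a readable message, which is usually the first clue about which of the failure modes you are hitting.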

mayank · September 20, 2009, 5:27pm

OK. We have results from sample tests that are known to be correct for the sequential code. When I run with -deviceemu, my results match them, but without that option they do not.

mayank · September 20, 2009, 5:29pm

How can I use cudaThreadSynchronize() to check whether there is a problem with the kernel launch? I am new to all this. Could you please elaborate a bit?

mayank · September 20, 2009, 7:06pm

OK. So I have found out that there is an error with the kernel launch. It is giving me: “Unspecified Kernel Launch”.

FWIW, I am passing a pointer to a structure to the kernel and using it there. I also found out that the first 12 threads do not run at all, possibly because of the launch error.

How can I solve this?

Please help.

seibert · September 20, 2009, 9:19pm

This error is like the CUDA version of a segmentation fault. It is generally associated with some kind of memory access error, possibly due to passing a host pointer to the device.

If you want to pass a structure to the device with a pointer, you need to allocate memory on the device with cudaMalloc(), then copy your host structure to the device (using the pointer from cudaMalloc) with cudaMemcpy(). Then, if your kernel makes any changes to the structure and you want them back on the host, you will need to cudaMemcpy() in the reverse direction (device to host) after the kernel.
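
As a rough sketch of that round trip, assuming a small made-up struct: Particles, doubleAndSum and runOnDevice are hypothetical names, and the kernel uses a single block of 256 threads purely to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical struct used only for illustration.
struct Particles {
    float *values;  // must hold a DEVICE pointer when used by the kernel
    int    count;   // number of elements in values
    float  sum;     // written by the kernel
};

// Doubles every value, then thread 0 stores the total in the struct.
// Assumes a single block with at least p->count threads.
__global__ void doubleAndSum(Particles *p)
{
    int i = threadIdx.x;
    if (i < p->count)
        p->values[i] *= 2.0f;
    __syncthreads();  // all doubled values are visible before summing
    if (i == 0) {
        float s = 0.0f;
        for (int k = 0; k < p->count; ++k)
            s += p->values[k];
        p->sum = s;
    }
}

// Copies hostValues to the device, runs the kernel, and copies the doubled
// values and the sum back. Returns false if any CUDA call fails.
bool runOnDevice(float *hostValues, int count, float *sumOut)
{
    if (count <= 0 || count > 256)
        return false;
    size_t bytes = count * sizeof(float);

    Particles h;            // host copy of the struct
    h.values = NULL;
    h.count = count;
    h.sum = 0.0f;
    Particles *d = NULL;    // device copy of the struct

    // 1. Device array, reached through the pointer stored in the host struct.
    // 2. Device struct, filled by copying the host struct (pointer included).
    bool ok = cudaMalloc((void **)&h.values, bytes) == cudaSuccess
           && cudaMemcpy(h.values, hostValues, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMalloc((void **)&d, sizeof(Particles)) == cudaSuccess
           && cudaMemcpy(d, &h, sizeof(Particles), cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        doubleAndSum<<<1, 256>>>(d);

        // 3. Reverse direction: struct first, then the array it points to.
        Particles result;
        ok = cudaThreadSynchronize() == cudaSuccess
          && cudaMemcpy(&result, d, sizeof(Particles), cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(hostValues, result.values, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        if (ok)
            *sumOut = result.sum;
    }

    cudaFree(h.values);     // cudaFree(NULL) is a no-op
    cudaFree(d);
    return ok;
}
```

The point to notice is that the host never dereferences h.values or d. Both hold device addresses and are only ever handed to cudaMemcpy(), cudaFree() or the kernel.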

What are the contents of your struct? There are other things to consider depending on what (and how much) you are trying to move to the device.

mayank · September 20, 2009, 9:47pm

I am doing something like this:

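typedef struct
{
    float *pData;       // device pointer to the input array
    float *resultData;  // device pointer to the results; must also be allocated with cudaMalloc()
    //… any number of arrays that you'd like…
} MyStruct;

MyStruct hostStruct;     // lives on the host, but its members hold device pointers
MyStruct *deviceStruct;  // device copy of the struct, passed to the kernel

int iSize = 100 * sizeof( float );
float *pInputData = new float[ 100 ]; // and fill it with data…

// Allocate the device array and copy the input data into it.
cudaMalloc( ( void ** )&( hostStruct.pData ), iSize );
cudaMemcpy( hostStruct.pData, pInputData, iSize, cudaMemcpyHostToDevice );

// Now copy the host structure into the device structure…
cudaMalloc( ( void ** )&( deviceStruct ), sizeof( MyStruct ) );
cudaMemcpy( deviceStruct, &hostStruct, sizeof( MyStruct ), cudaMemcpyHostToDevice );

myKernel<<< … >>>( deviceStruct, … );
cudaThreadSynchronize();

// Copy the results back through the device pointer kept in the host struct.
cudaMemcpy( hostResultArray, hostStruct.resultData, sizeof( float ) * no, cudaMemcpyDeviceToHost );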

Basically, I am using the kernel to calculate the potential of each atom due to all the other atoms, and writing the results into the structure that I pass in. After the kernel, I copy the results from this structure back to a result array on the host.

The structure I am passing contains 10 different arrays and 4 variables.
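
For reference, one common way to structure this kind of all-pairs calculation when there are fewer threads than bodies is a strided loop in which each thread writes only its own output elements, so no two threads touch the same location. The sketch below is not the poster's code: the 1/r pair potential, the float3 position layout, and the names potentialKernel and computePotentials are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles bodies i, i + stride, i + 2*stride, ... so the kernel
// still covers every body when there are fewer threads than bodies.
// Thread-private writes to potential[i] mean no race between threads.
__global__ void potentialKernel(const float3 *pos, float *potential, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) {
            if (j == i)
                continue;  // skip self-interaction
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            // Assumed pair potential: 1/r with unit coefficients.
            sum += 1.0f / sqrtf(dx * dx + dy * dy + dz * dz);
        }
        potential[i] = sum;
    }
}

// Copies positions to the device, runs the kernel with the given launch
// configuration, and copies the potentials back. Returns false on any error.
bool computePotentials(const float3 *hostPos, float *hostPotential, int n,
                       int blocks, int threads)
{
    float3 *dPos = NULL;
    float *dPot = NULL;
    bool ok = cudaMalloc((void **)&dPos, n * sizeof(float3)) == cudaSuccess
           && cudaMalloc((void **)&dPot, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dPos, hostPos, n * sizeof(float3), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        potentialKernel<<<blocks, threads>>>(dPos, dPot, n);
        // Later toolkits name the synchronize call cudaDeviceSynchronize().
        ok = cudaGetLastError() == cudaSuccess
          && cudaThreadSynchronize() == cudaSuccess
          && cudaMemcpy(hostPotential, dPot, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dPos);  // cudaFree(NULL) is a no-op
    cudaFree(dPot);
    return ok;
}
```

Calling computePotentials(pos, pot, 3, 1, 2) runs three bodies on only two threads; the first thread then processes bodies 0 and 2.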

mayank · September 22, 2009, 7:39pm

Anyone??
