Possible Data Run on a 1.1 Device (GeForce 9300m)

Hello there. I’m learning to use CUDA and wrote a very simple MD program using the Lennard-Jones potential. I wrote a kernel to calculate the accelerations, but there’s a possible data race when the new acceleration is added to the old one. I understand that atomicAdd would solve this, but the floating-point version doesn’t work on my board, which has compute capability 1.1. Any ideas how to fix that? The kernel and the way I call it are shown below.

P.S.: Any ideas to improve the code would be appreciated as well.

__global__ void forca (float2 *pos, float2 *acc, float box, int N, float sigma, float eps){
	float2 del;
	float r;
	int i= threadIdx.x+blockIdx.x;
	int j= threadIdx.y+blockIdx.x;

	if ((i != j) && (i < N) && (j<N)){
		...
		if (del.x > box/2) {
			del.x -= box;
		}
		else if (del.x < -box/2) {
			del.x += box;
		}
		if (del.y > box/2) {
			del.y -= box;
		}
		else if (del.y < -box/2) {
			del.y += box;
		}
		...
	}
}


float2 *pos, *vel, *acc;
float2 *pos_d, *vel_d, *acc_d;
dim3 dimBlockforca(16, 16);
int nblocksforca=(N/16)*(N/16)+1;
int nblocksmove=(N+256-1)/256;

	size_t memSize=N*sizeof(float2);
	pos=(float2 *) malloc (memSize);
	vel=(float2 *) malloc (memSize);
	acc=(float2 *) malloc (memSize);

	cudaMalloc( (void **) &pos_d, memSize );
	cudaMalloc( (void **) &vel_d, memSize );
	cudaMalloc( (void **) &acc_d, memSize );
	cudaMemcpy( pos_d, pos, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy pos");
	cudaMemcpy( vel_d, vel, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy vel");
	cudaMemcpy( acc_d, acc, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy acc");

		forca <<< nblocksforca, dimBlockforca >>> (pos_d, acc_d, box, N, sigma, eps);



I’m running on

Win7 Professional x64

Microsoft Visual Studio 2008

GeForce 9300M (DevDriver 260.99)

CUDA Toolkit 3.2

EDIT: Just noticed the mistake in the title. It should be "Possible Data Race on a 1.1 Device (GeForce 9300M)". No visible way to fix that :(


Several things:

  1. Your index calculation is wrong, so you don't have interactions between all particles.

It should be:

  int ij=blockIdx.x*blockDim.x+threadIdx.x;

  int i=ij/N;

  int j=ij%N;

  2. There are two ways to solve the race condition problem:

(i) Use only one thread for each particle (i=blockIdx.x*blockDim.x+threadIdx.x) and loop with j over all N particles (for(int j=0;j<N;j++) …). Each thread is then the only one writing to acc[i], so no atomics are needed. This approach can be improved by using shared memory.

(ii) Use neighbor lists and launch, for each atom, a block with as many threads as there are neighbors (or, more specifically, the maximum number of neighbors per particle). This approach obviously requires more code, since you need to build the neighbor lists first, and it only makes sense if you use a cutoff for the pair interaction, say 2.5 sigma.

The good thing about approach (ii) is that the runtime scales linearly with the number of atoms, as opposed to N^2 for your approach. (In fact it scales sublinearly, that is, better than linearly, as long as you don't fully utilize the GPU.) The bad thing is that it needs more memory. Additionally, you need to do a reduction (a sum over the contributions of the threads) to calculate the total force on a particle.
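To make approach (i) concrete, here is a rough sketch of what a one-thread-per-particle kernel could look like. It is not the original poster's kernel: the force expression is the standard Lennard-Jones form with unit mass assumed, the names forca_per_particle and compute_acc are made up for illustration, and the kernel overwrites acc[i] instead of adding to an old value. No shared memory is used, so it only shows how the race disappears, not how to make it fast.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per particle i. The thread loops over every j, accumulates the
// total in registers, and is the only writer of acc[i], so there is no race.
__global__ void forca_per_particle(const float2 *pos, float2 *acc,
                                   float box, int N, float sigma, float eps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    float2 pi = pos[i];
    float2 a = make_float2(0.0f, 0.0f);
    float halfBox = 0.5f * box;

    for (int j = 0; j < N; ++j) {
        if (j == i) continue;
        float dx = pi.x - pos[j].x;
        float dy = pi.y - pos[j].y;
        // Minimum-image convention for the periodic box.
        if (dx > halfBox) dx -= box; else if (dx < -halfBox) dx += box;
        if (dy > halfBox) dy -= box; else if (dy < -halfBox) dy += box;

        float r2  = dx * dx + dy * dy;
        float sr2 = sigma * sigma / r2;
        float sr6 = sr2 * sr2 * sr2;
        // Assumed force law (unit mass):
        // a_i += 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * (r_i - r_j)
        float f = 24.0f * eps * (2.0f * sr6 * sr6 - sr6) / r2;
        a.x += f * dx;
        a.y += f * dy;
    }
    acc[i] = a;  // single write per particle (overwrites, does not accumulate)
}

// Hypothetical host helper: copies positions to the device, launches the
// kernel with 1D blocks, and returns the accelerations.
std::vector<float2> compute_acc(const std::vector<float2> &pos,
                                float box, float sigma, float eps)
{
    int N = (int)pos.size();
    size_t memSize = N * sizeof(float2);
    float2 *pos_d, *acc_d;
    cudaMalloc((void **)&pos_d, memSize);
    cudaMalloc((void **)&acc_d, memSize);
    cudaMemcpy(pos_d, pos.data(), memSize, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    forca_per_particle<<<blocks, threads>>>(pos_d, acc_d, box, N, sigma, eps);

    std::vector<float2> acc(N);
    cudaMemcpy(acc.data(), acc_d, memSize, cudaMemcpyDeviceToHost);
    cudaFree(pos_d);
    cudaFree(acc_d);
    return acc;
}
```

Since nothing here relies on atomics, this should also run on a compute capability 1.1 device. The cost is that every pair is evaluated twice (once from each side), which is usually an acceptable trade for a learning program.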

You might want to take a look at the GPU MD codes which are available for download.

For example, the USER-CUDA package for LAMMPS (http://code.google.com/p/gpulammps/), which I develop, or HOOMD. Those are obviously much more sophisticated than what you are trying to do (I assume for learning purposes), but you will probably find a lot of programming and algorithmic ideas.


That did not work. My guess is that it didn’t take into account how I was calling the kernel.
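A likely reason, judging only from the launch shown in the question: the flat index ij=blockIdx.x*blockDim.x+threadIdx.x assumes 1D blocks. With dim3(16, 16) blocks, blockDim.x is 16 and threadIdx.x only runs from 0 to 15, so the 16 threads that differ only in threadIdx.y all get the same ij, and with (N/16)*(N/16)+1 blocks the index only reaches roughly N*N/16. The sketch below shows a launch that matches the flat index; decode_pairs, pair_blocks and list_pairs are hypothetical names, and the kernel only records which (i, j) each thread would handle.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread decodes one (i, j) pair from a flat 1D global index.
__global__ void decode_pairs(int2 *pairs, int N)
{
    int ij = blockIdx.x * blockDim.x + threadIdx.x;  // valid for 1D blocks only
    if (ij >= N * N) return;
    pairs[ij] = make_int2(ij / N, ij % N);           // i = ij / N, j = ij % N
}

// Hypothetical helper: number of 1D blocks needed to cover all N*N pairs.
int pair_blocks(int N, int threads)
{
    return (N * N + threads - 1) / threads;
}

// Hypothetical helper: launches decode_pairs and returns the pair table.
std::vector<int2> list_pairs(int N)
{
    int threads = 256;  // 1D block instead of dim3(16, 16)
    int2 *pairs_d;
    cudaMalloc((void **)&pairs_d, N * N * sizeof(int2));
    decode_pairs<<<pair_blocks(N, threads), threads>>>(pairs_d, N);

    std::vector<int2> pairs(N * N);
    cudaMemcpy(pairs.data(), pairs_d, N * N * sizeof(int2),
               cudaMemcpyDeviceToHost);
    cudaFree(pairs_d);
    return pairs;
}
```

Two caveats. A 1D grid on a compute capability 1.x device is limited to 65535 blocks, so with 256 threads per block this only covers N up to about 4095. And fixing the index alone does not remove the data race, because all threads with the same i would still add into the same acc[i].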

OK, THAT worked just fine :D. The speed wasn’t (very) affected. I’ll keep doing it this way.

Your assumption is right. This is just a program for learning purposes. I’ll check out those codes of yours and will probably find many interesting ideas.

Thank you very much for your help.