Possible Data Run on a 1.1 Device (GeForce 9300m)

LucasCampos · February 20, 2011, 11:23pm

Hello there. I’m learning to use CUDA and wrote a very simple MD program using the Lennard-Jones potential. I wrote a kernel to calculate the accelerations, but there’s a possible data race when the new acceleration is added to the old one. I understand that atomicAdd would solve this, but the floating-point version doesn’t work on my board, which has compute capability 1.1. Any ideas on how to fix that? The kernel and the way I call it are shown below.

P.S.: Any ideas to improve the code would be appreciated as well.

__global__ void forca (float2 *pos, float2 *acc, float box, int N, float sigma, float eps){
	float2 del;
	float r;
	int i= threadIdx.x+blockIdx.x;
	int j= threadIdx.y+blockIdx.x;
	if ((i != j) && (i < N) && (j<N)){
		del.x=pos[i].x-pos[j].x;
		del.y=pos[i].y-pos[j].y;
		if (del.x > box/2) {
			del.x -= box;
		}
		else if (del.x < -box/2) {
			del.x += box;
		}
		if (del.y > box/2) {
			del.y -= box;
		}
		else if (del.y < -box/2) {
			del.y += box;
		}
		r=del.x*del.x+del.y*del.y;
		acc[i].x+=24*eps*(pow((sigma/r),14)-2*pow((sigma/r),8))*del.x;
		acc[i].y+=24*eps*(pow((sigma/r),14)-2*pow((sigma/r),8))*del.y;
	}
}

...

float2 *pos, *vel, *acc;
float2 *pos_d, *vel_d, *acc_d;
dim3 dimBlockforca(16, 16);
int nblocksforca=(N/16)*(N/16)+1;
int nblocksmove=(N+256-1)/256;

...

	size_t memSize=N*sizeof(float2);
	pos=(float2 *) malloc (memSize);
	vel=(float2 *) malloc (memSize);
	acc=(float2 *) malloc (memSize);

...

	cudaMalloc( (void **) &pos_d, memSize );
	cudaMalloc( (void **) &vel_d, memSize );
	cudaMalloc( (void **) &acc_d, memSize );
	cudaMemcpy( pos_d, pos, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy pos");
	cudaMemcpy( vel_d, vel, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy vel");
	cudaMemcpy( acc_d, acc, memSize, cudaMemcpyHostToDevice );
	checkCUDAError("Memcpy acc");

...

		forca <<< nblocksforca, dimBlockforca >>> (pos_d, acc_d, box, N, sigma, eps);
		checkCUDAError("forca");

...

I’m running on

Win7 Professional x64

Microsoft Visual Studio 2008

GeForce 9300M (DevDriver 260.99)

CUDA Toolkit 3.2

EDIT: Just noticed the mistake in the title. It should be “Possible Data Race on a 1.1 Device (GeForce 9300M)”. No visible way to fix that :(

ceearem · February 21, 2011, 1:05pm

Hi

Several things:

Your index calculation is wrong, so you don’t get interactions between all particles.

It should be:

int ij=blockIdx.x*blockDim.x+threadIdx.x;
int i=ij/N;
int j=ij%N;
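A note on the mapping above: it decodes both particle indices from a single flat thread index, so it only reaches every pair if the kernel is launched with a 1D block and at least N*N threads in total, rather than the dim3(16, 16) block used in the original post. Below is a minimal sketch of such a launch. The `visits` array and both function names are hypothetical and not part of the thread; the kernel only records which (i, j) pairs are reached so that coverage can be checked.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one (i, j) pair decoded from a flat 1D index.
// `visits` is a hypothetical N*N flag array used only to check coverage.
__global__ void mark_pairs(int *visits, int N)
{
    int ij = blockIdx.x * blockDim.x + threadIdx.x;
    if (ij >= N * N) return;        // guard the partially filled last block
    int i = ij / N;                 // particle receiving the force
    int j = ij % N;                 // particle exerting the force
    if (i != j) visits[ij] = 1;     // each thread owns one distinct cell
}

// Hypothetical host helper: returns how many cells were marked,
// or -1 if a CUDA call fails.
int count_pairs_visited_once(int N)
{
    int *visits_d = 0;
    size_t bytes = (size_t)N * N * sizeof(int);
    if (cudaMalloc((void **)&visits_d, bytes) != cudaSuccess) return -1;
    cudaMemset(visits_d, 0, bytes);

    int threads = 256;                              // 1D block, not dim3(16, 16)
    int blocks = (N * N + threads - 1) / threads;   // enough threads for all N*N pairs
    mark_pairs<<<blocks, threads>>>(visits_d, N);

    std::vector<int> visits(N * N);
    cudaError_t err = cudaMemcpy(&visits[0], visits_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(visits_d);
    if (err != cudaSuccess) return -1;

    int marked = 0;
    for (int k = 0; k < N * N; ++k)
        if (visits[k] == 1) ++marked;
    return marked;                  // N*(N-1) when every i != j pair is covered
}
```

This only addresses pair coverage. With this layout N threads still share the same i, so accumulating directly into acc[i] would still race.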

There are two ways to solve the race condition:

(i) Use only one thread per particle (i=blockIdx.x*blockDim.x+threadIdx.x) and loop with j over all N particles (for(int j=0;j<N;j++) …). This approach can be improved by using shared memory.

(ii) Use neighbor lists and launch one block per atom, with as many threads as there are neighbors (or, more specifically, the maximum number of neighbors per particle). This approach obviously requires more code, since you need to build the neighbor lists first, and it only makes sense if you use a cutoff for the pair interaction, say 2.5 sigma.

The good thing about that approach is that the runtime scales linearly with the number of atoms, as opposed to N^2 for your approach. (In fact, it scales sublinearly [that is, better than linearly] as long as you don’t fully utilize the GPU.) The bad thing is that it needs more memory. Additionally, you need to do a reduction (a sum over the contributions of all threads in the block) to calculate the total force on a particle.
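Approach (i) could look roughly like the sketch below. It assumes a 1D launch with at least N threads. The names `pair_term`, `forca_per_particle` and `max_diff_vs_cpu` are made up for illustration, and the pair expression is copied unchanged from the original post without checking its physics. Each thread keeps its running sum in registers and is the only writer of acc[i], so no atomics are needed on a compute capability 1.1 device.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Contribution of particle j to particle i. The expression is copied as-is
// from the original post (r holds the squared distance there); verify it
// against your own force law before relying on it.
__host__ __device__ inline float2 pair_term(float2 pi, float2 pj,
                                            float box, float sigma, float eps)
{
    float dx = pi.x - pj.x;
    float dy = pi.y - pj.y;
    // Minimum-image convention for the periodic box.
    if (dx > box / 2) dx -= box; else if (dx < -box / 2) dx += box;
    if (dy > box / 2) dy -= box; else if (dy < -box / 2) dy += box;
    float r = dx * dx + dy * dy;
    float f = 24 * eps * (powf(sigma / r, 14.0f) - 2 * powf(sigma / r, 8.0f));
    return make_float2(f * dx, f * dy);
}

// One thread per particle i; the inner loop visits every j.
__global__ void forca_per_particle(const float2 *pos, float2 *acc,
                                   float box, int N, float sigma, float eps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float2 pi = pos[i];
    float2 sum = make_float2(0.0f, 0.0f);
    for (int j = 0; j < N; ++j) {
        if (j == i) continue;
        float2 f = pair_term(pi, pos[j], box, sigma, eps);
        sum.x += f.x;
        sum.y += f.y;
    }
    acc[i].x += sum.x;   // single writer per element, so no race
    acc[i].y += sum.y;
}

// Hypothetical check: runs the kernel on a small square lattice and returns
// the largest absolute difference from a serial CPU sum (-1 on a CUDA error).
float max_diff_vs_cpu(int side)
{
    int N = side * side;
    float box = 2.0f * side, sigma = 1.0f, eps = 1.0f;
    std::vector<float2> pos(N), acc(N, make_float2(0.0f, 0.0f));
    for (int k = 0; k < N; ++k)
        pos[k] = make_float2(1.5f * (k % side), 1.5f * (k / side));

    float2 *pos_d = 0, *acc_d = 0;
    size_t bytes = N * sizeof(float2);
    if (cudaMalloc((void **)&pos_d, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&acc_d, bytes) != cudaSuccess) { cudaFree(pos_d); return -1.0f; }
    cudaMemcpy(pos_d, &pos[0], bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(acc_d, &acc[0], bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;   // same rounding as nblocksmove in the post
    forca_per_particle<<<blocks, threads>>>(pos_d, acc_d, box, N, sigma, eps);
    cudaError_t err = cudaMemcpy(&acc[0], acc_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(pos_d);
    cudaFree(acc_d);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < N; ++i) {
        float2 ref = make_float2(0.0f, 0.0f);
        for (int j = 0; j < N; ++j) {
            if (j == i) continue;
            float2 f = pair_term(pos[i], pos[j], box, sigma, eps);
            ref.x += f.x;
            ref.y += f.y;
        }
        worst = fmaxf(worst, fmaxf(fabsf(acc[i].x - ref.x), fabsf(acc[i].y - ref.y)));
    }
    return worst;
}
```

The work is still N^2 overall. The shared-memory improvement mentioned in (i) would typically stage blocks of pos[j] in shared memory so that all threads in a block reuse them; that step is left out here to keep the sketch short.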

You might want to take a look at the GPU MD codes which are available for download.

For example, the USER-CUDA package for LAMMPS (Google Code Archive), which I develop, or HOOMD. Those are obviously much more sophisticated than what you are trying to do (I assume for learning purposes), but you will probably find a lot of programming and algorithmic ideas.

Ceearem

LucasCampos · February 21, 2011, 2:10pm

That did not work; my guess is that it doesn’t account for how I was calling the kernel.

OK, THAT worked just fine :D. The speed wasn’t affected (much). I’ll keep doing it this way.

Your assumption is right. This is just a program for learning purposes. I’ll check those codes of yours; I’ll probably find many interesting ideas.

Thank you very much for your help.

LucasCampos
