How to deal with multiple threads writing to the same GPU memory location?

I’m performing some operations on particles/objects in 3D space, my kernel looks something like this:

__global__ void mykernel(float *x, float *y, float *z,
                         float *data_in, float *data_out)
{
    // perform per-particle work here

    for (...) {                        // loop over this particle's contributions
        data_out[f] = something[g];    // 'f' depends on the particle's position
    }
}

The location of the index ‘f’ in data_out depends on the particle’s position. Currently, one thread performs the work for one particle. If two or more particles are close together in 3D, odds are their ‘f’ values will be the same, and the threads associated with those particles will try to write to the same location at the same time. If I run the kernel in its current form it doesn’t crash, but I often get different results for the same input data from run to run.

I can sort of get around this with a second kernel: the first kernel writes each particle’s result into its own slot in a large staging array, and the second kernel adds all the data up sequentially. But is there a better way?
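
For concreteness, the workaround looks roughly like this (compute_contribution, cell_of_particle, and the staging array are placeholder names, not my real code):

// stand-in for the real per-particle work
__device__ float compute_contribution(float x, float y, float z)
{
    return x + y + z;
}

// pass 1: every particle writes to its own private slot, so no
// two threads ever touch the same address
__global__ void scatter(const float *x, const float *y, const float *z,
                        float *staging, int num_particles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_particles)
        staging[i] = compute_contribution(x[i], y[i], z[i]);
}

// pass 2: one thread accumulates everything sequentially --
// deterministic, but completely serial
__global__ void reduce(const float *staging, const int *cell_of_particle,
                       float *data_out, int num_particles)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int i = 0; i < num_particles; ++i)
            data_out[cell_of_particle[i]] += staging[i];
}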

Not really. Global atomics would “work”, but they’re for occasional use only, not something to base the bulk of your memory accesses on. They’re horribly inefficient.
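
(For reference, the atomic version just replaces the plain store with a read-modify-write, assuming your writes are accumulations and your hardware supports atomicAdd on float. Note that the order of the additions still varies between runs, so float totals can differ in the last bits.)

for (...) {
    // was: data_out[f] = something[g];
    atomicAdd(&data_out[f], something[g]);   // correct, but serializes on every collision
}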

Maybe you can completely reinvent your algorithm so that it’s not a problem?

But honestly, outputting separate results and then reducing them sounds like it’d be fairly efficient. (You perform a few dozen operations before issuing the write, correct? If it’s somewhere between a few dozen and a few thousand, you should be fine. And instead of maintaining two separate kernels, just have one kernel flip between the two modes.) Fairly efficient is a lot better than horribly inefficient (or non-deterministic), and that’s a good trade overall.
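
One way to read the “two modes” idea, sketched with made-up names (cell, staging, n); the kernel is launched once per phase, with the launch boundary acting as the global sync between them:

__global__ void mykernel(const float *x, const float *y, const float *z,
                         float *staging, float *data_out,
                         const int *cell, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (mode == 0) {
        // phase 0: each thread writes to its own private slot
        if (i < n)
            staging[i] = x[i] + y[i] + z[i];   // stand-in for the real work
    } else {
        // phase 1: one thread folds the staging array into the output
        if (i == 0)
            for (int j = 0; j < n; ++j)
                data_out[cell[j]] += staging[j];
    }
}

// usage: mykernel<<<grid, block>>>(..., 0); then mykernel<<<grid, block>>>(..., 1);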

I might be able to rearrange it so that each thread operates on one box at a time, but first I would need to go through each particle and determine which box it’s in, something like the sketch below.
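
(BOXES_PER_AXIS and box_size are made-up names here, and this assumes positions are non-negative and lie inside the grid.)

#define BOXES_PER_AXIS 32   // hypothetical grid resolution

// map a position on a uniform grid to a flat box index
__device__ int box_index(float px, float py, float pz, float box_size)
{
    int bx = (int)(px / box_size);
    int by = (int)(py / box_size);
    int bz = (int)(pz / box_size);
    return (bz * BOXES_PER_AXIS + by) * BOXES_PER_AXIS + bx;
}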
Different boxes will then contain different numbers of particles, which raises another question: if M threads start at the same time and N of them finish early (e.g., they had 1 particle instead of 2), will the scheduler start up N other threads to take their place?

Right. I thought about such schemes, but figured it’d probably be less efficient than just outputting all results and combining them afterward. (Though with the right algorithm it might work well.) On CUDA you want a regular, symmetric algorithm.

Blocks and warps can finish early, but individual threads can’t. If only one thread in a warp finishes, it just idles until its warp comrades finish too. And if a whole warp finishes early, the other resident warps keep the GPU busy, but no new warps start until the whole block finishes. Only then is a new block loaded.