need help in designing kernel problems with different threads modifying the same return variable

Hi there,

seems I’m in over my head, but maybe I’m missing some basics.

Let me describe what the code shall do

    Calculating a kind of Huygens-like reflection of a geometric body, one has to place point scatterers on the surface and sum up their contribution to the return signal.

    Depending on the wavelength the number of scatterers is quite large; I’d like to enable an arbitrary number but would be satisfied with 1e12.

    Having that much scatterers there is no way to hold information on those in (CPU) memory, so their locations etc, are calculated on the fly in a while-continue loop. Thus creating the geometry incrementally I don’t even know beforehand, how many scatterers I’ll end up with, exactly.

    A CPU-implementation of this works just fine (if slow), compiling the same code for the GPU works, too, but on medium sized and larger geometries the kernel takes too long to complete and terminates itself.

    Thus for now I’ve made parallelization by splitting the whole set of scatterers into subsets of arbitrary size and thought of having one thread for each subgeometry.

    Problem is: each scatterer’s reflection will contribute to one cell of the return signal depending on its distance to the transmitter. There is no way to determine beforehand, which scatterers will contribute to a given cell. Even if I’d sort the scatterers according to their distance, the information gathered would be to large to fit in (even CPU-) memory.

    Thus: scatterers from different threads will contribute to the same index of the return signal, the simple line returnsig[index] = returnsig[index] + contribution; will fail because the modification to returnsig by one thread is not seen by another.

    This is demonstrated by using the emulation mode: just counting the scatterers by modifying the code above to returnsig[0] = returnsig[0] + 1; shows that emulated threads catch all facets while real GPU threads miss more than 99% (depending on subset size).

    It is no option to give each thread it’s own return signal and add up, afterwards. Some approximate numbers from my recent test example: length of the return signal: 72000, total scatterers: 1e9, subgeometries/threads: 2000. Even those modest sizes give me a CUDA error out of memory when allocating on the device.


    I have way too many scatterers to even allocate a single int targetindex[NUMBEROFSCATTERERS].

    I have too many scatterers to run all of them in one single thread because then this thread will take too long to complete.

    If I distribute the scatterers over several threads (and several does not mean, I can limit them to something like number of processors), I’ll have the problem that more than one thread will modify the same index in the return signal.

I’m out of ideas, here. I hope I’ve missed something basic that will allow several threads to consistently modify the same variable.

Or you have a bright idea how to organize the problem in a different way…

Thanks in advance

Would using the following help
When doing the add (already know how much we will add to it)

  1. read the current value
  2. update it with new value
  3. read it again
    if result is not what was expected then some other thread/block got there 1st so repeat steps 2 and 3

Alternatively if that might still not detect a collision would be to use a float2 with the returnsig.x storing the signal and returnsig.y being used to detect colisions

  1. read the current value
  2. update with new .x and .y
  3. read them again
    if either .x or .y is not expected value then repeat 2 and 3
  • in 2) might be adding say the blockId to the value in the .y, or something that will maximize chance of detecting a collision.

Another possibility is to repeat the calculations. so on one pass thread x is the ONLY thread allowed to update a particular cells of returnsig

Your problem sounds interesting, but I really don’t know what you’re trying to do. If you post some simple code that captures the essence of the problem you’re solving, I might help.

For now I’ve re-structured my problem, having another dimension I can use for parallelization (I need to repeat the steps described for different viewpoints)


Thanks for the reply, I’ll try your trick some time, but I’m afraid the proposed while(not_succeeded) might result in an endless loop, that’s kernel timeout.

@Uncle Joe

Thanks for your interest, my lengthy explanation was aimed at describing why I used the outlined approach and what alternatives are not feasible due to memory constraints.

The problem itself can be captured quite simply, for example in the untested pseudocode, below:

[codebox]#define NSAMPLES 1000

int sample[NSAMPLES];

int isone;

/* fill test array sample with 0 or 1 */

for (isample=0; isample<NSAMPLES; isample++) {

if ( rand() > 0.5*RAND_MAX ) {

sample[isample] = 1;

} else {

sample[isample] = 0;


/* count ones in test array */

isone = 0;

for isample=0; isample<NSAMPLES; isample++) {

if ( sample[isample] == 1 ) {

isone = isone + 1;



My question now is, if it is possible, to write a kernel that will replace the counting loop, each thread performing the test for one member of the test array, adding up the number of samples being 1, as sketched (again, kind of pseudocode) here:



dimBlock.y = 1;

dimBlock.z = 1;

dimGrid.x = ( NSAMPLES + dimBlock.x - 1 ) / dimBlock.x;

dimGrid.y = 1;

dimGrid.z = 1;

kernel_counter<<<dimGrid, dimBlock>>>(counter_dev, sample_dev);

static global void kernel_counter(int *counter, int *sample_dev)

isample = blockDim.x * blockIdx.x + threadIdx.x;

if ( isample >= NSAMPLES ) return;

if ( sample_dev[isample] == 1 ) *counter = *counter + 1;



OK, I see your problem is pretty simple, at least with the simple code.

*counter = *counter + 1

This is clearly a race condition: multiple threads can read the same counter value, add 1, and write back the same counter value. Since every 32 threads (warp size) operate in lock step, that

would explain why you’re missing so many counts.

In || programming, you usually want to avoid threads writing to the same variable, which is a bottleneck. A very common ||

computation is reduction, which exploits the fact that most computations are associative and can be privatized and done in ||:

((a + b) + c) + d = (a + b) + (c + d)

MakeList(a, b, c) = Concat(MakeList(a, c), MakeList(b)) (assuming you add same elements to list c)

I suggest you read Mark Harris’s report on reduction from the CUDA SDK. Basically, your want the following code:

__global__ void Count(uint *block_counts, uint *in, uint size)


  uint local_count = 0;	 // thread local

	  grid_threads = blockDim.x * gridDim.x;

  for (i = threadIdx.x; i < size; i += grid_threads)	  // each thread processes multiple elements - less overhead


	if (in[i] == 1)



__shared__ uint counts[CUDA_THREADSPERBLOCK]

  counts[threadIdx.x] = local_count;

block_counts[blockidx.x] = Reduce(counts, CUDA_THREADSPERBLOCK);


Then you would launch another kernel to reduce the gridDim.x elements in block_counts

Thanks for your answer, the keywords and the literature, I’ll look it up.

For now I’ve rewritten the kernel to only collect the data and add it up, afterwards, on CPU. Together with a bit of re-shuffling of the nested loops on the CPU, total runtime on my reference example is down from 1000s to 90s.

Having just recently realized that you can use more than one kernel in one *.cu, 'll try to identify another part of CPU-work to parallelize…