need help in designing kernel problems with different threads modifying the same return variable

ekroemer · December 30, 2009, 12:02pm

Hi there,

seems I’m in over my head, but maybe I’m missing some basics.

Let me describe what the code shall do

[*]Calculating a kind of Huygens-like reflection of a geometric body, one has to place point scatterers on the surface and sum up their contribution to the return signal.

[*]Depending on the wavelength the number of scatterers is quite large; I’d like to enable an arbitrary number but would be satisfied with 1e12.

[*]Having that much scatterers there is no way to hold information on those in (CPU) memory, so their locations etc, are calculated on the fly in a while-continue loop. Thus creating the geometry incrementally I don’t even know beforehand, how many scatterers I’ll end up with, exactly.

[*]A CPU-implementation of this works just fine (if slow), compiling the same code for the GPU works, too, but on medium sized and larger geometries the kernel takes too long to complete and terminates itself.

[*]Thus for now I’ve made parallelization by splitting the whole set of scatterers into subsets of arbitrary size and thought of having one thread for each subgeometry.

[*]Problem is: each scatterer’s reflection will contribute to one cell of the return signal depending on its distance to the transmitter. There is no way to determine beforehand, which scatterers will contribute to a given cell. Even if I’d sort the scatterers according to their distance, the information gathered would be to large to fit in (even CPU-) memory.

[*]Thus: scatterers from different threads will contribute to the same index of the return signal, the simple line returnsig[index] = returnsig[index] + contribution; will fail because the modification to returnsig by one thread is not seen by another.

This is demonstrated by using the emulation mode: just counting the scatterers by modifying the code above to returnsig[0] = returnsig[0] + 1; shows that emulated threads catch all facets while real GPU threads miss more than 99% (depending on subset size).

[*]It is no option to give each thread it’s own return signal and add up, afterwards. Some approximate numbers from my recent test example: length of the return signal: 72000, total scatterers: 1e9, subgeometries/threads: 2000. Even those modest sizes give me a CUDA error out of memory when allocating on the device.

So,

[*]I have way too many scatterers to even allocate a single int targetindex[NUMBEROFSCATTERERS].

[*]I have too many scatterers to run all of them in one single thread because then this thread will take too long to complete.

[*]If I distribute the scatterers over several threads (and several does not mean, I can limit them to something like number of processors), I’ll have the problem that more than one thread will modify the same index in the return signal.

I’m out of ideas, here. I hope I’ve missed something basic that will allow several threads to consistently modify the same variable.

Or you have a bright idea how to organize the problem in a different way…

Thanks in advance

kbam · January 4, 2010, 5:40am

Would using the following help
When doing the add (already know how much we will add to it)

read the current value
update it with new value
read it again
if result is not what was expected then some other thread/block got there 1st so repeat steps 2 and 3

Alternatively if that might still not detect a collision would be to use a float2 with the returnsig.x storing the signal and returnsig.y being used to detect colisions

read the current value
update with new .x and .y
read them again
if either .x or .y is not expected value then repeat 2 and 3

in 2) might be adding say the blockId to the value in the .y, or something that will maximize chance of detecting a collision.

Another possibility is to repeat the calculations. so on one pass thread x is the ONLY thread allowed to update a particular cells of returnsig

Uncle_Joe · January 4, 2010, 9:51am

Your problem sounds interesting, but I really don’t know what you’re trying to do. If you post some simple code that captures the essence of the problem you’re solving, I might help.

ekroemer · January 4, 2010, 11:24am

For now I’ve re-structured my problem, having another dimension I can use for parallelization (I need to repeat the steps described for different viewpoints)

@kbam:

Thanks for the reply, I’ll try your trick some time, but I’m afraid the proposed while(not_succeeded) might result in an endless loop, that’s kernel timeout.

@Uncle Joe

Thanks for your interest, my lengthy explanation was aimed at describing why I used the outlined approach and what alternatives are not feasible due to memory constraints.

The problem itself can be captured quite simply, for example in the untested pseudocode, below:

[codebox]define NSAMPLES 1000

int sample[NSAMPLES];

int isone;

/* fill test array sample with 0 or 1 */

for (isample=0; isample<NSAMPLES; isample++) {

if ( rand() > 0.5*RAND_MAX ) {

sample[isample] = 1;

} else {

sample[isample] = 0;

}

/* count ones in test array */

isone = 0;

for isample=0; isample<NSAMPLES; isample++) {

if ( sample[isample] == 1 ) {

isone = isone + 1;

}

}[/codebox]

My question now is, if it is possible, to write a kernel that will replace the counting loop, each thread performing the test for one member of the test array, adding up the number of samples being 1, as sketched (again, kind of pseudocode) here:

[codebox]

dimBlock.x = CUDA_THREADSPERBLOCK;

dimBlock.y = 1;

dimBlock.z = 1;

dimGrid.x = ( NSAMPLES + dimBlock.x - 1 ) / dimBlock.x;

dimGrid.y = 1;

dimGrid.z = 1;

kernel_counter<<<dimGrid, dimBlock>>>(counter_dev, sample_dev);

static global void kernel_counter(int *counter, int *sample_dev)

isample = blockDim.x * blockIdx.x + threadIdx.x;

if ( isample >= NSAMPLES ) return;

if ( sample_dev[isample] == 1 ) *counter = *counter + 1;

return;

}[/codebox]

Uncle_Joe · January 5, 2010, 1:17am

OK, I see your problem is pretty simple, at least with the simple code.

*counter = *counter + 1

This is clearly a race condition: multiple threads can read the same counter value, add 1, and write back the same counter value. Since every 32 threads (warp size) operate in lock step, that

would explain why you’re missing so many counts.

In || programming, you usually want to avoid threads writing to the same variable, which is a bottleneck. A very common ||

computation is reduction, which exploits the fact that most computations are associative and can be privatized and done in ||:

((a + b) + c) + d = (a + b) + (c + d)

MakeList(a, b, c) = Concat(MakeList(a, c), MakeList(b)) (assuming you add same elements to list c)

I suggest you read Mark Harris’s report on reduction from the CUDA SDK. Basically, your want the following code:

__global__ void Count(uint *block_counts, uint *in, uint size)

{

  uint local_count = 0;	 // thread local

	  grid_threads = blockDim.x * gridDim.x;

  for (i = threadIdx.x; i < size; i += grid_threads)	  // each thread processes multiple elements - less overhead

  {

	if (in[i] == 1)

	  ++local_count;

  }

__shared__ uint counts[CUDA_THREADSPERBLOCK]

  counts[threadIdx.x] = local_count;

block_counts[blockidx.x] = Reduce(counts, CUDA_THREADSPERBLOCK);

}

Then you would launch another kernel to reduce the gridDim.x elements in block_counts

ekroemer · January 8, 2010, 9:49am

Thanks for your answer, the keywords and the literature, I’ll look it up.

For now I’ve rewritten the kernel to only collect the data and add it up, afterwards, on CPU. Together with a bit of re-shuffling of the nested loops on the CPU, total runtime on my reference example is down from 1000s to 90s.

Having just recently realized that you can use more than one kernel in one *.cu, 'll try to identify another part of CPU-work to parallelize…

Topic		Replies	Views
slowdown modifying kernel return variable CUDA Programming and Performance	2	2592	December 28, 2009
memory operation inside kernel simultaneous using of variable inside the kernel function CUDA Programming and Performance	2	3104	June 11, 2011
Data scatter problem CUDA Programming and Performance	3	1165	August 19, 2009
device global memory update questions CUDA Programming and Performance	7	5876	April 20, 2009
shared memory problem usage in variables CUDA Programming and Performance	8	2473	September 22, 2010
Simple Question threads adding on the same variable CUDA Programming and Performance	4	2859	October 23, 2009
simple cdot implementation beginner question CUDA Programming and Performance	4	5811	August 21, 2007
How to deal with multiple threads writing to the same GPU memory location? CUDA Programming and Performance	3	5926	October 31, 2008
Number of threads affecting answer; this should not happen a VERY strange error.. CUDA Programming and Performance	8	2494	July 17, 2009
write to global memory from multiple threads and racing conditions CUDA Programming and Performance	3	3308	April 26, 2009

need help in designing kernel problems with different threads modifying the same return variable

Related topics