Variable Number of Results

My problem is that I am launching about 10^6 threads; they each check for some condition, and only about 100 of them end up writing the value they computed.

A simple way would be to have every thread write a one or a zero into a 10^6-entry array, perform a scan on it to compute the output offset for each writing thread, and then do the writes to those offsets.
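For what it's worth, here is a minimal sketch of that scan approach, assuming the flags and values already live in device memory (the Thrust library is one way to get the scan; all names here are placeholders, not anything from your code):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// each flagged thread copies its value to the slot the scan assigned it
__global__ void scatter( const int *flags, const int *offsets,
                         const float *values, float *out, int n )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < n && flags[i] )
		out[ offsets[i] ] = values[i];
}

// flags[i] holds the 1 or 0 thread i wrote; values[i] holds its result
void compact( thrust::device_vector<int> &flags,
              thrust::device_vector<float> &values,
              thrust::device_vector<float> &out )
{
	int n = flags.size();
	thrust::device_vector<int> offsets( n );
	thrust::exclusive_scan( flags.begin(), flags.end(), offsets.begin() );
	scatter<<< (n + 255) / 256, 256 >>>(
		thrust::raw_pointer_cast( flags.data() ),
		thrust::raw_pointer_cast( offsets.data() ),
		thrust::raw_pointer_cast( values.data() ),
		thrust::raw_pointer_cast( out.data() ), n );
}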

Another option, which I am not very sure of, is to pre-declare a buffer of about 1000 values, keep a pointer to the tail of the buffer, and use atomicAdd to increment it; the value just after the increment tells the thread where it has to put its value.

Let's say this is the code:

if ( some_condition ) {
	atomicAdd( pointer_address, 1 );
	buffer[*pointer_address] = value_computed;   // but which slot did *my* increment reserve?
}

Let's say thread 1000 starts executing this bit of code at clock cycle x and thread 2000 starts at cycle x+1.

Obviously thread 1000's atomicAdd would lock access to the memory location pointer_address until it finishes incrementing it.

Now, will the next atomic instruction, issued by thread 2000 at cycle x+1, lock the memory first, or will thread 1000 be able to read the global address (pointer_address) in time to compute the correct address to store its value into?

Is there some way of passing the value read by atomicAdd back to the thread?

Will this work, or is there another way to make it work?

Also, I would like to know whether the GPU has some form of in-order commit guarantee. Let's say I do a simple write to position A at clock cycle x; will a read of the same position A at cycle x+1 be deterministic, i.e. will it read the value written the previous cycle? In essence, my question is whether there can be race conditions even when instructions are issued at different clock cycles, or whether there is some form of re-order-buffer-like structure that prevents such RAW hazards. I guess this is kind of useless for the programmer to know, since blocks don't execute in any particular order on the SMs, but I am just curious.

P.S. I did try to search the forum, but all it said was that this is possible using global atomics; no implementation was discussed.

You want to use the return value of atomicAdd, which is the value at the address just before the increment:

if ( some_condition ) {
	unsigned int current_address = atomicAdd( pointer_address, 1 );   // returns the value *before* the increment
	buffer[current_address] = value_computed;
}
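For completeness, here is a sketch of the host side (assuming a float buffer of 1000 entries, as in your description; my_kernel stands in for whatever kernel contains the snippet above): the counter must be zeroed before launch, and its final value tells you how many results came back.

unsigned int *pointer_address;
float *buffer;
cudaMalloc( (void**)&pointer_address, sizeof(unsigned int) );
cudaMalloc( (void**)&buffer, 1000 * sizeof(float) );
cudaMemset( pointer_address, 0, sizeof(unsigned int) );   // tail starts at slot 0

my_kernel<<< grid, block >>>( pointer_address, buffer );

unsigned int num_results;
cudaMemcpy( &num_results, pointer_address, sizeof(unsigned int),
            cudaMemcpyDeviceToHost );                     // how many slots were claimed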

If your ratio really is 10^6 threads to 100 writes, then the performance of this method should be fine, since the number of threads colliding on the atomic op will be small.

If I understand it correctly, __threadfence (coming in CUDA 2.2) will give you the ordering guarantee you are asking about.
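The pattern it is intended for looks something like this (a sketch; the flag/result names are made up): the producing thread writes its data, fences, and only then raises a flag, so any thread that sees the flag set is guaranteed to see the data too.

__device__ int result;              // the data being published
__device__ volatile int flag = 0;   // readers poll this

__global__ void producer( int value )
{
	if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
		result = value;    // 1. write the data
		__threadfence();   // 2. make the write visible device-wide
		flag = 1;          // 3. only then raise the ready flag
	}
}

A reader that polls flag and sees 1 is then guaranteed to read the up-to-date result; without the fence, the two writes could become visible to other threads in either order.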

Hey, thanks for the info. I rather foolishly missed the fact that the atomic functions return a value.

Let's say the writes number more than 100, say 1000; wouldn't initializing 10 such buffers speed things up, since there would be less contention on the counter?
And would some sort of hierarchical approach help, as in keeping a small buffer in shared memory per block, so that the global memory counter is touched only once per block?

Both of those suggestions are reasonable. See the histogram SDK projects for examples of programs that do exactly that.
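The hierarchical version looks roughly like this (a sketch, untested; the condition and value computation are placeholders): each thread appends into a shared-memory staging buffer behind a shared counter, then the block reserves one contiguous chunk of the global buffer with a single atomicAdd. Note that shared-memory atomics require compute capability 1.2 or later.

__global__ void collect( const float *in, int n,
                         unsigned int *global_count, float *out )
{
	__shared__ float        local_buf[256];   // staging buffer (assumes blockDim.x <= 256)
	__shared__ unsigned int local_count;      // tail of the staging buffer
	__shared__ unsigned int global_base;      // this block's slice of the global buffer

	if ( threadIdx.x == 0 )
		local_count = 0;
	__syncthreads();

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < n && in[i] > 0.0f ) {            // stand-in for some_condition
		unsigned int slot = atomicAdd( &local_count, 1 );   // cheap shared-memory atomic
		local_buf[slot] = in[i];              // stand-in for value_computed
	}
	__syncthreads();

	// one global atomic per block instead of one per writing thread
	if ( threadIdx.x == 0 )
		global_base = atomicAdd( global_count, local_count );
	__syncthreads();

	// cooperatively flush the block's results to the global buffer
	for ( unsigned int j = threadIdx.x; j < local_count; j += blockDim.x )
		out[ global_base + j ] = local_buf[j];
}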