Updating a global array from multiple threads/blocks

Hi there,

Please suggest the proper way to do this.

I have a global array A[N] and a global int counter variable count.

Now, I can have many threads, and those threads are divided into blocks.

Each thread computes and returns a value that I should put in the global array A I defined.

So inside each thread I will have this line (where value is the thread's result):

A[count++] = value;

Now my questions are:

  1. Is this line thread-safe/block-safe? I.e., is it certain that count++ will be done by only one thread at a time?
  2. Is this the right way to approach this kind of problem?

Also, my array size can be very large.


Such a global counter is completely, utterly thread-unsafe: count++ is a non-atomic read-modify-write, so concurrent threads will overwrite each other's updates. It’s worse than useless.

Using an atomic increment would work, but it is a terrible and unsalvageable idea for performance reasons.
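To see why the plain count++ races while an atomic increment does not, here is a host-side C++ analogy (not CUDA code; the function name is made up for illustration). With a plain int, some increments would be silently lost; std::atomic's fetch_add is the host-side stand-in for CUDA's atomicAdd and never loses one:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side analogy: many threads bumping one shared counter.
// A plain "count++" is a read-modify-write and loses updates under
// contention; fetch_add makes each increment indivisible.
int atomic_count(int num_threads, int increments_per_thread) {
    std::atomic<int> count{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                count.fetch_add(1);   // atomic: no increments are lost
        });
    for (auto& w : workers) w.join();
    return count.load();
}
```

Correct, but every thread serializes on the same counter, which is exactly the performance problem described above.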

But the solution you need is pretty straightforward since you said each thread computes one value.

Use the thread’s ID to determine what data it computes and where it puts the result.

This also gives you perfect memory coalescing for free, and memory access is often the biggest bottleneck.

Something like:

int index = threadIdx.x + blockIdx.x * blockDim.x;

if (index < maxN) {

	int val = myFunkyFunction();

	A[index] = val;

}

If your maximum N is very large (such that it’s impractical to launch that many blocks), you can extend the above idea with a grid-stride loop inside the kernel, so that each thread does more than one computation, something like:

for (int index = threadIdx.x + blockIdx.x * blockDim.x; index < maxN; index += gridDim.x * blockDim.x) {

	int val = myFunkyFunction();

	A[index] = val;

}


Yeah, it seems so.

I’m sorry for the confusion; I should have expressed my problem more clearly.

Each of my threads generates at most 5 values and at least 0, but I will only know how many values a thread generates at run time, i.e. inside the thread itself; yet I need to pass the array size at the beginning.

Initially I was allocating 5 int slots for each thread and, as you said, using the thread index for offsetting. But that left a lot of memory unused, since many of my threads happen to produce no output (0 values), only a few produce 5 values, and the rest produce 2, 3, or 4.

So I decided to call two kernels: the first gives me the size of the array I need, using some pre-computation, and the second fills that array. But my problem has been how to find the appropriate index in the array where each thread's data should go.

What I am planning to do right now is this:

For example, suppose the first kernel tells me my array size should be 21, with the value counts produced by each thread being:

0 0 0 4 5 0 0 2 3 4 3

Then in my second kernel, I will allocate a global array of size 21 and pass the above counts as an array that each thread can use to compute its offset.

I mean, is this the right way to go? Surely there must be some more elegant way to deal with this problem?
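The offset computation described above is an exclusive prefix sum over the per-thread counts. A host-side C++ sketch (names are illustrative; on the GPU this scan would itself be a kernel or a Thrust call):

```cpp
#include <vector>

// Given the per-thread value counts from the first kernel, an exclusive
// prefix sum yields the index where each thread's first value goes in
// the packed output array.
std::vector<int> exclusive_prefix_sum(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        // thread i writes at offsets[i] .. offsets[i] + counts[i] - 1
        offsets[i] = running;
        running += counts[i];
    }
    return offsets;
}
```

For the example counts 0 0 0 4 5 0 0 2 3 4 3, this gives offsets 0 0 0 0 4 9 9 9 11 14 18, and the last offset plus the last count recovers the total array size of 21.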

I don’t think we can dynamically allocate memory in CUDA, can we?

Then what you need to do is a compaction: reducing a fixed-size array (with holes) into a shorter array without holes.
It’s done using a prefix-sum step to compute the destination index of each element. It can become more complex if you push for optimal efficiency by doing staged writes to shared memory and then block copies from shared to device memory.
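A host-side C++ sketch of that compaction step, assuming zeros mark the holes (the function name and hole convention are made up for illustration; on the GPU the scan and scatter would each be parallel):

```cpp
#include <vector>

// Stream compaction: squeeze a fixed array with "holes" (zeros here)
// into a dense array. A prefix sum over the keep-flags gives each
// surviving element its destination index; each element can then be
// scattered independently, which is what makes this GPU-friendly.
std::vector<int> compact_nonzero(const std::vector<int>& in) {
    std::vector<int> flags(in.size()), dest(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flags[i] = (in[i] != 0) ? 1 : 0;      // 1 = keep this element
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {  // exclusive scan of flags
        dest[i] = running;
        running += flags[i];
    }
    std::vector<int> out(running);
    for (std::size_t i = 0; i < in.size(); ++i)    // scatter the survivors
        if (flags[i]) out[dest[i]] = in[i];
    return out;
}
```

The two sequential loops here are exactly the parts that a GPU implementation (or Thrust) replaces with a parallel scan and a parallel scatter.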

But if you don’t want the gritty details and implementation tweaks, you can use the Thrust library for the compaction step, which will likely give you decent performance.

For testing, the atomicAdd(&index, 1) method will work fine and takes only one line of code. It’s very, very inefficient, but easy.
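For completeness, here is a host-side C++ analogy of that quick-and-dirty pattern (std::atomic standing in for atomicAdd; names are illustrative): each thread atomically reserves a slot in the output, then writes its value there.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker reserves an output slot by atomically bumping a shared
// index, then writes its value there. Correct, but every write serializes
// on the counter and the output order is nondeterministic -- fine for
// testing, bad for performance.
std::vector<int> atomic_append(const std::vector<int>& values, int capacity) {
    std::vector<int> out(capacity);
    std::atomic<int> index{0};
    std::vector<std::thread> workers;
    for (int v : values)
        workers.emplace_back([&out, &index, v] {
            int slot = index.fetch_add(1);  // host stand-in for atomicAdd(&index, 1)
            out[slot] = v;
        });
    for (auto& w : workers) w.join();
    out.resize(index.load());               // final counter = number of values written
    return out;
}
```

Note that the result contains all the values but in no particular order, which is another reason the prefix-sum approach (deterministic destinations) is preferable for the real kernel.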