Updating a global array from multiple threads/blocks

Hi there,

Please suggest the proper way to do this.

I have a global array A[N] and a global int counter variable count.

Now, I can have many threads, and those threads are divided into blocks.

Each thread computes and returns a value that I should put in the global array A I defined.

So inside each thread I will have this line (where value is the thread's result):

A[count++] = value;

Now my questions are:

  1. Is this line thread-safe/block-safe? I.e., is it certain that count++ will be done by only one thread at a time?
  2. Is this the right way to approach this kind of problem?

Also, my array size can be very large.


Such a global counter is completely, utterly thread-unsafe: count++ is a non-atomic read-modify-write, so concurrent threads will overwrite each other's updates. It’s worse than useless.

Using an atomic increment would work, but it is a terrible and unsalvageable idea for performance reasons.
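To see why the plain count++ races while an atomic increment does not, here is a host-side C++ analogy (not CUDA code; the function name is made up for illustration). With a plain int, some increments would be silently lost; std::atomic's fetch_add is the host-side stand-in for CUDA's atomicAdd and never loses one:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side analogy: many threads bumping one shared counter.
// A plain "count++" is a read-modify-write and loses updates under
// contention; fetch_add makes each increment indivisible.
int atomic_count(int num_threads, int increments_per_thread) {
    std::atomic<int> count{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                count.fetch_add(1);   // atomic: no increments are lost
        });
    for (auto& w : workers) w.join();
    return count.load();
}
```

Correct, but every thread serializes on the same counter, which is exactly the performance problem described above.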

But the solution you need is pretty straightforward since you said each thread computes one value.

Use the thread’s ID to determine what data it computes and where it puts the result.

This also gives you perfect memory coalescing for free, and memory access is often the biggest bottleneck.

Something like:

int index = threadIdx.x + blockIdx.x * blockDim.x;

if (index < maxN) {

	int val = myFunkyFunction();

	A[index] = val;

}

If your maximum N is very large (such that it’s impractical to launch that many blocks), you can extend the above idea with a grid-stride loop inside the kernel, so that each thread does more than one computation, something like:

for (int index = threadIdx.x + blockIdx.x * blockDim.x; index < maxN; index += gridDim.x * blockDim.x) {

	int val = myFunkyFunction();

	A[index] = val;

}


Yeah, it seems so.

I’m sorry for the confusion; I should have expressed my problem more clearly.

Each of my threads generates at most 5 values and at least 0, but I will only know how many values a thread generates at run time, i.e. inside the thread itself; yet I need to pass the array size at the beginning.

Initially I was allocating 5 int slots for each thread and, as you said, using the thread index for offsetting. But that left a lot of memory unused, since many of my threads happen to produce no output (0 values), only a few produce 5 values, and the rest produce 2, 3, or 4.

So I decided to call two kernels: the first gives me the size of the array I need, using some pre-computation, and the second fills that array. But my problem has been how to find the appropriate index in the array where each thread's data should go.

What I am planning to do right now is this:

For example, suppose the first kernel tells me my array size should be 21, with the value counts produced by each thread being:

0 0 0 4 5 0 0 2 3 4 3

Then in my second kernel, I will allocate a global array of size 21 and pass the above counts as an array that each thread can use to compute its offset.

I mean, is this the right way to go? Surely there must be some more elegant way to deal with this problem?
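The offset computation described above is an exclusive prefix sum over the per-thread counts. A host-side C++ sketch (names are illustrative; on the GPU this scan would itself be a kernel or a Thrust call):

```cpp
#include <vector>

// Given the per-thread value counts from the first kernel, an exclusive
// prefix sum yields the index where each thread's first value goes in
// the packed output array.
std::vector<int> exclusive_prefix_sum(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        // thread i writes at offsets[i] .. offsets[i] + counts[i] - 1
        offsets[i] = running;
        running += counts[i];
    }
    return offsets;
}
```

For the example counts 0 0 0 4 5 0 0 2 3 4 3, this gives offsets 0 0 0 0 4 9 9 9 11 14 18, and the last offset plus the last count recovers the total array size of 21.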

I don’t think we can dynamically allocate memory in CUDA, can we?

Then what you need to do is a compaction: reducing a fixed-size array (with holes) into a shorter array without holes.
It’s done using a prefix-sum step to compute the destination index of each element. It can become more complex if you push for optimal efficiency by doing staged writes to shared memory and then block copies from shared to device memory.
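A host-side C++ sketch of that compaction step, assuming zeros mark the holes (the function name and hole convention are made up for illustration; on the GPU the scan and scatter would each be parallel):

```cpp
#include <vector>

// Stream compaction: squeeze a fixed array with "holes" (zeros here)
// into a dense array. A prefix sum over the keep-flags gives each
// surviving element its destination index; each element can then be
// scattered independently, which is what makes this GPU-friendly.
std::vector<int> compact_nonzero(const std::vector<int>& in) {
    std::vector<int> flags(in.size()), dest(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flags[i] = (in[i] != 0) ? 1 : 0;      // 1 = keep this element
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {  // exclusive scan of flags
        dest[i] = running;
        running += flags[i];
    }
    std::vector<int> out(running);
    for (std::size_t i = 0; i < in.size(); ++i)    // scatter the survivors
        if (flags[i]) out[dest[i]] = in[i];
    return out;
}
```

The two sequential loops here are exactly the parts that a GPU implementation (or Thrust) replaces with a parallel scan and a parallel scatter.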

But if you don’t want the gritty details and implementation tweaks, you can use the Thrust library for the compaction step, which will likely give you decent performance.

For testing, the atomicAdd(&index, 1) method will work fine and takes only one line of code. It’s very, very inefficient, but easy.
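For completeness, here is a host-side C++ analogy of that quick-and-dirty pattern (std::atomic standing in for atomicAdd; names are illustrative): each thread atomically reserves a slot in the output, then writes its value there.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker reserves an output slot by atomically bumping a shared
// index, then writes its value there. Correct, but every write serializes
// on the counter and the output order is nondeterministic -- fine for
// testing, bad for performance.
std::vector<int> atomic_append(const std::vector<int>& values, int capacity) {
    std::vector<int> out(capacity);
    std::atomic<int> index{0};
    std::vector<std::thread> workers;
    for (int v : values)
        workers.emplace_back([&out, &index, v] {
            int slot = index.fetch_add(1);  // host stand-in for atomicAdd(&index, 1)
            out[slot] = v;
        });
    for (auto& w : workers) w.join();
    out.resize(index.load());               // final counter = number of values written
    return out;
}
```

Note that the result contains all the values but in no particular order, which is another reason the prefix-sum approach (deterministic destinations) is preferable for the real kernel.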