Synchronous array write

Hello everyone,

I want to write code where all threads append values to the end of an array.
But the problem I am facing is synchronisation.
A few threads are overwriting the same index.
Can anybody suggest a way to append values to an array without overwriting at the same index?

My code is:

count=0;                           //__device__ int, shared by all threads
int index=atomicAdd(&count,1);     //atomicAdd takes a pointer to the counter
array[index]=some value;           //I want this instruction to be atomic.

Roughly speaking, what you have outlined should work. But you would need to provide a more complete example in order to see if the code you infer from that is the code I would infer from that.
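For reference, here is a minimal complete sketch of that append pattern (names like append_kernel and the buffer size are my assumptions, not the poster's code). The key point is that atomicAdd returns the counter's old value, so every thread gets a distinct slot:

```cuda
#include <cstdio>

__device__ int count = 0;        // global append counter, zero at module load
__device__ int array[512];       // destination buffer (capacity >= thread count)

__global__ void append_kernel()
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int index = atomicAdd(&count, 1);   // atomicAdd returns the OLD value,
    array[index] = id;                  // so each thread writes a unique slot
}

int main()
{
    append_kernel<<<2, 256>>>();
    cudaDeviceSynchronize();

    int n = 0;
    cudaMemcpyFromSymbol(&n, count, sizeof(int));
    printf("appended %d values\n", n);  // should equal the total thread count
    return 0;
}
```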

The actual code I am trying is the following:

count1=count2=count3=0;                    //These are __device__ int variables (shared for all threads)
int *c1_val=&count1, *c2_val=&count2, *c3_val=&count3;

switch(num)                                //Here num is a parameter which will decide the case
{
case 1:
	int i1=atomicAdd(c1_val,1);
	array1[i1]=some value;						
        break;

case 2:
	int i2=atomicAdd(c2_val,1);
	array2[i2]=some value;
	break;

case 3:
	int i3=atomicAdd(c3_val,1);
	array3[i3]=some value;
	break;
}
dc_counts[0]=*c1_val;				
dc_counts[1]=*c2_val;
dc_counts[2]=*c3_val;

If I launch 500 threads into this kernel, the sum of all counts should finally converge to 500, but for some reason most of the time it is less than 500, e.g. 489 or 465.
This shows that some of the threads are definitely accessing the same location.

I want the cases in this code to be atomic.

most of your code looks OK, but this looks like it might be problematic without appropriate synchronization:

dc_counts[0]=*c1_val;				
dc_counts[1]=*c2_val;
dc_counts[2]=*c3_val;

another possibility is that num doesn’t match any of your cases; that would result in a “converged” count less than the number of threads.

again, you’ve omitted a lot of stuff that could cause your code to break.
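One race-free alternative (a sketch, assuming the counters are the __device__ ints count1/count2/count3 from the posted code) is to not write dc_counts inside the kernel at all, and instead read the final counter values from the host after the kernel finishes:

```cuda
#include <cstdio>

__device__ int count1, count2, count3;  // assumed zeroed before launch

__global__ void classify_kernel(/* ... */)
{
    // each case does its atomicAdd and array write as before,
    // but no thread ever touches dc_counts
}

int main()
{
    classify_kernel<<<1, 500>>>();
    cudaDeviceSynchronize();   // all increments are complete after this point

    int counts[3];
    cudaMemcpyFromSymbol(&counts[0], count1, sizeof(int));
    cudaMemcpyFromSymbol(&counts[1], count2, sizeof(int));
    cudaMemcpyFromSymbol(&counts[2], count3, sizeof(int));
    printf("%d %d %d (sum %d)\n", counts[0], counts[1], counts[2],
           counts[0] + counts[1] + counts[2]);
    return 0;
}
```

Because cudaDeviceSynchronize() waits for every block to finish, the host reads cannot race with any in-flight atomicAdd.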

What I suggest if you want debugging assistance is this:

https://devtalk.nvidia.com/default/topic/999733/cuda-programming-and-performance/function-only-works-correctly-with-cuda-8-0-in-release-mode-/post/5109126/#5109126

Respected Mr txbob,
Thank you so much for highlighting the dc_counts synchronization problem.
I guess that was the only problem.
My code generates correct output now that I have moved the counts into the cases.

case 1:
	int i1=atomicAdd(c1_val,1);
	d_c1[i1]=d_p[id];					
	if(*c1_val > dc_counts[0])
		dc_counts[0]=*c1_val;            //Only the thread with the highest value of c1_val should change the dc_counts variable.
	break;

I’m not sure that code is correct either. The read of *c1_val, the comparison, and the write to dc_counts[0] are three separate operations, so another thread can interleave between them. You don’t seem to understand inter-thread race conditions.
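If the max-tracking really has to happen inside the kernel, atomicMax collapses the read-compare-write into a single atomic operation. A sketch of case 1 only, with assumed names (case1_sketch, buffer sizes) around the poster's d_c1/d_p/dc_counts:

```cuda
#include <cstdio>

__device__ int count1 = 0;
__device__ int dc_counts[3];

__global__ void case1_sketch(int *d_c1, const int *d_p)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int i1 = atomicAdd(&count1, 1);      // reserve a unique slot
    d_c1[i1] = d_p[id];
    // i1 + 1 is the count produced by this thread's own increment;
    // atomicMax does the compare-and-store as one indivisible operation
    atomicMax(&dc_counts[0], i1 + 1);
}

int main()
{
    int *d_c1, *d_p;
    cudaMalloc(&d_c1, 500 * sizeof(int));
    cudaMalloc(&d_p, 500 * sizeof(int));
    cudaMemset(d_p, 0, 500 * sizeof(int));

    case1_sketch<<<1, 500>>>(d_c1, d_p);
    cudaDeviceSynchronize();

    int n;
    cudaMemcpyFromSymbol(&n, dc_counts, sizeof(int));
    printf("dc_counts[0] = %d\n", n);    // should be 500 here
    cudaFree(d_c1); cudaFree(d_p);
    return 0;
}
```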

The initialization of the counters is also a problem. You can’t initialize them to zero inside the kernel; you need to initialize them from the host or from a PREVIOUS, serialized kernel to make sure they’re 0 when all blocks start.
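A sketch of the host-side initialization using cudaMemcpyToSymbol, assuming the counters are the __device__ ints count1/count2/count3 from the posted kernel (reset_counters is a hypothetical helper name):

```cuda
__device__ int count1, count2, count3;

void reset_counters()
{
    int zero = 0;
    // runs on the host, before the kernel launch,
    // so every block is guaranteed to start from 0
    cudaMemcpyToSymbol(count1, &zero, sizeof(int));
    cudaMemcpyToSymbol(count2, &zero, sizeof(int));
    cudaMemcpyToSymbol(count3, &zero, sizeof(int));
}
```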

You can’t use __syncthreads() to get around the problem either, since your counters are device-wide (shared by all blocks), not per-block.