Thread Synchronisation in parallel array write

Hello all,
I posted about this problem before. At the time I thought I had the answer, but I have now realised I am still getting incorrect results.

To explain my problem: I have a kernel function where threads concurrently write into several vectors. Each thread should append to the end of a particular vector. The write operation must be exclusive, so I used atomicAdd() to obtain a unique write index for each thread.

Each “case” in the “switch” block represents a write to one of the vectors. Each thread obtains an index for itself using atomicAdd() and then writes to that index.

The problem I am facing is that, after the kernel executes, not all of the n threads have written to the arrays.
If I launch 3000 threads, only about 2700 of them write. I found this by adding up the counts of each vector (the dc_counts variable is used for this purpose).
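
For reference, the host-side check I use is roughly the following (a simplified sketch; dc_counts matches the kernel parameter, the other names are just illustrative):

int h_counts[3];
cudaMemcpy(h_counts, dc_counts, 3 * sizeof(int), cudaMemcpyDeviceToHost);    // copy the per-vector counts back
printf("total writes = %d\n", h_counts[0] + h_counts[1] + h_counts[2]);      // reports roughly 2700 instead of 3000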

The code for my kernel function is given below:

template <typename T> __global__ void ClusterFormation(point *d_p,int *d_cnum, int k, T *d_c1, T *d_c2, T *d_c3, int *dc_counts)
{
int id=blockIdx.x * blockDim.x + threadIdx.x;

count1=count2=count3=0;                          // __device__ int variables
dc_counts[0]=dc_counts[1]=dc_counts[2]=0;
int *c1_val=&count1, *c2_val=&count2, *c3_val=&count3;

int num=d_cnum[id];
__syncthreads();
switch(num)
{
case 1:
{
	int i1=atomicAdd(c1_val,1);
	d_c1[i1]=d_p[id];
	if(*c1_val > dc_counts[0])            // Thread with highest count will write into dc_counts variable
		dc_counts[0]=*c1_val;
	break;
}
case 2:
{
	int i2=atomicAdd(c2_val,1);
	d_c2[i2]=d_p[id];
	if(*c2_val > dc_counts[1])
		dc_counts[1]=*c2_val;
	break;
}
case 3:
{
	int i3=atomicAdd(c3_val,1);
	d_c3[i3]=d_p[id];
	if(*c3_val > dc_counts[2])
		dc_counts[2]=*c3_val;
	break;
}
}
}

The code for my kernel launch is given below:

n=3200;
tnum=400;
ClusterFormation<<<n/tnum , tnum>>>(d_p, d_cnum,k,d_c1, d_c2, d_c3,dc_counts);

Kindly help me; I get correct results for up to 3000 elements, but beyond that the code simply does not work as expected.

Also, let me know whether the graphics card I am using could be the cause of my problem. I am using a GeForce 820M.

This is a bad idea:

count1=count2=count3=0;                          // __device__ int variables
dc_counts[0]=dc_counts[1]=dc_counts[2]=0;

Initialize those device variables in host code before calling the kernel, and get rid of those lines in your kernel code. This was already pointed out to you here:

https://devtalk.nvidia.com/default/topic/1000195/cuda-programming-and-performance/synchronour-array-write/post/5111145/#5111145

You don’t seem to have followed that advice.

And if you are on Windows, be sure that the WDDM timeout mechanism is not causing a problem.

If you’re still having trouble after that, my suggestion, if you want help, is to provide a complete code that someone else could run to see the issue.

Respected txbob,
I definitely followed that advice.
I tried initializing those variables in host code. It shows the following warnings:

warning: a __device__ variable "count3" cannot be directly written in a host function
warning: a __device__ variable "count2" cannot be directly written in a host function
warning: a __device__ variable "count1" cannot be directly written in a host function

Also, will it be okay if I try the following inside the kernel:

if(id==0)            // This will enable only a single thread to access these variables.
{
	count1=count2=count3=0;
	dc_counts[0]=dc_counts[1]=dc_counts[2]=0;
}

You’re getting those warnings because you are doing it wrong. Use the cudaMemcpyToSymbol function to initialize __device__ or __constant__ variables from host code.
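
As a rough sketch (assuming count1, count2 and count3 are file-scope __device__ int variables, as the comment in your kernel suggests, and that dc_counts comes from cudaMalloc), the host-side initialization could look something like this:

// at file scope, as in your existing code
__device__ int count1, count2, count3;

// host code, run before every launch of ClusterFormation
int zero = 0;
cudaMemcpyToSymbol(count1, &zero, sizeof(int));    // reset each __device__ counter from the host
cudaMemcpyToSymbol(count2, &zero, sizeof(int));
cudaMemcpyToSymbol(count3, &zero, sizeof(int));
cudaMemset(dc_counts, 0, 3 * sizeof(int));         // dc_counts is a normal device allocation, so cudaMemset is fine

ClusterFormation<<<n/tnum, tnum>>>(d_p, d_cnum, k, d_c1, d_c2, d_c3, dc_counts);

The matching call for reading the counters back on the host after the kernel finishes is cudaMemcpyFromSymbol.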

And your proposed solution of using (id==0) won’t help. You don’t understand the nature of CUDA block execution. Not all your threads are executing in lockstep. Some (blocks) start early, some start much later. Those blocks that start later will reset those device variables to zero, even after some earlier blocks have already started incrementing them. Your (id==0) change presumes that block 0 executes first. The CUDA programming model does not guarantee that. So if it doesn’t execute first, then some blocks will be incrementing uninitialized values, and then thread 0 will come along and reset them to zero. Still broken.

Again, if you want help, provide a complete code, as described here:

https://devtalk.nvidia.com/default/topic/999733/cuda-programming-and-performance/function-only-works-correctly-with-cuda-8-0-in-release-mode-/post/5109126/#5109126

Thank you so much for the guidance, sir.
Being a beginner, I have a limited understanding of CUDA and could not think of the aspects you listed above.
I will definitely try cudaMemcpyToSymbol and let you know the results.