All threads do the same thing: every one of them adds 1 to ‘var’ and then prints it. Because the addition is an atomic operation and all of the additions finish before the printing, the value of ‘var’ at print time is 10. To get the output you want, both the addition and the printing would have to happen as one atomic operation, which, as far as I know, cannot be done simply (though you could consider implementing some kind of locking). That output could also be generated by printing the thread index plus one, couldn’t it?
I suppose that ‘Ind’ is meant to be incremented every time the above condition is fulfilled. If N threads satisfy the ‘if’, ‘Ind’ is incremented by 1, N times. Say ‘Ind’ has the value 3 before the ‘if’. Every thread that enters it starts from that same value. Assume that N = 7 threads enter the ‘if’. After the calls to ‘atomicAdd’, the value of ‘Ind’ will be:
Ind = Ind + N * 1 = 3 + 7 * 1 = 10. That is the value the printf shows, and it is the value ‘Ind’ holds after the ‘if’ has executed.
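One detail that may help with the arithmetic above: atomicAdd returns the value the variable held before the addition, so each thread can keep its own distinct number even though the final value of ‘Ind’ is the same for everyone. The following is only a sketch under assumptions: the kernel name ‘reserve’, the helper ‘run_reserve’, the ‘slots’ array and the ‘tid < n_pass’ test are made up for illustration and stand in for the real condition.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int Ind = 3;                 // starting value taken from the example above

// Each thread that passes the (placeholder) condition stores what atomicAdd returned.
__global__ void reserve(int *slots, int n_pass)
{
    int tid = threadIdx.x;              // assumes a one-dimensional block
    slots[tid] = -1;                    // -1 marks "did not enter the if"
    if (tid < n_pass)                   // hypothetical stand-in for the real 'if'
        slots[tid] = atomicAdd(&Ind, 1);    // returns the value BEFORE the add
}

// Hypothetical host helper: runs one block of 16 threads and returns the final Ind.
int run_reserve(int *h_slots, int n_pass)
{
    int *d_slots = nullptr;
    int start = 3, final_ind = 0;
    cudaMemcpyToSymbol(Ind, &start, sizeof(int));   // reset Ind to 3 for every run
    cudaMalloc(&d_slots, 16 * sizeof(int));
    reserve<<<1, 16>>>(d_slots, n_pass);
    cudaDeviceSynchronize();
    cudaMemcpy(h_slots, d_slots, 16 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&final_ind, Ind, sizeof(int));
    cudaFree(d_slots);
    return final_ind;
}

int main()
{
    int h_slots[16];
    int final_ind = run_reserve(h_slots, 7);
    printf("Ind after kernel = %d\n", final_ind);   // 3 + 7 * 1 = 10
    for (int i = 0; i < 7; ++i)                     // distinct values from 3..9, order not defined
        printf("thread %d got %d\n", i, h_slots[i]);
    return 0;
}
```

Which thread receives which number is not defined, only that the seven returned values are all different and that ‘Ind’ ends at 10.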
I’m not quite sure what this code is for, but it doesn’t matter. I think that using a global variable in such a manner is not a good idea. Maybe consider using shared memory?
My trouble is that I must copy into VettRis, at DIFFERENT POSITIONS INDICATED BY “Ind”, only the values that pass the if condition, but the threads enter at the same time and all write to the same position (position 0). Can I use semaphores?
To increment ‘Ind’ without data loss or deadlocking, you could copy it to shared memory for each thread entering the ‘if’, synchronize the threads (after the ‘if’), and then sum what you copied to shared memory. For the threads that did not enter the ‘if’, shared memory should be initialized to 0. Do the summing in a local variable first, then ‘atomicAdd’ it to the global ‘Ind’. Here is code that does what I described, to make it clearer what I mean.
#include <stdio.h>

#define THREADS_IN_BLOCK 16
#define BLOCKS_IN_GRID 1

__device__ int Ind = 0;

__global__ void kernel()
{
    __shared__ int s_Ind[THREADS_IN_BLOCK]; // the more, smaller blocks, the better (?)
    int tid = threadIdx.x;                  // only if the block is one-dimensional
    int l_Ind = 0;

    s_Ind[tid] = 0;                         // threads that skip the 'if' contribute 0
    if (tid > -1 && tid < 5)                // do your if... (here 5 threads enter the 'if')
    {
        s_Ind[tid] = Ind;                   // note: the sum below is only right while Ind is 0 here
        // do your work...
        s_Ind[tid] += 1;
    }
    __syncthreads();                        // all contributions are written before summing

    // every thread sums the contributions of the whole block
    for (int i = 0; i < THREADS_IN_BLOCK; ++i) l_Ind += s_Ind[i];
    if (!tid) atomicAdd(&Ind, l_Ind);       // my bad - too much adding was applied here...
    printf("%d: s_Ind[%d] = %d, l_Ind = %d, Ind = %d\n", tid, tid, s_Ind[tid], l_Ind, Ind);
    // s_Ind[tid] = [0 or 1, depending on tid], l_Ind = 5, Ind = l_Ind + Ind = 5
    // (a thread that prints before thread 0's atomicAdd has completed may still show Ind = 0)
}

int main()
{
    cudaSetDevice(0);
    kernel<<<BLOCKS_IN_GRID, THREADS_IN_BLOCK>>>();
    cudaDeviceSynchronize();                // replaces the deprecated cudaThreadSynchronize()
    getchar();
    return 0;
}
tera: Then could you tell me what makes my code unreliable with more than one block (given the code as it was before my correction in the previous post, and beyond the fact that I could have misunderstood something)?
Yes, you can reduce contention for the global variable by having just one thread per block accessing it. Something like this should work for one-dimensional blocks:
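#include <stdio.h>

__device__ int var = 0;

__global__ void kernel()
{
    __shared__ int svar;                    // base value reserved for this block

    // one thread per block reserves blockDim.x consecutive values;
    // atomicAdd returns the value of var before the addition
    if (threadIdx.x == 0)
        svar = atomicAdd(&var, blockDim.x);
    __syncthreads();                        // make svar visible to the whole block

    printf("VAR=%d\n", svar + threadIdx.x);
}

int main()
{
    kernel<<<3,5>>>();
    cudaDeviceSynchronize();                // replaces the deprecated cudaThreadSynchronize()
    return 0;
}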
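The same one-atomic-per-block idea can probably be carried over to the VettRis question from earlier in the thread. The sketch below is only one possible way to do it and makes several assumptions: the kernel name ‘compact’, the helper ‘run_compact’, the input array ‘in’ and the test ‘in[i] > 0’ are placeholders for the real data and the real ‘if’; only ‘VettRis’ and ‘Ind’ come from the thread. Each passing thread takes a slot inside its block with a shared-memory atomicAdd, thread 0 reserves the block's range in ‘Ind’, and every passing thread then writes to its own position.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int Ind = 0;                     // next free position in VettRis

__global__ void compact(const int *in, int n, int *VettRis)
{
    __shared__ int s_count;                 // threads of this block that passed
    __shared__ int s_base;                  // first VettRis slot reserved for this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one-dimensional grid and blocks
    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    bool pass = (i < n) && (in[i] > 0);     // placeholder for the real condition
    int offset = 0;
    if (pass) offset = atomicAdd(&s_count, 1);       // unique slot within the block

    __syncthreads();                        // s_count is final from here on
    if (threadIdx.x == 0) s_base = atomicAdd(&Ind, s_count);  // one global atomic per block
    __syncthreads();                        // s_base is visible to the whole block

    if (pass) VettRis[s_base + offset] = in[i];      // no two threads share a position
}

// Hypothetical host helper: copies the passing values into h_out and returns how many there are.
int run_compact(const int *h_in, int n, int *h_out)
{
    int *d_in = nullptr, *d_out = nullptr;
    int zero = 0, count = 0;
    const int threads = 128;                // assumed block size
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(Ind, &zero, sizeof(int));     // reset the counter for every run
    compact<<<(n + threads - 1) / threads, threads>>>(d_in, n, d_out);
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count, Ind, sizeof(int));
    cudaMemcpy(h_out, d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return count;
}

int main()
{
    int in[8] = {5, -1, 7, 0, -3, 2, 9, -8};
    int out[8];
    int count = run_compact(in, 8, out);
    printf("kept %d values:", count);       // 5, 7, 2 and 9 pass the placeholder test
    for (int k = 0; k < count; ++k) printf(" %d", out[k]);
    printf("\n");
    return 0;
}
```

The values land in VettRis in no particular order, because the order in which threads and blocks perform their atomicAdd is not defined. If the original order matters, a prefix sum over the pass flags would be the usual alternative.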