How to retreive data from eaxh thread?

I have this code :

[codebox]#include <stdio.h>

#include <math.h>

#include <stdlib.h>

device float p[12];

device int counter=0;

device int array_counter[5+1];

// Operation on a single element of a 1D array

global void test0D_cu(int j )

{

int jj,ix;

    ix= threadIdx.x+0;              //   from       0   to  4

    jj = j + ix - 2;                //      "       2       6

    atomicAdd(&counter,1);                          //      increment the (saving)array index

    atomicExch(&array_counter[counter],jj);         //      save for each thread, this count in an array

// printf ( " ix %d array_counter[counter] %d counter %d \n",ix , array_counter[counter] , counter );

}

int main(void)

{

int j,h_array_counter[5+1];

float p1[12];

unsigned char *d_addr = NULL;

cudaGetSymbolAddress( (void**)&d_addr,“p”);

    cudaMemset( d_addr, 0.0f  , sizeof(p));         // reset the device array to 0.0

    cudaGetSymbolAddress( (void**)&d_addr,"array_counter");

    cudaMemset( d_addr, 0  , sizeof(array_counter));         // reset the device array to 0

dim3 gridsize (1);

    dim3 blocksize ( 5);

j = 4; // one test element indices

test0D_cu <<< gridsize,blocksize >>> (j);

cudaMemcpyFromSymbol(p1,p,sizeof(p)); // recover the main array and array_counter

    cudaMemcpyFromSymbol(&h_array_counter,array_counter,sizeof(array_counter));

int ii=1;

    for ( int m = 2 ; m <= 6;  m ++){

    printf ( "ii %d m  %d   p1[jj] %e  %d   \n",ii, m, p1[m],h_array_counter[ii]);

    ii++;

                                    }

}

[/codebox]

Which get these results with :

  1. emulation (what I would like to see)

[codebox]

ii 1 m 2 p1[jj] 0.000000e+00 2

ii 2 m 3 p1[jj] 0.000000e+00 3

ii 3 m 4 p1[jj] 0.000000e+00 4

ii 4 m 5 p1[jj] 0.000000e+00 5

ii 5 m 6 p1[jj] 0.000000e+00 6

[/codebox]

  1. release ( the last collumn is wrong)

[codebox]

ii 1 m 2 p1[jj] 0.000000e+00 0

ii 2 m 3 p1[jj] 0.000000e+00 0

ii 3 m 4 p1[jj] 0.000000e+00 0

ii 4 m 5 p1[jj] 0.000000e+00 0

ii 5 m 6 p1[jj] 0.000000e+00 6

[/codebox]

I would like to get the same result for the release case. Is this possible? It seems that only the last iteration is save in the array, why is that?

Do I have to use __threadfence() ,syncthreads() or anything else ? I already tried __syncthreads() and it did not change anything.

Thank you for your help

Haw can I edit the topic title?

When threads in a warp execute atomic operations simultaneously, only 1 thread is guaranteed to “win”. The emulation warp size is 1, the device warp size is 32.

So what you imagine will happen cannot. Which begs the obvious question, why use atomic operations in this case anyway? It certainly isn’t necessary.

The only reason for atomic operation is because without them, what I want (save the results of each thread) cannot work or I did not find a method to make it work.
So, what is the usefullnes of atomic operations?

They are useful if you use them properly. You aren’t. Look at your code:

atomicAdd(&counter,1); // increment the (saving)array index

atomicExch(&array_counter[counter],jj); // save for each thread, this count in an array

Your access to counter in the atomicExch call is not atomic. There is no guarantee of what value of counter will be used.

I just found a solution to my problem (part of it).

I need to use one array_counter element per thread.

replace:

atomicAdd(&counter,1);		  //	  increment the (saving)array index

		atomicExch(&array_counter[counter],jj);		 //	  save for each thread, this count in an array

by:

array_counter[ix+1] = jj;

emulation:

ii 1 m  2   p1[jj] 0.000000e+00  1

ii 2 m  3   p1[jj] 0.000000e+00  2

ii 3 m  4   p1[jj] 0.000000e+00  3

ii 4 m  5   p1[jj] 0.000000e+00  4

ii 5 m  6   p1[jj] 0.000000e+00  5

release:

ii 1 m  2   p1[jj] 0.000000e+00  1

ii 2 m  3   p1[jj] 0.000000e+00  2

ii 3 m  4   p1[jj] 0.000000e+00  3

ii 4 m  5   p1[jj] 0.000000e+00  4

ii 5 m  6   p1[jj] 0.000000e+00  5

Now I can start to make progress again.

Thanks

"When threads in a warp execute atomic operations simultaneously, only 1 thread is guaranteed to “win”. "

Are you sure?
I supposed after 6 atomic add 6 exchanges were performed. And all with counter=6.

The thing with that code is that the counter read in the exchange is not atomic, and I wouldn’t like to guess when that read is (a) scheduled and (B) executed relative to the the atomic add that precedes it.

The correct (but inefficient for large numbers of threads) version of what you are trying would be

__global__  void test0D_cu(int j )

{

int jj,ix;

		ix= threadIdx.x+0;			  //   from	   0   to  4

		jj = j + ix - 2;				//	  "	   2	   6

		int counter_value = atomicAdd(&counter,1) + 1;	  //	  increment the (saving)array index

		array_counter[counter_value] = jj; 				 //	  save for each thread, this count in an array

}

which ensures that each thread uses the value of counter before that thread’s atomicAdd.

Thanks a lot for your answers, but how do you edit the topic title?