if threadIdx.y == 0, what does this mean? (taking the last sum value problem)

Hello, I wanted to ask: if I have code like this:

...
__shared__ float SharedSum[ 32 ];

float theSum = 0.0f;
for ( int i = 0; i < inSize; i += blockDim.x )
{
	if ( threadIdx.y == 0 )	SharedSum[ threadIdx.x ] = inData[ i + threadIdx.x ];

	__syncthreads();

	for ( int j = 0; j < blockDim.x; j++ )
	{
		theSum += SharedSum[ j ];
		__syncthreads();
	}
}

*( ouSum + Idx ) = theSum;

What exactly does the if ( threadIdx.y == 0 ) mean?

  1. Does it mean that only the threads with y-index zero are responsible for filling the values into shared memory?

  2. And the rest of the y threads? What are they doing? Nothing?

  3. What are the consequences in the rest of the code?

  4. If instead of threadIdx.y I had blockIdx.y, would it mean that the shared memory is filled at the beginning of every block?

I would be grateful if someone explained all the concepts using this approach. It will help me understand some things better!

Thank you very much!

Any suggestions on the above questions, please?

Could someone also explain to me why, with the above code, the ouSum array contains only the last sum value? (Every index of ouSum contains only the last value.)

Even if I move

*( ouSum + Idx ) = theSum;

to just after

theSum += SharedSum[ j ];

I still get the same result! How is that?

Thank you!

  1. yes.
  2. yes, nothing.
  3. What do you mean exactly?
  4. It would mean that only the blocks with y-index == 0 would do something. What do you mean by “beginning of the block”?

To your second post: what do you mean by “last sum value”? Maybe you can give more information on the code. For example, define Idx. How do you call the kernel? What data do you put in?

Hello, and thanks for the help.

By (3) I mean: why are we doing that? (Filling shared memory only from the threads with threadIdx.y == 0.)
By (4): OK, only the blocks with y-index 0 will do something, but again, why? (Same as (3).)

For the second:

int x = ( blockIdx.x * blockDim.x ) + threadIdx.x; 
int y = ( blockIdx.y * blockDim.y ) + threadIdx.y; 
int Idx = x + y * ( blockDim.x * gridDim.x );

So, Idx is the global index.

kernel:

myshared<<< 1,1 >>>( mySize, devData,devSum );

where mySize, for example, is 4 and devData is an array with the values [ 0, 1, 2, 3 ].
So, I expect devSum to be the array [ 0, 1, 3, 6 ], but instead I am getting [ 6, 6, 6, 6 ].

I don’t know… It is your algorithm.
Maybe you should share the purpose of the algorithm. What do you want to do? Where did you get the code from? From your last example I conclude that you may want to compute the following sums (in math notation):

out_i = sum_{j=0 to i} in_j

Here is the code; it calculates the cumulative sum.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

using namespace std;

// Error-checking macro (gpuErrchk was not defined in the original post;
// this is the common idiom it usually refers to)
#define gpuErrchk(ans) { gpuAssert( (ans), __FILE__, __LINE__ ); }
inline void gpuAssert( cudaError_t code, const char * file, int line )
{
	if ( code != cudaSuccess )
	{
		fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString( code ), file, line );
		exit( code );
	}
}

__global__ void AccumSum(
		size_t  const inSize,
		float * const inData,
		float * const ouSum )
{
	int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;
	int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;
	size_t Idx = x + y * ( blockDim.x * gridDim.x );

	__shared__ float SharedSum[ 32 ];

	float mySum = 0.0f;
	for ( int i = 0; i < inSize; i += blockDim.x )
	{
		if ( threadIdx.y == 0 )	SharedSum[ threadIdx.x ] = inData[ i + threadIdx.x ];

		__syncthreads();

		for ( int j = 0; j < blockDim.x; j++ )
		{
			mySum += SharedSum[ j ];
			__syncthreads();
		}
	}

	*( ouSum + Idx ) = mySum;
}

int
main(
	         int argc,
	const char * argv[] )
{
	const int mySize = 4;

	int BlocksPerGridX   = 1;
	int BlocksPerGridY   = 1;

	int ThreadsPerBlockX = 2;
	int ThreadsPerBlockY = 2;

	dim3 BlocksDim ( BlocksPerGridX, BlocksPerGridY );
	dim3 ThreadsPerBlock ( ThreadsPerBlockX, ThreadsPerBlockY );

	// host memory
	float * myData, * mySum;
	myData = (float*) malloc( mySize * sizeof(*myData) );
	mySum  = (float*) malloc( mySize * sizeof(*mySum) );

	// fill host with data
	for ( int i = 0; i < mySize; i++ )
		myData[ i ] = i;

	// device memory
	float * devData, * devSum;
	gpuErrchk( cudaMalloc( (void**) &devData, mySize * sizeof(*devData) ) );
	gpuErrchk( cudaMalloc( (void**) &devSum, mySize * sizeof(*devSum) ) );

	gpuErrchk( cudaMemcpy( devData, myData, mySize * sizeof(*devData), cudaMemcpyHostToDevice ) );

	AccumSum<<< BlocksDim, ThreadsPerBlock >>>( mySize, devData, devSum );

	gpuErrchk( cudaMemcpy( mySum, devSum, mySize * sizeof(*mySum), cudaMemcpyDeviceToHost ) );

	for ( int i = 0; i < mySize; i++ )
		printf( "\n Sum = %f", mySum[ i ] );

	printf( "\n" );

	free( myData );
	free( mySum );

	gpuErrchk( cudaFree( devData ) );
	gpuErrchk( cudaFree( devSum ) );

	return 0;
}

As you already found out, this code computes the full sum and stores it in each element of the output array. But it does this in a complicated way.

First it stores the first blockDim.x elements of inData in shared memory. Then, in lines 26 to 30, ALL threads sum up these elements. Then (in the next i iteration) it stores the next blockDim.x elements and sums them up again. And so on…
mySum therefore always has the same value in each thread.

Ok,

  1. About the threadIdx.y == 0: why use it? What happens if we don't?

  2. When you say:

Why ALL threads? How do you see that from the code?

  3. How can I get the sum I want, [ 0, 1, 3, 6 ], instead of [ 6, 6, 6, 6 ]?

Thank you very much for your help!!

Sorry, I don’t have the time right now to rewrite your code, i.e. to answer 3.

But to 1) and 2):
If you call a kernel with, for example, a 2x2 thread block, then each of the 2*2 = 4 threads will execute every line of the kernel code. That means lines 26 to 30, for example, will be executed by all threads (because there is no condition on those lines), whereas line 21 is executed by only 2 threads (the threads with y coordinate == 0).
In line 21 these 2 threads (threadIdx.y == 0) copy 2 elements to shared memory (for i == 0: inData[0] and inData[1], and for i == 2: inData[2] and inData[3]).

Ok, thank you.

Now, you said we have 2x2 threads, hence 2 threads in the x direction and 2 in the y direction.
So, when threadIdx.y == 0, only 1 thread will write to shared memory. Why are you saying 2 threads?

If you have 2x2 threads, that means 4 threads with the indices (threadIdx.x, threadIdx.y) in the following combinations: (0,0), (1,0), (0,1), (1,1)
There are two with threadIdx.y == 0.

Ok , I got it now, thanks!

But why does he use this?

Probably all threads need all the data elements. They are cached in shared memory and then read from there.

OK, thanks for your help.

If anyone can tell me how to modify the above code so that the output array contains all the accumulated values, I'd appreciate it!