Could someone also explain to me why, with the above code, the ouSum array contains only the last sum value? (Every index of ouSum contains only the last value.)
It would mean that only the block that has y-index = 0 would do something. What do you mean by “beginning of the block”?
To your second post: What do you mean by “last sum value”? Maybe you can give more information on the code. For example, define Idx. How do you call the kernel? What data do you put in?
By (3) I mean: why are we doing that (filling shared memory only with threadIdx.y)?
By (4): OK, only block 0 will do something, but again, why? (Same question as for (3).)
For the second:
int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;
int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;
int Idx = x + y * ( blockDim.x * gridDim.x );
So, Idx is the global index.
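To make the index arithmetic above easier to check by hand, here is a minimal host-side sketch in plain C++ (no GPU needed). The struct Dim2 and the function globalIndex are hypothetical names made up for this illustration; the CUDA built-ins blockIdx, blockDim, gridDim and threadIdx are simply passed in as ordinary arguments.

```cpp
#include <cassert>

// Host-side stand-in for the x/y part of CUDA's built-in index types
// (hypothetical helper, for illustration only).
struct Dim2 { int x; int y; };

// Same arithmetic as the three lines above, with the CUDA built-ins
// passed in explicitly so it can run on the CPU.
int globalIndex(Dim2 blockIdx, Dim2 blockDim, Dim2 gridDim, Dim2 threadIdx) {
    // A thread index must lie inside its block, a block index inside the grid.
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    assert(blockIdx.x < gridDim.x && blockIdx.y < gridDim.y);

    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    return x + y * (blockDim.x * gridDim.x);        // row-major linear index
}
```

For example, globalIndex({0,0}, {2,2}, {1,1}, {1,1}) gives 3 for thread (1,1) of a single 2x2 block. With a launch of one block containing one thread, the only possible call is globalIndex({0,0}, {1,1}, {1,1}, {0,0}), which gives 0.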
kernel:
myshared<<< 1,1 >>>( mySize, devData,devSum );
where mySize is, for example, 4 and devData is an array with the values [0, 1, 2, 3].
So I expect devSum to be the array [0, 1, 3, 6], but instead I am getting [6, 6, 6, 6].
I don’t know… It is your algorithm.
Maybe you should share the purpose of the algorithm. What do you want to do? Where did you get the code from? From your last example I may conclude you want to compute the following sums (in math notation)
As you already found out, this code computes the full sum and stores it in each element of the output array. But it does this in a complicated way.
First it stores every blockDim.x-th element, starting with element 0, in shared memory. Then in lines 26 to 30 it sums up these elements in ALL threads. Then (next i iteration) it again stores every blockDim.x-th element, this time starting from element 1, and sums up again. And so on…
mySum always has the same value in each thread.
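To spell out the difference between what was expected and what was observed, here is a small CPU-only sketch in plain C++. It is not a rewrite of the kernel, and the function names prefixSums and totalInEverySlot are made up for this illustration. It assumes the expected output [0, 1, 3, 6] means a running (inclusive prefix) sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What the expected output suggests: out[i] = in[0] + ... + in[i].
std::vector<int> prefixSums(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    // The last prefix sum is the total of all inputs.
    assert(out.empty() || out.back() == running);
    return out;
}

// What was observed: the full sum written to every element.
std::vector<int> totalInEverySlot(const std::vector<int>& in) {
    int total = 0;
    for (int v : in) total += v;
    return std::vector<int>(in.size(), total);
}
```

With the input {0, 1, 2, 3}, prefixSums returns {0, 1, 3, 6} and totalInEverySlot returns {6, 6, 6, 6}. Only the last element agrees, which fits the observation that every index holds "the last sum value".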
Sorry. I don’t have the time now to rewrite your code, i.e. to answer 3.
But to 1) and 2):
If you call a kernel with, for example, a 2x2 thread block, then every one of the 2*2 = 4 threads will execute each line of the kernel code. That means lines 26 to 30, for example, will be executed by all threads (because there is no condition on these lines), whereas line 21 is executed by only 2 threads (the threads with y coordinate == 0).
In line 21 these 2 threads (threadIdx.y == 0) copy 2 elements to shared memory (for i==0: inData[0] and inData[2]; for i==1: inData[1] and inData[3]).
Now, you said we have 2x2 threads, hence 2 threads in the x direction and 2 in the y direction.
So, when threadIdx.y == 0, only 1 thread will write to shared memory. Why are you saying 2 threads?
If you have 2x2 threads, that means 4 threads, with the indices (threadIdx.x, threadIdx.y) in the following combinations: (0,0), (1,0), (0,1), (1,1).
There are two with threadIdx.y == 0.
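If you want to check the count yourself, here is a minimal sketch in plain C++ that walks over all thread coordinates of one block on the CPU and keeps those with y index 0. The function name threadsWithYZero is made up for this illustration; on the GPU the same test would just be the condition threadIdx.y == 0 inside the kernel.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Lists the (threadIdx.x, threadIdx.y) pairs of one block whose y index is 0.
std::vector<std::pair<int, int>> threadsWithYZero(int blockDimX, int blockDimY) {
    assert(blockDimX > 0 && blockDimY > 0);
    std::vector<std::pair<int, int>> selected;
    for (int ty = 0; ty < blockDimY; ++ty) {
        for (int tx = 0; tx < blockDimX; ++tx) {
            if (ty == 0) {                      // the condition under discussion
                selected.push_back({tx, ty});
            }
        }
    }
    return selected;
}
```

For a 2x2 block, threadsWithYZero(2, 2) returns the two pairs (0,0) and (1,0): one thread per x index, so the number of threads passing the condition is blockDim.x, not 1.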