Hello everyone!
I have a kernel with these launch parameters:
blocksize (64, 1)
I also have a shared memory array:
__shared__ float vals[64][36];
What I want to accomplish is very simple: I want to sum all the rows for each column.
So:
thread0 will sum vals[0][0] + vals[1][0] + … + vals[63][0]
thread1 will sum vals[0][1] + vals[1][1] + … + vals[63][1]
(You may have noticed that I have 36 columns instead of 64; that's because I wanted to reduce my shared memory usage for my specific problem, so I have a "mapping" array that maps threadIdx.x to an index in the range [0, 35]. So my code actually looks like this:)
float sum = 0.0f;
for (int j = 0; j < 64; j++)
    sum += vals[j][mapping[localId]]; // ignore the "mapping" array
global_mem[outputIndex] = sum;
Each thread sums all the rows of its corresponding column and writes the sum to global memory (outputIndex is a transposed index: (y * width) + x).
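To make the layout concrete, here is a host-side C++ sketch (not the actual CUDA kernel) of the computation described above; ROWS, COLS, the direct column indexing (standing in for the "mapping" lookup), and the y/width values are assumptions for illustration:

```cpp
#include <cstdio>

// Host-side C++ sketch of the column-sum with a transposed write.
// ROWS x COLS matches the shared array vals[64][36] from the post.
constexpr int ROWS = 64;
constexpr int COLS = 36;

void columnSums(const float vals[ROWS][COLS], float* out, int width, int y) {
    for (int x = 0; x < COLS; ++x) {       // one loop iteration per GPU thread
        float sum = 0.0f;
        for (int j = 0; j < ROWS; ++j)     // walk down the column, as in the kernel
            sum += vals[j][x];
        out[y * width + x] = sum;          // transposed-style index: (y * width) + x
    }
}
```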
I was thinking of optimizing this. I can't use warp shuffling because I think it would be way too heavy for this problem, correct? I did implement it, but as expected my compute throughput went up to almost 100%! I also tried libraries like CUB, with the same result. In this case, is it possible that a simple non-reduction sum is faster? I feel there must be another way to do this, but I can't figure it out!
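For what it's worth, a shuffle- or CUB-style tree reduction mainly pays off when many threads cooperate on a single sum; here each thread owns one independent column sum, so the plain loop already keeps every thread busy with no cross-thread communication. This CPU-side C++ sketch (my own illustration, not CUB's API) shows what a tree reduction does so the tradeoff is visible:

```cpp
#include <vector>
#include <cstddef>

// Pairwise (tree) reduction: halves the number of partial sums each
// step, finishing in about log2(N) steps -- but only if N threads
// cooperate on ONE sum. With 36 independent column sums across 64
// threads, that cooperation buys nothing, which may be why the
// shuffle and CUB versions showed no benefit.
float treeSum(std::vector<float> v) {
    while (v.size() > 1) {
        std::size_t half = (v.size() + 1) / 2;
        for (std::size_t i = 0; i + half < v.size(); ++i)
            v[i] += v[i + half];            // pair element i with element i + half
        v.resize(half);                     // half as many partial sums remain
    }
    return v.empty() ? 0.0f : v[0];
}
```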