I want to sum-reduce values stored in shared memory in CUDA as efficiently as possible. I have tried using CUB, but it computes wrong results, obviously because of wrong usage on my part. Here is my scenario:
I have a shared memory of size: [8][64].
My block size is (64,1) (threadIdx.x → [0,63], threadIdx.y → [0])
I know that I can do this easily by assigning one row to each of the first 8 threads:
thread0: will sum 64 values of the first row
…
thread7: will sum 64 values of the last row
these 8 sums will be stored in global memory later (don’t care about this part).
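Something like this (just a sketch to illustrate the idea, not my exact code):
if (threadIdx.x < 8)
{
    float row_sum = 0.0f;
    for (int i = 0; i < 64; i++)
        row_sum += shared_mem[threadIdx.x][i]; // thread t walks the 64 values of row t
    // row_sum is what gets written to global memory later (omitted)
}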
But this summation introduces heavy performance losses, which Nsight Compute reports as shared store bank conflicts.
...
using WarpReduce = cub::WarpReduce<float>;
__shared__ typename WarpReduce::TempStorage temp_storage[2]; // 2 warps
int warp_id = threadIdx.x / 32;
// sum 8 rows
float sums[8];
for (int i = 0; i < 8; i++) {
    float thread_data = shared_mem[i][threadIdx.x];
    sums[i] = WarpReduce(temp_storage[warp_id]).Sum(thread_data);
    __syncwarp(); // needed here? I put it for safety
}
__syncthreads(); // I think it is needed, again for safety because we have 2 warps.
// now in theory I should check threadIdx.x == 32 to have the "final sums" of each row:
if (threadIdx.x == 32)
{
    // store sums[0], ..., sums[7];
}
Do you find anything wrong with the code snippet?
Thank you!
Oh, I was wrong: I should combine the results of BOTH warps, not only read thread 32! I thought WarpReduce gave a “cumulative” sum across the whole block, but it is the sum of each warp only, which makes sense.
I added this code to put the per-warp sums into a shared memory array of size [8][2] (one entry per warp):
if (localId == 0 || localId == 32)
{
    for (int i = 0; i < 8; i++)
        rxSums[i][warp_id] = sums[i];
}
__syncthreads();
and it works!
…but it is slower than my “custom” reduction, which uses all 64 threads to partially sum 8 horizontal values per row, writes the results into another shared memory array of size [8][8], and then sums that final array.
__shared__ float sharedPartial[8][8];
float reduction_sum_rx = 0.0f;
for (int i = 0; i < 8; i++)
    reduction_sum_rx += shared[localId / 8][((localId % 8) * 8) + i]; // each thread sums 8 consecutive values of its row
sharedPartial[localId % 8][localId / 8] = reduction_sum_rx; // transposed write of the partial sum
__syncthreads();
float row_sum = 0.0f;
// write to global memory, ignore this part; here we sum the 8 partial sums per row
if (localId < 8)
{
    for (int i = 0; i < 8; i++)
        row_sum += sharedPartial[i][localId];
    global_mem[(outputIndex / 8) + localId] = row_sum;
}
You could assign 8 threads per row. Offset the reads from one row to the next to avoid bank conflicts. Loop 8 times to read all the values in each row into running sums in each thread. Then use warp shuffle to reduce groups of 8 threads into row sums.
__shared__ float shared[8][64];
float ps = 0;
int my_row = threadIdx.x >> 3;
for (int i = 0; i < 64; i+=8)                    // 8 loop iterations
    ps += shared[my_row][(threadIdx.x + i)&63];  // create per-thread partial sums
for (int i = 4; i > 0; i>>=1)                    // 3 loop iterations
    __shfl_down_sync(0xFFFFFFFF,ps,i);           // reduce 32 thread partial sums to 4 row-sums per warp
if ((threadIdx.x & 7) == 0)
    global_sums[my_row] = ps;
EDIT: as per below, the above code has an error in it. Use the code in my later/latest posting in this thread
I am having a bit of trouble understanding what’s going on here. For example this:
if ((threadIdx.x & 7) == 0)
global_sums[my_row] = ps;
checks whether threadIdx.x is divisible by 8, if I am not mistaken, no? It would write from threads 0, 8, 16, 24, …, 56! This would break my global memory write. Can I assume that I can write to global memory as before? (I was writing 8 values to global memory per block of 64 threads; the first 8 threads would write their sums to the corresponding place for “x, y”):
if (localId < 8)
{
    const int outputIndex = (y * width) + x;
    for (int i = 0; i < 8; i++)
        row_sum += sharedPartial[i][localId];
    global_mem[(outputIndex / 8) + localId] = row_sum;
}
Why do you say that? It’s writing to indices specified by my_row, not threadIdx.x. For threads 0…63, my_row spans 0…7; only threads 0, 8, 16, …, 56 pass the (threadIdx.x & 7) == 0 test, and each of those 8 threads has a distinct my_row, so exactly 8 row sums are written.
FWIW I have lightly tested this code. It runs without any kind of error such as you are imagining with this particular comment. Here is a full test case:
# cat t296.cu
#include <iostream>
template <typename T>
__global__ void k(T *global_sums){
__shared__ T shared[8][64];
for (int i = 0; i < 8; i++) shared[i][threadIdx.x] = i+1;
__syncthreads();
T ps = 0;
int my_row = threadIdx.x >> 3;
for (int i = 0; i < 64; i+=8)
ps += shared[my_row][(threadIdx.x + i)&63];
for (int i = 4; i > 0; i>>=1)
__shfl_down_sync(0xFFFFFFFF,ps,i); // reduce 32 results to 4 per warp
if ((threadIdx.x & 7) == 0)
global_sums[my_row] = ps;
}
int main(){
float *d;
cudaMallocManaged(&d, 8*sizeof(d[0]));
k<<<1,64>>>(d);
cudaDeviceSynchronize();
for (int i = 0; i < 8; i++)
std::cout << "row: " << i << " sum: " << d[i] << std::endl;
}
# nvcc -o t296 t296.cu
# compute-sanitizer ./t296
========= COMPUTE-SANITIZER
row: 0 sum: 8
row: 1 sum: 16
row: 2 sum: 24
row: 3 sum: 32
row: 4 sum: 40
row: 5 sum: 48
row: 6 sum: 56
row: 7 sum: 64
========= ERROR SUMMARY: 0 errors
#
I didn’t spend much time trying to organize the 8 row sum results in global memory in any particular way, based on your comment here:
so I just stored them in a set of 8 adjacent locations in global memory.
EDIT: as per below, the above code has an error in it. Use the code in my later/latest posting in this thread
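(For reference: the error is that the return value of __shfl_down_sync is discarded, so the per-thread partial sums are never combined. Accumulating it back into ps, the test case becomes something like the following sketch; this is a reconstruction along those lines, not a quote of that later posting.)
#include <iostream>
template <typename T>
__global__ void k(T *global_sums){
  __shared__ T shared[8][64];
  for (int i = 0; i < 8; i++) shared[i][threadIdx.x] = i+1;
  __syncthreads();
  T ps = 0;
  int my_row = threadIdx.x >> 3;
  for (int i = 0; i < 64; i+=8)
    ps += shared[my_row][(threadIdx.x + i)&63]; // per-thread partial sums, offset reads to avoid bank conflicts
  for (int i = 4; i > 0; i>>=1)
    ps += __shfl_down_sync(0xFFFFFFFF, ps, i);  // accumulate the shuffled value: this assignment is the fix
  if ((threadIdx.x & 7) == 0)                   // lanes 0, 8, ..., 56 now hold the 8 row sums
    global_sums[my_row] = ps;
}
int main(){
  float *d;
  cudaMallocManaged(&d, 8*sizeof(d[0]));
  k<<<1,64>>>(d);
  cudaDeviceSynchronize();
  for (int i = 0; i < 8; i++)
    std::cout << "row: " << i << " sum: " << d[i] << std::endl; // with the fix this should print 64, 128, ..., 512
}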
I think I may not have clarified it enough in my original post, but what I wanted to do was to sum each row (64 values) and store each row sum “independently”. In your example, with this initialization:
for (int i = 0; i < 8; i++) shared[i][threadIdx.x] = i+1;
__syncthreads();
for example, shared[0] would be [1, 1, 1, …, 1] (64 ones) → its sum, the first output, should be 64;
shared[1] summed would be 128, because 2 + 2 + … + 2 (64 times) → 128, and so on.
This is why I was trying to use CUB 8 times before (that code works, but based on the profiler metrics I have a feeling it is not very efficient, though I am not an expert).
const int localId = threadIdx.x;
const int warpId = localId / 32;
__shared__ float rxLocal[8][64];
__shared__ float rxSums[2][8];
...
float rxThreadSums[8];
#pragma unroll
for (int i = 0; i < 8; i++)
    rxThreadSums[i] = cub::WarpReduce<float>(tempStorage[warpId]).Sum(rxLocal[i][localId]);
__syncthreads();
if (localId == 0 || localId == 32)
{
    #pragma unroll
    for (int i = 0; i < 8; i++)
        rxSums[warpId][i] = rxThreadSums[i];
}
__syncthreads();
if (localId < 8)
    rx[(outputIndex / 8) + localId] = rxSums[0][localId] + rxSums[1][localId];
Thank you again Robert! I will now try to understand warp shuffles and your code, because I was avoiding them (I was expecting CUB to work it out, but it seems overkill here; I think it would be perfect if only one summation were needed). I will close the issue because you have provided a lot of help, thanks again!