Parallel Reduction with shared memory

To reduce 192 elements, I’ve decided to use 64 threads. First, we add elements 128 to 191 into elements 0 to 63; after that it is back to the usual parallel reduction case.

Why, then, does this code:

if (tid < 64) {
  smem[tid] += smem[tid + 128];
  __syncthreads();
}
if (tid < 64) {
  smem[tid] += smem[tid + 64];
  __syncthreads();
}
// unrolling warp
if (tid < 32) {
  volatile float* vsmem = smem;
  vsmem[tid] += vsmem[tid + 32];
  vsmem[tid] += vsmem[tid + 16];
  vsmem[tid] += vsmem[tid + 8];
  vsmem[tid] += vsmem[tid + 4];
  vsmem[tid] += vsmem[tid + 2];
  vsmem[tid] += vsmem[tid + 1];
}

not provide the same results as this code:

if(tid ==0) {
  unsigned int shift = 64;
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + 128];
  }
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 32
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 16
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 8
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 4
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 2
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  smem[0] += smem[1];
}

It seems the volatile keyword is not doing what it is supposed to do, because launching this code in debug mode works… My book said that volatile ensured the compiler would not mess with this part.

I don’t know if you’re claiming that one of these gives an incorrect answer, or which one it is. The first one seems to give the correct answer for me (in release mode or debug mode). You may have an issue in code you haven’t shown. (For example, it may be necessary to put a __syncthreads() after the code you have shown, before inspecting smem[0], although that seems unlikely to me to be the issue.)

The second one is not a parallel reduction so I personally wouldn’t bother to analyze it.

I tried adding a sync; it didn’t change anything.

if(tid < 32) {
  // previous code
}
if (tid == 0) {
  __syncthreads();
  global_mem[some_id] = smem[0] / output_factor;
}

__syncthreads() inside that conditional construct would be illegal.
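If a barrier is needed there at all, the legal placement is outside the conditional, so that every thread in the block reaches it; a sketch along these lines (reusing the names from your snippet):

if (tid < 32) {
  // previous warp-unrolled code
}
__syncthreads();  // reached by every thread in the block
if (tid == 0) {
  global_mem[some_id] = smem[0] / output_factor;
}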

Anyway, it works for me. If you want to provide complete code, something else may be discovered.

Here’s my test, on CentOS 7, CUDA 10.1.243, Tesla V100:

$ cat t1636.cu
#include <iostream>
const int ds = 192;
typedef float mt;
template <typename T>
__global__ void r1(T *in, T *out){

  __shared__ T smem[ds];
  int tid = threadIdx.x;
  smem[tid] = in[tid];
  smem[tid+64] = in[tid+64];
  smem[tid+128] = in[tid+128];
  __syncthreads();

  if (tid < 64) {
    smem[tid] += smem[tid + 128];
    __syncthreads();
  }
  if (tid < 64) {
    smem[tid] += smem[tid + 64];
    __syncthreads();
  }
  // unrolling warp
  if (tid < 32) {
    volatile T* vsmem = smem;
    vsmem[tid] += vsmem[tid + 32];
    vsmem[tid] += vsmem[tid + 16];
    vsmem[tid] += vsmem[tid + 8];
    vsmem[tid] += vsmem[tid + 4];
    vsmem[tid] += vsmem[tid + 2];
    vsmem[tid] += vsmem[tid + 1];
  }
  if (tid == 0) *out = smem[tid];
}

int main(){

  mt *d_d, *h_d, *d_r, *h_r;
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  h_d = new mt[ds];
  cudaMalloc(&d_r, sizeof(d_r[0]));
  h_r = new mt[1];
  for (int i = 0; i < ds; i++) h_d[i] = 1;
  cudaMemcpy(d_d, h_d, ds*sizeof(d_d[0]), cudaMemcpyHostToDevice);
  r1<<<1,64>>>(d_d, d_r);
  cudaMemcpy(h_r, d_r, sizeof(d_r[0]), cudaMemcpyDeviceToHost);
  std::cout << h_r[0] << std::endl;
}
$ nvcc -o t1636 t1636.cu
$ cuda-memcheck ./t1636
========= CUDA-MEMCHECK
192
========= ERROR SUMMARY: 0 errors
$

Thank you for your time. Indeed, your code snippet works as intended. I think I found the problem by playing with your code instead of mine. It seems it was a bad idea to use an 8x24 block instead of a 1x192 one. Do you know why?

#include <iostream>
const int ds = 192;
const unsigned int nb_sources_per_sector = ds / 8;
typedef float mt;
__constant__ float d_const_of[nb_sources_per_sector];

template <typename T>
__device__ T r2(T var, T in_var, T const_var) {
  return var * const_var / in_var;
}

template <typename T>
__global__ void r1(T var, T *in, T *out){

  __shared__ T smem[ds];
  const unsigned int sector_id = threadIdx.x;
  const unsigned int source_id = threadIdx.y;
  const unsigned int tid = source_id + sector_id * nb_sources_per_sector;
  smem[tid] = r2<float>(var,in[tid],d_const_of[source_id]);
  __syncthreads();

  if (tid < 64) {
    smem[tid] += smem[tid + 128];
    __syncthreads();
  }
  if (tid < 64) {
    smem[tid] += smem[tid + 64];
    __syncthreads();
  }
  // unrolling warp
  if (tid < 32) {
    volatile T* vsmem = smem;
    vsmem[tid] += vsmem[tid + 32];
    vsmem[tid] += vsmem[tid + 16];
    vsmem[tid] += vsmem[tid + 8];
    vsmem[tid] += vsmem[tid + 4];
    vsmem[tid] += vsmem[tid + 2];
    vsmem[tid] += vsmem[tid + 1];
  }
  if (tid == 0) *out = smem[tid];
}

int main(){

  mt var = 2.0f;
  mt *d_d, *h_d, *d_r, *h_r, *h_of;
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  h_d = new mt[ds];
  cudaMalloc(&d_r, sizeof(d_r[0]));
  h_r = new mt[1];
  h_of = new mt[nb_sources_per_sector];

  for (int i = 0; i < nb_sources_per_sector; i++) h_of[i] = 0.5f;
  for (int i = 0; i < ds; i++) h_d[i] = 0.1f;
  
  cudaMemcpyToSymbol(d_const_of, h_of, nb_sources_per_sector * sizeof(*h_of));
  cudaMemcpy(d_d, h_d, ds*sizeof(d_d[0]), cudaMemcpyHostToDevice);
  
  const dim3 block_size(8, nb_sources_per_sector);  // 192 threads in 2D

  r1<<<1,block_size>>>(var, d_d, d_r);
  
  cudaMemcpy(h_r, d_r, sizeof(d_r[0]), cudaMemcpyDeviceToHost);
  std::cout << h_r[0] << std::endl;
}

When assembling warps, threads are grouped in x, then y, then z. You don’t have any control over this:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
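For reference, that linearization can be written as a small device-side sketch (built-in variables only); this is the index, in order, along which threads are grouped into warps of 32:

int hw_linear_tid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
// consecutive values (0-31, 32-63, ...) form the warps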

To have a user-created linear thread ID within a block that matches the way warps are assembled (so that, for example, the first warp contains the created thread IDs 0 to 31), it’s necessary for you to build your thread ID variable correctly.

This won’t achieve that goal:

const unsigned int tid = source_id + sector_id * nb_sources_per_sector;

You would want to do something like this:

const unsigned int tid = sector_id + source_id * nb_sources_per_sector;

I’m not suggesting that is a one-line fix for your code, but you would want to assemble your tid variable in such a way that the underlying threadIdx.x is not multiplied by anything (other than 1). When you multiply it by 24, constructs like:

if (tid < XX)

no longer do what you think they do. They are picking a selection of threads scattered across warps.
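As a rough illustration, here is a small host-side sketch (not your code) that mirrors the 8x24 block and prints which hardware warp each thread with tid < 32 actually lives in:

#include <iostream>
int main(){
  const int bx = 8, by = 24;               // block dimensions from your kernel launch
  for (int y = 0; y < by; y++)
    for (int x = 0; x < bx; x++){
      int hw  = x + y * bx;                // hardware linear id; warp index = hw / 32
      int tid = y + x * by;                // tid as constructed in your kernel
      if (tid < 32)
        std::cout << "tid " << tid << " -> warp " << hw / 32 << std::endl;
    }
  return 0;
}

Those 32 threads turn out to be spread across warps 0 through 5, so the warp-synchronous (volatile) part of the reduction cannot be relied on.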

Thanks for the useful info. I’ve expanded my constant memory array from 24 to 192 entries by duplicating the information. It avoids the need for the source_id variable, and it also avoids using a modulo. Instead of

const unsigned int source_id = tid % 24;
constexpr unsigned short table[24] = {
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12};

the table becomes

constexpr unsigned short table[192] = {
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 1
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 2
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 3
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 4
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 5
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 6
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 7
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12};  // sector 8

Now I can use a 1x192 block.
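For the record, a minimal sketch of what the kernel’s load might look like with the 1x192 block and the expanded 192-entry constant array (assuming d_const_of now holds 192 entries; this is not the exact code):

// launched as r1<<<1, 192>>>(var, d_d, d_r);
const unsigned int tid = threadIdx.x;                   // 0..191, matches warp order
smem[tid] = r2<float>(var, in[tid], d_const_of[tid]);   // no source_id, no modulo
__syncthreads();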