Parallel Reduction with shared memory

To reduce 192 elements, I’ve decided to use 64 threads. First, we add elements 128 to 191 into elements 0 to 63; after that it is back to the usual parallel reduction case.

Why, then, does this code:

if (tid < 64) {
  smem[tid] += smem[tid + 128];
  __syncthreads();
}
if (tid < 64) {
  smem[tid] += smem[tid + 64];
  __syncthreads();
}
// unrolling warp
if (tid < 32) {
  volatile float* vsmem = smem;
  vsmem[tid] += vsmem[tid + 32];
  vsmem[tid] += vsmem[tid + 16];
  vsmem[tid] += vsmem[tid + 8];
  vsmem[tid] += vsmem[tid + 4];
  vsmem[tid] += vsmem[tid + 2];
  vsmem[tid] += vsmem[tid + 1];
}

not provide the same results as this code:

if(tid ==0) {
  unsigned int shift = 64;
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + 128];
  }
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 32
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 16
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 8
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 4
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  shift /= 2;  // 2
  for (unsigned int i = 0; i < shift; i++) {
    smem[i] += smem[i + shift];
  }
  smem[0] += smem[1];
}

It seems the volatile keyword is not doing what it is supposed to do, because launching this code in debug mode works… My book said that volatile ensured the compiler would not mess with this part.

I don’t know if you’re claiming that one of these gives an incorrect answer, or which one it is. The first one seems to give the correct answer for me (in release mode or debug mode). You may have an issue in code you haven’t shown. (For example, it may be necessary to put a __syncthreads() after the code you have shown, before inspecting smem[0], although that seems unlikely to me to be the issue.)

The second one is not a parallel reduction so I personally wouldn’t bother to analyze it.

I tried adding a sync; it didn’t change anything.

if(tid < 32) {
  // previous code
}
if (tid == 0) {
  __syncthreads();
  global_mem[some_id] = smem[0] / output_factor;
}

__syncthreads() inside that conditional construct would be illegal.
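If a barrier is needed there at all, the legal placement is outside the conditional, so that every thread in the block reaches it; a sketch along these lines (reusing the names from your snippet):

if (tid < 32) {
  // previous warp-unrolled code
}
__syncthreads();  // reached by every thread in the block
if (tid == 0) {
  global_mem[some_id] = smem[0] / output_factor;
}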

Anyway, it works for me. If you want to provide complete code, something else may be discovered.

Here’s my test, on CentOS 7, CUDA 10.1.243, Tesla V100:

$ cat t1636.cu
#include <iostream>
const int ds = 192;
typedef float mt;
template <typename T>
__global__ void r1(T *in, T *out){

  __shared__ T smem[ds];
  int tid = threadIdx.x;
  smem[tid] = in[tid];
  smem[tid+64] = in[tid+64];
  smem[tid+128] = in[tid+128];
  __syncthreads();

  if (tid < 64) {
    smem[tid] += smem[tid + 128];
    __syncthreads();
  }
  if (tid < 64) {
    smem[tid] += smem[tid + 64];
    __syncthreads();
  }
  // unrolling warp
  if (tid < 32) {
    volatile T* vsmem = smem;
    vsmem[tid] += vsmem[tid + 32];
    vsmem[tid] += vsmem[tid + 16];
    vsmem[tid] += vsmem[tid + 8];
    vsmem[tid] += vsmem[tid + 4];
    vsmem[tid] += vsmem[tid + 2];
    vsmem[tid] += vsmem[tid + 1];
  }
  if (tid == 0) *out = smem[tid];
}

int main(){

  mt *d_d, *h_d, *d_r, *h_r;
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  h_d = new mt[ds];
  cudaMalloc(&d_r, sizeof(d_r[0]));
  h_r = new mt[1];
  for (int i = 0; i < ds; i++) h_d[i] = 1;
  cudaMemcpy(d_d, h_d, ds*sizeof(d_d[0]), cudaMemcpyHostToDevice);
  r1<<<1,64>>>(d_d, d_r);
  cudaMemcpy(h_r, d_r, sizeof(d_r[0]), cudaMemcpyDeviceToHost);
  std::cout << h_r[0] << std::endl;
}
$ nvcc -o t1636 t1636.cu
$ cuda-memcheck ./t1636
========= CUDA-MEMCHECK
192
========= ERROR SUMMARY: 0 errors
$

Thank you for your time. Indeed, your code snippet works as intended. I think I found the problem by playing with your code instead of mine. It seems it was a bad idea to use an 8x24 block instead of a 1x192 one. Do you know why?

#include <iostream>
const int ds = 192;
const unsigned int nb_sources_per_sector = ds / 8;
typedef float mt;
__constant__ float d_const_of[nb_sources_per_sector];

template <typename T>
__device__ T r2(T var, T in_var, T const_var) {
  return var * const_var / in_var;
}

template <typename T>
__global__ void r1(T var, T *in, T *out){

  __shared__ T smem[ds];
  const unsigned int sector_id = threadIdx.x;
  const unsigned int source_id = threadIdx.y;
  const unsigned int tid = source_id + sector_id * nb_sources_per_sector;
  smem[tid] = r2<float>(var,in[tid],d_const_of[source_id]);
  __syncthreads();

  if (tid < 64) {
    smem[tid] += smem[tid + 128];
    __syncthreads();
  }
  if (tid < 64) {
    smem[tid] += smem[tid + 64];
    __syncthreads();
  }
  // unrolling warp
  if (tid < 32) {
    volatile T* vsmem = smem;
    vsmem[tid] += vsmem[tid + 32];
    vsmem[tid] += vsmem[tid + 16];
    vsmem[tid] += vsmem[tid + 8];
    vsmem[tid] += vsmem[tid + 4];
    vsmem[tid] += vsmem[tid + 2];
    vsmem[tid] += vsmem[tid + 1];
  }
  if (tid == 0) *out = smem[tid];
}

int main(){

  mt var = 2.0f;
  mt *d_d, *h_d, *d_r, *h_r, *h_of;
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  h_d = new mt[ds];
  cudaMalloc(&d_r, sizeof(d_r[0]));
  h_r = new mt[1];
  h_of = new mt[nb_sources_per_sector];

  for (int i = 0; i < nb_sources_per_sector; i++) h_of[i] = 0.5f;
  for (int i = 0; i < ds; i++) h_d[i] = 0.1f;
  
  cudaMemcpyToSymbol(d_const_of, h_of, nb_sources_per_sector * sizeof(*h_of));
  cudaMemcpy(d_d, h_d, ds*sizeof(d_d[0]), cudaMemcpyHostToDevice);
  
  const dim3 block_size(8, nb_sources_per_sector);  // 192 threads in 2D

  r1<<<1,block_size>>>(var, d_d, d_r);
  
  cudaMemcpy(h_r, d_r, sizeof(d_r[0]), cudaMemcpyDeviceToHost);
  std::cout << h_r[0] << std::endl;
}

When assembling warps, threads are grouped in x, then y, then z. You don’t have any control over this:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
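For reference, that linearization can be written as a small device-side sketch (built-in variables only); this is the index, in order, along which threads are grouped into warps of 32:

int hw_linear_tid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
// consecutive values (0-31, 32-63, ...) form the warps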

To have a user-created linear thread ID within a block that matches the way warps are assembled (so that, for example, the first warp contains the created thread IDs 0 to 31), it’s necessary for you to build your thread ID variable correctly.

This won’t achieve that goal:

const unsigned int tid = source_id + sector_id * nb_sources_per_sector;

You would want to do something like this:

const unsigned int tid = sector_id + source_id * nb_sources_per_sector;

I’m not suggesting that is a one-line fix for your code, but you would want to assemble your tid variable in such a way that the underlying threadIdx.x is not multiplied by anything (other than 1). When you multiply it by 24, constructs like:

if (tid < XX)

no longer do what you think they do. They are picking a selection of threads scattered across warps.
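As a rough illustration, here is a small host-side sketch (not your code) that mirrors the 8x24 block and prints which hardware warp each thread with tid < 32 actually lives in:

#include <iostream>
int main(){
  const int bx = 8, by = 24;               // block dimensions from your kernel launch
  for (int y = 0; y < by; y++)
    for (int x = 0; x < bx; x++){
      int hw  = x + y * bx;                // hardware linear id; warp index = hw / 32
      int tid = y + x * by;                // tid as constructed in your kernel
      if (tid < 32)
        std::cout << "tid " << tid << " -> warp " << hw / 32 << std::endl;
    }
  return 0;
}

Those 32 threads turn out to be spread across warps 0 through 5, so the warp-synchronous (volatile) part of the reduction cannot be relied on.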

Thanks for the useful info. I’ve expanded my constant memory array from 24 to 192 entries by duplicating the information. It avoids the need for the source_id variable, and it also avoids using a modulo. Instead of

const unsigned int source_id = tid % 24;
constexpr unsigned short table[24] = {
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12};

the table becomes

constexpr unsigned short table[192] = {
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 1
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 2
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 3
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 4
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 5
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 6
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12,   // sector 7
    0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9, 9, 9, 12, 12, 12, 12, 12};  // sector 8

Now I can use a 1x192 block.
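For the record, a minimal sketch of what the kernel’s load might look like with the 1x192 block and the expanded 192-entry constant array (assuming d_const_of now holds 192 entries; this is not the exact code):

// launched as r1<<<1, 192>>>(var, d_d, d_r);
const unsigned int tid = threadIdx.x;                   // 0..191, matches warp order
smem[tid] = r2<float>(var, in[tid], d_const_of[tid]);   // no source_id, no modulo
__syncthreads();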