Implementing block reduction operations (with no warp shfl)

Hi all,

I have an odd question: What are some good ways to perform block reductions on a matrix or array?
I initially went with __shfl_down_sync(), however one of the devices I am targeting is Maxwell architecture, cuda 7.0. Shuffle primitives are not supported on hardware this old I think. One other option I considered was to use CUB library. Are there any other good suggestions for this? Thanks for your time

shuffle operations are supported on Maxwell. Maxwell is compute capability 5.x

“cuda 7.0” refers to a particular software release. Best not to confuse that or conflate that with compute capability or GPU architecture

shuffle operations appeared in compute capability 3.0 and are supported on all architectures of compute capability 3.0 and higher.

The cub library is a good choice.

If you want to roll your own reduction, there is a cuda sample code outlining it and also this presentation:

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Hmm, this is odd, I remember trying to compile my custom shuffle functions and they did not work on Maxwell. I tried __shfl_down_sync() initially, then the deprecated __shfl_down(). I don’t remember exactly what the issue was, I will try again and report back. Thanks for taking the time to respond

Hi @Robert_Crovella,

I checked and my Cuda runtime is v7.0, and my compute capability is 5.3 (as you said Maxwell would be). When I try to compile my shuffle functions, however, I get this error:

error: identifier “__shfl_down_sync” is undefined

Is this a matter of simply updating my cuda version? Thanks for your time

Yes. the _sync versions of shuffle were not made available until CUDA 9.x

You should be able to use the non-sync version on CUDA 7.0:

__shfl_down

I tried the older version, __shfl_down and get the same error as mentioned above:
error: identifier “__shfl_down” is undefined

Could there be any other cause of this? Do I have to include a special header or something for older versions of cuda runtime API?

You need to identify the correct architecture you are compiling for. A complete example of what you are doing really helps rather than asking these general questions.

CUDA 7.0 targets cc2.0 by default. If you don’t target a specific architecture, you will get that error message because the default architecture does not support warp shuffle.

I didn’t have any trouble with it on CUDA 7.0 when compiling for sm_53:

$ cat t962.cu
#include <stdio.h>

__global__ void k(){

  int a = threadIdx.x;
  int b = __shfl_down(a, 2);
  printf("b = %d\n", b);
}

int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -arch=sm_53 -o t962
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -o t962
t962.cu(6): error: identifier "__shfl_down" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000b1be_00000000-9_t962.cpp1.ii".
$

Understood, apologies about that lack of example code - Currently, I am not compiling for a specific architecture. Because I am targeting multiple architectures, I kept my makefiles the same. I guess for Cuda 9.0, the default architecture supports warp shuffle, but for cuda 7.0 as you say, it doesn’t. I will target the specific architecture now. Thanks very much for your help, I appreciate your patience

Yes, CUDA 9.0 default target is cc3.0, which supports warp shuffle