Implementing block reduction operations (with no warp shfl)

a.panella.aus · November 19, 2018, 9:22pm

Hi all,

I have an odd question: What are some good ways to perform block reductions on a matrix or array?
I initially went with __shfl_down_sync(), however one of the devices I am targeting is Maxwell architecture, cuda 7.0. Shuffle primitives are not supported on hardware this old I think. One other option I considered was to use CUB library. Are there any other good suggestions for this? Thanks for your time

Robert_Crovella · November 19, 2018, 9:26pm

shuffle operations are supported on Maxwell. Maxwell is compute capability 5.x

“cuda 7.0” refers to a particular software release. Best not to confuse that or conflate that with compute capability or GPU architecture

shuffle operations appeared in compute capability 3.0 and are supported on all architectures of compute capability 3.0 and higher.

The cub library is a good choice.

If you want to roll your own reduction, there is a cuda sample code outlining it and also this presentation:

[url]https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf[/url]

a.panella.aus · November 19, 2018, 9:53pm

Hmm, this is odd, I remember trying to compile my custom shuffle functions and they did not work on Maxwell. I tried __shfl_down_sync() initially, then the deprecated __shfl_down(). I don’t remember exactly what the issue was, I will try again and report back. Thanks for taking the time to respond

a.panella.aus · November 19, 2018, 10:37pm

Hi @Robert_Crovella,

I checked and my Cuda runtime is v7.0, and my compute capability is 5.3 (as you said Maxwell would be). When I try to compile my shuffle functions, however, I get this error:

error: identifier “__shfl_down_sync” is undefined

Is this a matter of simply updating my cuda version? Thanks for your time

Robert_Crovella · November 20, 2018, 1:35am

Yes. the _sync versions of shuffle were not made available until CUDA 9.x

You should be able to use the non-sync version on CUDA 7.0:

__shfl_down

a.panella.aus · November 20, 2018, 1:44am

I tried the older version, __shfl_down and get the same error as mentioned above:
error: identifier “__shfl_down” is undefined

Could there be any other cause of this? Do I have to include a special header or something for older versions of cuda runtime API?

Robert_Crovella · November 20, 2018, 1:53am

You need to identify the correct architecture you are compiling for. A complete example of what you are doing really helps rather than asking these general questions.

CUDA 7.0 targets cc2.0 by default. If you don’t target a specific architecture, you will get that error message because the default architecture does not support warp shuffle.

I didn’t have any trouble with it on CUDA 7.0 when compiling for sm_53:

$ cat t962.cu
#include <stdio.h>

__global__ void k(){

  int a = threadIdx.x;
  int b = __shfl_down(a, 2);
  printf("b = %d\n", b);
}

int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -arch=sm_53 -o t962
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -o t962
t962.cu(6): error: identifier "__shfl_down" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000b1be_00000000-9_t962.cpp1.ii".
$

a.panella.aus · November 20, 2018, 3:38am

Robert_Crovella:

You need to identify the correct architecture you are compiling for. A complete example of what you are doing really helps rather than asking these general questions.

CUDA 7.0 targets cc2.0 by default. If you don’t target a specific architecture, you will get that error message because the default architecture does not support warp shuffle.

I didn’t have any trouble with it on CUDA 7.0 when compiling for sm_53:
$ cat t962.cu
#include <stdio.h>

__global__ void k(){

  int a = threadIdx.x;
  int b = __shfl_down(a, 2);
  printf("b = %d\n", b);
}

int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -arch=sm_53 -o t962
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -o t962
t962.cu(6): error: identifier "__shfl_down" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000b1be_00000000-9_t962.cpp1.ii".
$

Understood, apologies about that lack of example code - Currently, I am not compiling for a specific architecture. Because I am targeting multiple architectures, I kept my makefiles the same. I guess for Cuda 9.0, the default architecture supports warp shuffle, but for cuda 7.0 as you say, it doesn’t. I will target the specific architecture now. Thanks very much for your help, I appreciate your patience

Robert_Crovella · November 20, 2018, 4:48am

Yes, CUDA 9.0 default target is cc3.0, which supports warp shuffle

Topic		Replies	Views
matrixMulDrv.cpp undefined references CUDA Programming and Performance	12	5356	October 23, 2010
cudaThreadSynchronize() stalls application CUDA Programming and Performance	10	10988	November 17, 2009
Strange crashes in __device__ function CUDA Programming and Performance	4	1051	August 1, 2016
Undefined reference to library CUDA Programming and Performance	4	4449	November 18, 2007
unable to include cub CUDA Programming and Performance	5	5672	February 2, 2023
undefined reference to `cudaSetupArgument', `cudaLaunch' CUDA Programming and Performance	9	5924	November 12, 2019
copy from 1D array to shared memory matrix in cuda CUDA Programming and Performance	7	2054	June 9, 2015
CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More Technical Blog	43	1417	August 8, 2016
Help with cuda 2d array CUDA Programming and Performance	6	7446	September 29, 2014
GeForce 335M + Visual Studio 2012 CUDA Setup and Installation	7	1402	November 30, 2015

Implementing block reduction operations (with no warp shfl)

Related topics