I have an odd question: What are some good ways to perform block reductions on a matrix or array?
I initially went with __shfl_down_sync(), however one of the devices I am targeting is Maxwell architecture, cuda 7.0. Shuffle primitives are not supported on hardware this old I think. One other option I considered was to use CUB library. Are there any other good suggestions for this? Thanks for your time
Hmm, this is odd, I remember trying to compile my custom shuffle functions and they did not work on Maxwell. I tried __shfl_down_sync() initially, then the deprecated __shfl_down(). I don’t remember exactly what the issue was, I will try again and report back. Thanks for taking the time to respond
I checked and my Cuda runtime is v7.0, and my compute capability is 5.3 (as you said Maxwell would be). When I try to compile my shuffle functions, however, I get this error:
error: identifier “__shfl_down_sync” is undefined
Is this a matter of simply updating my cuda version? Thanks for your time
You need to identify the correct architecture you are compiling for. A complete example of what you are doing really helps rather than asking these general questions.
CUDA 7.0 targets cc2.0 by default. If you don’t target a specific architecture, you will get that error message because the default architecture does not support warp shuffle.
I didn’t have any trouble with it on CUDA 7.0 when compiling for sm_53:
$ cat t962.cu
#include <stdio.h>
__global__ void k(){
int a = threadIdx.x;
int b = __shfl_down(a, 2);
printf("b = %d\n", b);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -arch=sm_53 -o t962
$ /usr/local/cuda-7.0/bin/nvcc t962.cu -o t962
t962.cu(6): error: identifier "__shfl_down" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000b1be_00000000-9_t962.cpp1.ii".
$
Understood, apologies about that lack of example code - Currently, I am not compiling for a specific architecture. Because I am targeting multiple architectures, I kept my makefiles the same. I guess for Cuda 9.0, the default architecture supports warp shuffle, but for cuda 7.0 as you say, it doesn’t. I will target the specific architecture now. Thanks very much for your help, I appreciate your patience