Warp OR reductions


I am following:


on efficient warp reductions for the Kepler architecture, and it says that the Kepler implementation of the shuffle instruction supports only 32-bit data types. I am trying to implement OR warp reductions on 64-bit variables. Replacing “double” with “unsigned long long” in the code on that page that shuffles double variables seems buggy. What is the right way to implement shuffling for unsigned long long variables?


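As of CUDA 9/9.1, the warp shuffle intrinsics natively support 64-bit types, so an unsigned long long value can be passed to them directly without splitting it into two 32-bit halves.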
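A minimal sketch of a 64-bit OR warp reduction, assuming CUDA 9 or later and that all 32 lanes of the warp are active (hence the full mask). The names `warp_or_reduce`, `or_kernel` and `run_warp_or_demo` are made up for illustration and are not from the page referenced in the question. It uses the butterfly form with `__shfl_xor_sync`, so every lane ends up with the full result; `__shfl_down_sync` would work as well if only lane 0 needs it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Butterfly (XOR) reduction: after the loop, every lane holds the
// bitwise OR of the values contributed by all 32 lanes.
__device__ unsigned long long warp_or_reduce(unsigned long long v)
{
    const unsigned full_mask = 0xffffffffu;  // assumption: all 32 lanes participate
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        // The unsigned long long overload moves all 64 bits.
        v |= __shfl_xor_sync(full_mask, v, offset);
    }
    return v;
}

// Demo kernel (hypothetical): lane i sets bit 32 + i, so the upper
// 32 bits are exercised and a 32-bit-only shuffle would lose them.
__global__ void or_kernel(unsigned long long *out)
{
    unsigned long long mine = 1ULL << (32 + threadIdx.x);
    unsigned long long reduced = warp_or_reduce(mine);
    if (threadIdx.x == 0) {
        *out = reduced;
    }
}

// Host helper (hypothetical): launches a single warp and returns the result.
// Error checking is omitted to keep the sketch short.
unsigned long long run_warp_or_demo()
{
    unsigned long long *d_out = nullptr;
    unsigned long long h_out = 0;
    cudaMalloc(&d_out, sizeof(h_out));
    or_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("%016llx\n", run_warp_or_demo());
    return 0;
}
```

If only part of the warp is active, the mask has to name exactly the participating lanes (for example, the value returned by `__activemask()` at a converged point), otherwise the result of the `_sync` shuffle is undefined.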