Just to double check, am I right in thinking that the shfl reduction described in
B.14.5.3 of the programming-guide will lose carry bits? That is, the example
assumes the sum will fit into 32 bits?
Is there anything more elegant than copying the two halves of each 64-bit number, doing the addition
in thread 0, and copying the two halves back again?
The SHFL examples in the PTX manual can be extended to 64 bits by SHFL’ing the lo and hi words separately and then performing an add with carry.
It should total ~20 instructions in SASS and more in PTX because you’ll probably need to write pack/unpack glue. Basic shuffle scans and reductions work out to about 10 ops * number_of_words_in_type.
The PTX looks like this:
Dumping the SASS confirms it’s 20 ops:
Gist is here. Code is untested.
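For anyone reading without the gist, here is a rough CUDA C++ sketch of the same idea: shuffle the lo and hi words separately, then add with carry. It is not the code from the gist. The names `warp_reduce_sum_u64`, `warp_sum_kernel` and `run_warp_sum_demo` are made up for illustration, and it assumes CUDA 9 or later (for `__shfl_down_sync`) and a full warp of 32 active lanes.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: warp-wide 64-bit sum using two 32-bit shuffles per step.
// Assumes all 32 lanes of the warp are active. Lane 0 ends up with the total.
__device__ uint64_t warp_reduce_sum_u64(uint64_t v)
{
    const unsigned full_mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1) {
        // Unpack: one shuffle moves a single 32-bit register.
        uint32_t lo = static_cast<uint32_t>(v);
        uint32_t hi = static_cast<uint32_t>(v >> 32);
        uint32_t other_lo = __shfl_down_sync(full_mask, lo, offset);
        uint32_t other_hi = __shfl_down_sync(full_mask, hi, offset);
        // Add with carry: the low-word add wrapped if the result is smaller
        // than one of its operands.
        uint32_t sum_lo = lo + other_lo;
        uint32_t carry  = (sum_lo < lo) ? 1u : 0u;
        uint32_t sum_hi = hi + other_hi + carry;
        // Repack the two words into one 64-bit value.
        v = (static_cast<uint64_t>(sum_hi) << 32) | sum_lo;
    }
    return v;
}

// Every lane contributes 0xFFFFFFFF, so a plain 32-bit sum would overflow.
__global__ void warp_sum_kernel(uint64_t* out)
{
    uint64_t total = warp_reduce_sum_u64(0xffffffffull);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

// Host-side driver (error checking omitted to keep the sketch short).
uint64_t run_warp_sum_demo()
{
    uint64_t* d_out = nullptr;
    uint64_t h_out = 0;
    cudaMalloc(&d_out, sizeof(uint64_t));
    warp_sum_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With 32 lanes each contributing 0xFFFFFFFF, `run_warp_sum_demo()` should return 0x1FFFFFFFE0, which needs 37 bits, so the carry path is exercised. Recent CUDA toolkits also have 64-bit overloads of `__shfl_down_sync`, so the manual split is mostly of interest when you want to see or control what happens at the word level.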
I will give it a go.
I added a second implementation that uses the higher level “add.s64” PTX instruction and is more succinct:
The SASS output appears to be identical.
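As a rough illustration of that second approach (again not the gist code), the sketch below still shuffles the two 32-bit words but repacks them and lets a single 64-bit add handle the carry, which the compiler is free to lower to a 64-bit add in PTX. The names `warp_reduce_sum_s64`, `warp_sum_s64_kernel` and `run_warp_sum_s64_demo` are hypothetical, and the same assumptions apply: CUDA 9 or later and 32 active lanes.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: warp-wide signed 64-bit sum.
// Shuffles lo/hi words, repacks them, then uses one 64-bit add.
// Assumes all 32 lanes are active. Lane 0 ends up with the total.
__device__ long long warp_reduce_sum_s64(long long v)
{
    const unsigned full_mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1) {
        unsigned long long bits = static_cast<unsigned long long>(v);
        unsigned lo = static_cast<unsigned>(bits);
        unsigned hi = static_cast<unsigned>(bits >> 32);
        lo = __shfl_down_sync(full_mask, lo, offset);
        hi = __shfl_down_sync(full_mask, hi, offset);
        // Repack the neighbour's words and add as one 64-bit quantity,
        // so the carry between words is handled by the add itself.
        unsigned long long other =
            (static_cast<unsigned long long>(hi) << 32) | lo;
        v += static_cast<long long>(other);
    }
    return v;
}

// Every lane contributes -3000000000, which does not fit in 32 bits.
__global__ void warp_sum_s64_kernel(long long* out)
{
    long long total = warp_reduce_sum_s64(-3000000000LL);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

// Host-side driver (error checking omitted to keep the sketch short).
long long run_warp_sum_s64_demo()
{
    long long* d_out = nullptr;
    long long h_out = 0;
    cudaMalloc(&d_out, sizeof(long long));
    warp_sum_s64_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Here `run_warp_sum_s64_demo()` should return -96000000000 (32 lanes times -3000000000). I have not compared the SASS for this sketch against the explicit add-with-carry form.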