So I’ve looked around and can’t find a clear answer to this: what is the instruction throughput for 64-bit integer shifts and bitwise operations? Specifically, I’m using compute capability 2.0 hardware (GTX 580). The tables I found indicate that for 32-bit ints the shift throughput is 16 instructions per clock per multiprocessor, but they say nothing about 64-bit ints. Is it basically half that of 32-bit ints, i.e., 8 instructions per clock per multiprocessor?
A 64-bit integer shift is not a native operation. In a situation like this, you can write a simple kernel and look at the generated assembly with the cuobjdump tool from the CUDA 4.0 SDK.
In this case, a 64-bit integer shift compiles into two 32-bit shifts and one 32-bit add.
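To see roughly what that emulation looks like, here is a sketch in plain C of a 64-bit left shift by a compile-time constant k (0 < k < 32) built from 32-bit halves. This is an illustrative model only; the compiler’s actual instruction mix may differ from this source-level decomposition.

```c
#include <stdint.h>

/* Sketch: emulate (x << k) for a 64-bit x split into 32-bit halves,
   with a constant shift amount 0 < k < 32. The high half picks up the
   bits shifted out of the low half. Illustrative only; the actual
   machine code the CUDA compiler emits may be arranged differently. */
static uint64_t shl64_const(uint32_t lo, uint32_t hi, unsigned k)
{
    uint32_t new_hi = (hi << k) | (lo >> (32 - k)); /* carry bits up */
    uint32_t new_lo = lo << k;
    return ((uint64_t)new_hi << 32) | new_lo;
}
```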
As hamster143 points out, current hardware has no native 64-bit shift instructions; they are emulated with 32-bit instructions as efficiently as possible. How many 32-bit instructions are generated depends on several factors:
(1) Target architecture: sm_1x vs sm_2x
(2) 64-bit type: signed vs unsigned
(3) Shift amount: compile time constant vs variable
(3)(a) When shift amount is a compile time constant: < 32 vs >= 32
If I recall correctly, depending on the above factors the number of generated instructions ranges anywhere from 2 to 8. You can use cuobjdump to see how many machine instructions are generated for the particular flavor of 64-bit shift that occurs in your code.
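The variable-shift-amount case from the list above is the expensive one, because the emulation has to distinguish shift amounts below 32 from those of 32 and above. A C-level sketch for an unsigned 64-bit left shift (again a model of the decomposition, not the exact code the compiler emits):

```c
#include <stdint.h>

/* Sketch: emulate an unsigned 64-bit left shift by a variable amount
   n (0 <= n < 64) using only 32-bit operations. The n < 32 and
   n >= 32 paths differ, which is one reason a variable shift amount
   costs more instructions than a compile-time constant. */
static uint64_t shl64_var(uint64_t x, unsigned n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo, new_hi;

    if (n == 0) {                 /* avoid the undefined 32-bit shift by 32 */
        new_lo = lo;
        new_hi = hi;
    } else if (n < 32) {          /* bits move within and across halves */
        new_lo = lo << n;
        new_hi = (hi << n) | (lo >> (32 - n));
    } else {                      /* low half shifts entirely into the high half */
        new_lo = 0;
        new_hi = lo << (n - 32);
    }
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

On the GPU the branches would typically be replaced by predicated or select-style instructions rather than actual control flow.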
Bitwise operations (AND, OR, XOR) on 64-bit integers require only two 32-bit operations, since the two halves can be handled independently of each other.
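For example, a 64-bit AND is just a 32-bit AND on each half, with no interaction between them (sketch in C):

```c
#include <stdint.h>

/* Sketch: a 64-bit bitwise AND decomposes into two independent
   32-bit ANDs, one per half; the same holds for OR and XOR. */
static uint64_t and64(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a & (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) & (uint32_t)(b >> 32);
    return ((uint64_t)hi << 32) | lo;
}
```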
I see. Thanks for the info, will look at cuobjdump.