So I’ve looked around and can’t find a clear answer to this: what is the instruction throughput for 64-bit integer shifts and bitwise operations? Specifically, I’m using compute capability 2.0 hardware (GTX 580). The tables I found indicate that for 32-bit ints the shift throughput is 16 instructions per clock per multiprocessor, but they say nothing about 64-bit ints. Is it basically half that of 32-bit ints, i.e., 8 instructions per clock per multiprocessor?
A 64-bit integer shift is not a native operation. In a situation like this, you can write a simple kernel and look at the generated assembly with the cuobjdump tool from the CUDA 4.0 SDK.
In this case, a 64-bit integer shift compiles into two 32-bit shifts and one 32-bit add.
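To see roughly what that emulation looks like, here is a sketch in plain C of a 64-bit left shift by a compile-time constant k (0 < k < 32) built from 32-bit halves. This is an illustrative model only; the compiler’s actual instruction mix may differ from this source-level decomposition.

```c
#include <stdint.h>

/* Sketch: emulate (x << k) for a 64-bit x split into 32-bit halves,
   with a constant shift amount 0 < k < 32. The high half picks up the
   bits shifted out of the low half. Illustrative only; the actual
   machine code the CUDA compiler emits may be arranged differently. */
static uint64_t shl64_const(uint32_t lo, uint32_t hi, unsigned k)
{
    uint32_t new_hi = (hi << k) | (lo >> (32 - k)); /* carry bits up */
    uint32_t new_lo = lo << k;
    return ((uint64_t)new_hi << 32) | new_lo;
}
```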
As hamster143 points out, current hardware has no native 64-bit shift instructions; they are emulated with 32-bit instructions as efficiently as possible. How many 32-bit instructions are generated depends on several factors:
(1) Target architecture: sm_1x vs sm_2x
(2) 64-bit type: signed vs unsigned
(3) Shift amount: compile time constant vs variable
(3)(a) When shift amount is a compile time constant: < 32 vs >= 32
If I recall correctly, depending on the above factors the number of generated instructions ranges anywhere from 2 to 8. You can use cuobjdump to see how many machine instructions are generated for the particular flavor of 64-bit shift that occurs in your code.
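The variable-shift-amount case from the list above is the expensive one, because the emulation has to distinguish shift amounts below 32 from those of 32 and above. A C-level sketch for an unsigned 64-bit left shift (again a model of the decomposition, not the exact code the compiler emits):

```c
#include <stdint.h>

/* Sketch: emulate an unsigned 64-bit left shift by a variable amount
   n (0 <= n < 64) using only 32-bit operations. The n < 32 and
   n >= 32 paths differ, which is one reason a variable shift amount
   costs more instructions than a compile-time constant. */
static uint64_t shl64_var(uint64_t x, unsigned n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo, new_hi;

    if (n == 0) {                 /* avoid the undefined 32-bit shift by 32 */
        new_lo = lo;
        new_hi = hi;
    } else if (n < 32) {          /* bits move within and across halves */
        new_lo = lo << n;
        new_hi = (hi << n) | (lo >> (32 - n));
    } else {                      /* low half shifts entirely into the high half */
        new_lo = 0;
        new_hi = lo << (n - 32);
    }
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

On the GPU the branches would typically be replaced by predicated or select-style instructions rather than actual control flow.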
Bitwise operations (AND, OR, XOR) on 64-bit integers require only two 32-bit operations, since the two halves can be handled independently of each other.
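For example, a 64-bit AND is just a 32-bit AND on each half, with no interaction between them (sketch in C):

```c
#include <stdint.h>

/* Sketch: a 64-bit bitwise AND decomposes into two independent
   32-bit ANDs, one per half; the same holds for OR and XOR. */
static uint64_t and64(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a & (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) & (uint32_t)(b >> 32);
    return ((uint64_t)hi << 32) | lo;
}
```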
I see. Thanks for the info, will look at cuobjdump.