estimate 64bit integer instruction throughput

hijohnny5 · September 28, 2018, 9:27am

Hi
I am estimating the theoretical throughput of my CUDA algorithm. After disassembly, I see that it contains many 64-bit integer instructions (add, xor, shift). From the programming guide, I can get 32-bit instruction throughputs such as:
32-bit integer add, extended-precision add, subtract, extended-precision subtract: 128 per clock per sm.
32-bit bitwise AND, OR, XOR: 128 per clock per sm.
32-bit shift instruction throughput: 64 per clock per sm.

But I cannot find any data on 64-bit integer add, xor, and shift throughput. Is it half the throughput of the 32-bit instructions?

My GPU is a 1080 Ti, by the way.

njuffa · September 28, 2018, 4:28pm

There are no 64-bit integer instructions on the GPU. 64-bit integer operations are emulated using operations on smaller chunks of data, often 32 bit. Addition, subtraction, and logical operations comprise two 32-bit integer operations. The emulations for other 64-bit integer operations vary depending on GPU architecture and context (e.g. in case of shifts whether the shift count is compile-time constant or not). You can find out the specifics by disassembling the machine code for specific test cases with cuobjdump --dump-sass.
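To make the "two 32-bit integer operations" point concrete, here is a minimal host-side C++ sketch, not actual GPU code. The names `U64Parts`, `split`, `join`, `add64`, `xor64`, and `check_against_native` are hypothetical and exist only for this illustration. The carry is recomputed here with a comparison; on the GPU it is presumably propagated in hardware by the extended-precision add mentioned in the question, so the instruction count of this sketch should not be read as the GPU's.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type: one 64-bit integer held as two 32-bit halves.
struct U64Parts {
    std::uint32_t lo;
    std::uint32_t hi;
};

U64Parts split(std::uint64_t v) {
    return { static_cast<std::uint32_t>(v),
             static_cast<std::uint32_t>(v >> 32) };
}

std::uint64_t join(U64Parts p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add as two 32-bit adds: the carry out of the low half
// is fed into the add of the high half.
U64Parts add64(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo + b.lo;                             // low add, wraps modulo 2^32
    std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // a wrapped result means carry out
    r.hi = a.hi + b.hi + carry;                     // high add with carry in
    return r;
}

// 64-bit XOR as two independent 32-bit XORs (no carry involved).
U64Parts xor64(U64Parts a, U64Parts b) {
    return { a.lo ^ b.lo, a.hi ^ b.hi };
}

// Compares the two-halves emulation against the native 64-bit operators.
void check_against_native(std::uint64_t a, std::uint64_t b) {
    assert(join(add64(split(a), split(b))) == a + b);
    assert(join(xor64(split(a), split(b))) == (a ^ b));
}
```

Calling `check_against_native(0x00000000FFFFFFFFULL, 1ULL)` exercises the carry from the low half into the high half. Shifts are deliberately left out, since their emulation depends on the architecture and on whether the shift count is a compile-time constant.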

hijohnny5 · September 29, 2018, 2:27am

Hi, njuffa
Thanks for your tips, very helpful. Now I get a lot of 32-bit integer instructions from cuobjdump. How can I find out how many clock cycles each instruction consumes? For example, as listed below, my code has 2865 IADD instructions. I need to know how many clocks each instruction consumes on a single CUDA core.

758 SHR
1046 SHL
2865 IADD
793 IMAD
1218 XOR
399 ISETP

njuffa · September 29, 2018, 2:59am

Are those static instruction counts (e.g. grepped from output of cuobjdump)? If so, you cannot tell anything about performance from them.

If these are dynamic instruction counts, i.e. instructions counted along the critical path of execution, then to first order, divide the instruction counts by the throughputs you listed above to get cycles. This will not account for resource conflicts. What architecture is this? You have a fair number of 32-bit (?) IMADs, and those could be low throughput. ISETPs (integer comparisons) may or may not have throughput identical to IADD; you would want to check on that.
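The first-order division above can be sketched in host-side C++ as follows. The names `OpClass` and `estimate_cycles` are hypothetical. The sketch assumes the counts are dynamic and are expressed in the same units as the throughputs (operations executed on one SM), which the thread does not establish. IMAD and ISETP are left for the caller to add, since their throughputs are not given here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: one instruction class, its dynamic count,
// and its throughput in operations per clock per SM.
struct OpClass {
    const char* name;
    double count;
    double ops_per_clock_per_sm;
};

// First-order estimate: sum of count / throughput over all classes.
// Ignores resource conflicts, latency, and memory traffic.
double estimate_cycles(const std::vector<OpClass>& ops) {
    double cycles = 0.0;
    for (const OpClass& op : ops) {
        assert(op.ops_per_clock_per_sm > 0.0);  // throughput must be known and positive
        cycles += op.count / op.ops_per_clock_per_sm;
    }
    return cycles;
}
```

With the counts and throughputs quoted in this thread (IADD 2865 and XOR 1218 at 128 per clock, SHL 1046 and SHR 758 at 64 per clock), `estimate_cycles` returns about 60.1 cycles. That figure is only meaningful if the counts really are dynamic, and it omits the IMAD and ISETP instructions.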

In general, I am not a friend of trying to estimate performance based on instruction counting. That was adequate when processors were single core with an in-order scalar pipeline, e.g. maybe up to the 80486 in Intel land, about 30 years ago.

It is much better to measure performance, in particular with a profiler like the CUDA profiler which will point out performance bottlenecks in the code to the user.

hijohnny5 · September 29, 2018, 7:03am

Thanks for your suggestion
