I am estimating my CUDA algorithm's theoretical throughput. After disassembly, I see that my algorithm includes many 64-bit integer instructions: add, xor, shift. From the programming guide I can get 32-bit instruction throughputs like:
32-bit integer add, extended-precision add, subtract, extended-precision subtract: 128 per clock per SM.
32-bit bitwise AND, OR, XOR: 128 per clock per SM.
32-bit shift: 64 per clock per SM.
But I cannot find any data about 64-bit integer add, xor, or shift throughput. Is it half the 32-bit throughput?
My GPU is a 1080 Ti, by the way.
There are no 64-bit integer instructions on the GPU. 64-bit integer operations are emulated using operations on smaller chunks of data, typically 32 bits. Addition, subtraction, and logical operations each comprise two 32-bit integer operations. The emulations for other 64-bit integer operations vary depending on GPU architecture and context (e.g. for shifts, whether the shift count is a compile-time constant or not). You can find out the specifics by disassembling the machine code for specific test cases with cuobjdump --dump-sass.
Thanks for your tips, very helpful. After running cuobjdump I can now see a lot of 32-bit integer instructions. How can I find out how many clocks each single instruction consumes? For example, my code has 2865 IADD instructions in the machine code. I need to know how many clocks each one consumes on a single CUDA core.
Are those static instruction counts (e.g. grepped from output of cuobjdump)? If so, you cannot tell anything about performance from them.
If these are dynamic instruction counts, i.e. instructions counted along the critical path of execution, then to first order you can divide the instruction counts by the throughputs you listed above to get cycles. This will not account for resource conflicts. What architecture is this? You have a fair number of 32-bit (?) IMADs, and those could be low throughput. ISETPs (integer comparisons) may or may not have throughput identical to IADD; you would want to check on that.
In general, I am not a fan of trying to estimate performance by instruction counting. That was adequate when processors were single-core with an in-order scalar pipeline, e.g. maybe up to the 80486 in Intel land, about 30 years ago.
It is much better to measure performance, in particular with a profiler like the CUDA profiler, which will point out the performance bottlenecks in the code.
Thanks for your suggestion