Are add, subtract, multiply, and divide operations equivalent in performance regardless of the number of bits?

joojamaranto · July 21, 2023, 6:59pm

In CUDA, are operations like add, subtract, multiply, and divide equivalent in performance regardless of the number of bits? Let’s say I have two sets of instructions:

(The instructions I wrote may not reflect real PTX. If that’s the case, call it “Pseudo-PTX”.)
First set:
(In this case, both source operands are 32 bits wide, and all of their bits are set to 1)
add.cc.u32 a(dest) b(value1) c(value2);
add.cc.u32 a(dest) b(value1) c(value2);

Second set:
(In this case, both source operands are 64 bits wide, and all of their bits are set to 1)
add.cc.u64 a(dest) b(value1) c(value2);

What would be the result in the end? Would the first set execute faster than the second set on average? Would the second set execute faster than the first? Or would both have the same performance on average?

njuffa · July 21, 2023, 8:03pm

The title and the body of the question seem to be asking about two different things.

GPUs have a 32-bit architecture with extensions for 64-bit addressing. 64-bit integer operations are therefore emulated using multiple instructions and are more expensive than 32-bit integer operations. The difference in performance will be more pronounced the more complex the operation. You can easily measure this for yourself (and I highly recommend that you do), but to give an idea, the difference will be roughly a factor of 2x for add and subtract, 3x for multiply, and 4x to 5x for divide.
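
To illustrate where the rough factor of 2x for addition comes from, here is a minimal sketch in plain C++ of a 64-bit add built from two 32-bit adds and a carry. It is only a model of the idea, not the SASS the compiler emits; the names `U64Pair`, `add64_via_32`, `to_u64`, and `from_u64` are made up for this example.

```cpp
#include <cstdint>

// A 64-bit value held as two 32-bit halves, the way a 32-bit datapath sees it.
// (Hypothetical helper type for this sketch.)
struct U64Pair {
    std::uint32_t lo;  // bits 0..31
    std::uint32_t hi;  // bits 32..63
};

// Add two 64-bit values using only 32-bit additions.
U64Pair add64_via_32(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                             // low halves; wraps modulo 2^32
    std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // a wrapped sum means carry-out
    r.hi = a.hi + b.hi + carry;                     // high halves plus carry-in
    return r;
}

// Conversion helpers, used only to check the result against a native 64-bit add.
std::uint64_t to_u64(U64Pair p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

U64Pair from_u64(std::uint64_t v) {
    return { static_cast<std::uint32_t>(v), static_cast<std::uint32_t>(v >> 32) };
}
```

The low-half add plays the role of an add that produces a carry, and the high-half add plays the role of an add that consumes it, so one 64-bit addition costs about two 32-bit additions.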

For the specific case of add.cc.u32; add.cc.u32 versus add.cc.u64 you will likely find that the SASS (machine code) generated is identical, i.e. a sequence of two 32-bit additions. Since PTX is compiled into SASS by an optimizing compiler (ptxas), you would always want to look at SASS when doing performance work. cuobjdump --dump-sass will provide disassembly.

If the context of the question is how one should go about constructing integer arithmetic operations wider than CUDA’s built-in types: These should always be constructed from PTX-level operations that map directly to machine instructions. Generally, these are 32-bit instructions. However, there are some older GPU architectures where even 32-bit multiplies are emulated, and on these one would want to use 16x16->32 multiplies as building blocks for N-bit multiplies and divides (N > 64).
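
As a sketch of what building wider arithmetic from 32-bit pieces looks like, the following plain C++ adds two 128-bit numbers stored as four 32-bit limbs, propagating the carry from limb to limb. The type `U128` and the function `add128` are hypothetical names for this example; a real CUDA implementation would more likely express the carry chain with PTX carry-in/carry-out adds rather than explicit comparisons.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 128-bit unsigned integer: four 32-bit limbs, least significant first.
using U128 = std::array<std::uint32_t, 4>;

// Add two 128-bit values using only 32-bit additions and comparisons.
// The carry out of the top limb is discarded (result is modulo 2^128).
U128 add128(const U128& a, const U128& b) {
    U128 r{};
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < r.size(); ++i) {
        std::uint32_t s  = a[i] + b[i];           // limb sum, wraps modulo 2^32
        std::uint32_t c1 = (s < a[i]) ? 1u : 0u;  // carry from adding the limbs
        std::uint32_t t  = s + carry;             // add carry from the previous limb
        std::uint32_t c2 = (t < s) ? 1u : 0u;     // carry from adding the carry-in
        r[i]  = t;
        carry = c1 | c2;                          // at most one of c1, c2 can be set
    }
    return r;
}
```

The same pattern extends to any number of limbs: each extra 32 bits of width adds one more step to the carry chain.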

joojamaranto · July 21, 2023, 8:20pm

I understand. But now, suppose the GPU in this case is an A100, with the Ampere architecture. Would it be different? An A100 can handle 64-bit operations without “emulating” them, right? Sorry if this sounds like a weird question; I’m a beginner, by the way.

njuffa · July 21, 2023, 8:23pm

64-bit integer operations are emulated on the A100 as well. You do not have to rely on my saying so, just look at the generated machine code.

joojamaranto · July 21, 2023, 8:23pm

Alright Pal, thanks for your answer.
