In CUDA, are operations like add, sum, multiplication, and division equivalent in performance regardless of the number of bits? Let’s say, I have 2 sets of instructions:
(The instruction I wrote may not reflect real PTX, if that’s the case, call it “Pseudo-PTX”)
First set:
(In this case, A and B are 32 bits long, and all of their bits are flipped to 1)
add.cc.u32 a(dest) b(value1) c(value2);
add.cc.u32 a(dest) b(value1) c(value2);
Second set:
(In this case, A and B are 64 bits long, and all of their bits are flipped to 1)
add.cc.u64 a(dest) b(value1) c(value2);
What would be the result in the end? Would the first set execute faster than the second set (on average)? Would the second set execute faster than the first set (on average)? Or would both have the same performance (on average)?
The title and the body of the question seem to be asking about two different things.
GPUs have a 32-bit architecture with extensions for 64-bit addressing. 64-bit integer operations are therefore emulated using multiple instructions and are more expensive than 32-bit integer operations. The difference in performance becomes more pronounced the more complex the operation. You can easily measure this for yourself (and I highly recommend that you do), but to give an idea, the difference will be roughly a factor of 2x for add and subtract, 3x for multiply, and 4x to 5x for divide.
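For anyone who wants to run that measurement, here is a minimal sketch of a micro-benchmark. It times a dependent multiply-add chain (not a pure add, since a loop of plain adds with a constant is easy for the compiler to fold away) for uint32_t and uint64_t using CUDA events. The kernel name chain, the helper time_ms, and the launch sizes are my own choices for illustration, and error checking is omitted for brevity. The ratio you see will depend on the GPU and the compiler version.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Dependent multiply-add chain. The loop-carried dependence and the
// runtime kernel arguments keep the compiler from collapsing the loop.
template <typename T>
__global__ void chain(T *out, T a, T b, int iters)
{
    T acc = static_cast<T>(blockIdx.x * blockDim.x + threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        acc = acc * a + b;
    }
    // Store the result so the computation is not eliminated as dead code.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Returns the elapsed time in milliseconds of one timed launch
// (after one untimed warm-up launch). No error checking: sketch only.
template <typename T>
float time_ms(int blocks, int threads, int iters)
{
    T *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(T) * blocks * threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    chain<T><<<blocks, threads>>>(d_out, T(3), T(7), iters);  // warm-up
    cudaEventRecord(start);
    chain<T><<<blocks, threads>>>(d_out, T(3), T(7), iters);  // timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float t32 = time_ms<uint32_t>(blocks, threads, iters);
    float t64 = time_ms<uint64_t>(blocks, threads, iters);
    printf("32-bit: %.3f ms, 64-bit: %.3f ms, ratio: %.2f\n",
           t32, t64, t64 / t32);
    return 0;
}
```

To measure a different operation, change the loop body (for example to a division by a runtime value) and check the SASS to confirm the loop survived optimization.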
For the specific case of add.cc.u32; add.cc.u32 versus add.cc.u64 you will likely find that the SASS (machine code) generated is identical, i.e. a sequence of two 32-bit additions. Since PTX is compiled into SASS by an optimizing compiler (ptxas), you would always want to look at SASS when doing performance work. cuobjdump --dump-sass will provide disassembly.
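A small sketch for checking this yourself: two kernels that perform the same 64-bit addition, one on the built-in 64-bit type and one as a 32-bit carry chain written in inline PTX. Note that in real PTX the second instruction of the pair is addc (add with carry-in), not a second add.cc. The kernel names, the file name add.cu, and the choice of sm_80 in the comments are assumptions for illustration.

```cuda
#include <cstdint>

// Build and disassemble (file name and architecture are examples):
//   nvcc -arch=sm_80 -cubin -o add.cubin add.cu
//   cuobjdump --dump-sass add.cubin

// 64-bit add written directly on the built-in 64-bit type.
__global__ void add_u64(const uint64_t *a, const uint64_t *b, uint64_t *r)
{
    r[0] = a[0] + b[0];
}

// The same addition built from two 32-bit halves (low word first).
// add.cc sets the carry flag, addc consumes it. Both instructions live
// in one asm statement so nothing can be scheduled between them.
__global__ void add_2x_u32(const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    uint32_t lo, hi;
    asm("add.cc.u32 %0, %2, %3;\n\t"
        "addc.u32   %1, %4, %5;"
        : "=r"(lo), "=r"(hi)
        : "r"(a[0]), "r"(b[0]), "r"(a[1]), "r"(b[1]));
    r[0] = lo;
    r[1] = hi;
}
```

Comparing the arithmetic instructions in the two disassembled kernels shows whether ptxas produced the same sequence for your target architecture.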
If the context of the question is how one should go about constructing integer arithmetic operations wider than CUDA’s built-in types: these should always be constructed from PTX-level operations that map directly to machine instructions. Generally, these are 32-bit instructions. However, there are some older GPU architectures where even 32-bit multiplies are emulated, and on these one would want to use 16x16->32 multiplies as building blocks for N-bit multiplies and divides (N > 64).
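As an illustration of building a wider operation from 32-bit PTX instructions, here is a sketch of a 128-bit unsigned addition. It stores the value in a uint4 as four 32-bit limbs with the least significant limb in .x, which is just a convention I picked for this example. The function and kernel names are hypothetical. The carry is propagated with add.cc, addc.cc, and addc from the PTX extended-precision instruction set.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// 128-bit unsigned add on four 32-bit limbs (x = least significant).
// add.cc writes the carry flag, addc.cc reads and rewrites it, and the
// final addc only reads it. The carry out of the top limb is discarded,
// so the result wraps modulo 2^128.
__device__ __forceinline__ uint4 add_u128(uint4 a, uint4 b)
{
    uint4 r;
    asm("add.cc.u32  %0, %4, %8;\n\t"
        "addc.cc.u32 %1, %5, %9;\n\t"
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;"
        : "=r"(r.x), "=r"(r.y), "=r"(r.z), "=r"(r.w)
        : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
          "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return r;
}

// Element-wise 128-bit add over n values, one value per thread.
__global__ void add_u128_kernel(const uint4 *a, const uint4 *b, uint4 *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r[i] = add_u128(a[i], b[i]);
    }
}
```

The same pattern extends to more limbs by adding further addc.cc lines. Wide multiplies can be built in a similar way from 32-bit multiply instructions, but that takes more code than fits a short example.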
I understand. But now, assuming the GPU in this case were an A100, with the Ampere architecture, would it be different? An A100 can handle 64-bit operations without “emulating” them, right? Sorry if this sounds like a weird question; I’m a beginner, by the way.