Is there a document describing which hardware unit (i.e. ALU, FMU, ...) executes each instruction?

SparkHu · September 9, 2022, 9:39am

In my application, the ALU is too busy and limits performance. How can I move some calculations from the ALU to other hardware units?
Most operations in my application are bitwise XOR, funnel shifts, and integer additions.
Any advice?

spraesi · September 9, 2022, 11:44am

Since SASS instructions are for the most part undocumented, there is no such document.
But if you have an instruction in mind, we can make an educated guess.

If it is possible to use another unit, the compiler will usually do it if it seems beneficial.
In some cases I have seen it use FP16 units to implement MOV (e.g. moving a number into a register by multiplying with zero and adding the value or constant it wants to move there).

What are the most common stall reasons in your code?

spraesi · September 9, 2022, 12:00pm

It also depends on compute capability.
I think these are all the ALU instructions for compute capability 8.6 (RTX 30 series):

CS2R,ICMP,ISCADD,ISCADD32I,IMNMX,BFE,BFI,SHR,SHL,ISET,ISETP,SHF,FCMP,FMNMX,FSET,FSETP,GETFPFLAGS,SETFPFLAGS,SEL,FSEL,P2R,R2P,CSET,CSETP,PSET,PSETP,LEPC,VOTE,LEA,PRMT,VMAD,VADD,VABSDIFF,VMNMX,VSET,VSHL,VSHR,VSETP,VABSDIFF4,IDE,IADD3,IADD,IADD32I,LOP,LOP32I,LOP3,XMAD,MOV,MOV32I,MOVM,PLOP3,SGXT,BMSK,IABS,RPCMOV,IMMA,I2I,I2IP,BMMA,SCATTER,SPMETADATA,F2FP,GATHER,GENMETADATA,F2IP,I2FP,BITEXTRACT

Then there are of course IMAD, IMUL, IMAD32I, IMUL32I, IDP and IDP4A. On CC 8.6 specifically, I think these few instructions compete with some FP32 instructions for the same hardware units.

If any of your integers can be made warp-uniform, you can offload some of the processing to the uniform datapath. Did you make sure to use signed integers only (where possible)? This can be faster.
Other than this, it can be hard to improve if your work is truly compute-bound. In that case, try to reduce unnecessary work/instructions as much as possible.

SparkHu · September 9, 2022, 12:03pm

Nsight compute reports:

ALU is the highest-utilized pipeline (98.9%)
Warp cycles per issued instruction (11.7)
Stall not selected (4.94)
Stall math pipe throttle (4.93)
Stall selected (1.0)
Stall wait (0.75)

As 98.9% of the ALU throughput is utilized, I think the only chance to optimize is to move some of the integer operations out of the ALU.

SparkHu · September 9, 2022, 12:11pm

How do I make integers warp-uniform? All my integers are unsigned 32-bit integers.

spraesi · September 9, 2022, 1:56pm

The best practices guide recommends using signed integers for loop counters. This may give you quite a drastic speed-up.

It is hard for me to know how to make them warp-uniform without the source code…

SparkHu · September 9, 2022, 2:05pm

Nearly all loops are unrolled because their trip counts are constant at compile time.
The only loop with a dynamic trip count is the outermost loop, which iterates at most 4 times.
Besides, the counter of the outermost loop is not used in any calculation inside the loop.
So I don’t think changing the loop counters from unsigned int to int will help.

spraesi · September 9, 2022, 2:19pm

Ahh okay. If it is a compile-time constant, I don’t think it makes a difference either. It only matters in the dynamic case, because unsigned overflow has defined wrap-around semantics that the compiler has to preserve.

The main options are 1. a more efficient algorithm if possible (reduce load), 2. move some work to uniform datapath (increase capacity).
Other than that I don’t really see a way to improve this. The code seems to execute efficiently; the only reason you reach the INT32 capacity is probably that this pipeline cannot issue an instruction every cycle (on CC 8.6 it can issue at most every second cycle).

Robert_Crovella · September 9, 2022, 3:32pm

this may be of interest for the original question in the thread title. It’s not an exhaustive mapping.

njuffa · September 9, 2022, 9:19pm

Funnel shifts tend to be slower / lower throughput than logical operations and simple arithmetic on most processors including GPUs. There are technical reasons for this. You would want to

(1) Look into reducing the number of funnel shifts
(2) Look into reducing the number of logical and arithmetic operations overall
(3) Look into tradeoffs between logical and arithmetic operations
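
As one illustration of item (3), a rotate (a funnel shift with both inputs equal) by a constant amount can be rewritten as a widening multiply by a power of two followed by an add. The host-side C++ below is only a sketch of the arithmetic identity; the function names are made up for the example, and whether this helps on a GPU depends on which pipe the multiply issues on for a given architecture, which would have to be checked in the generated SASS and measured.

```cpp
#include <cstdint>

// Reference rotate-left, i.e. what a funnel shift computes when both
// source registers hold the same value. Valid for 0 < k < 32.
inline uint32_t rotl_shift(uint32_t x, unsigned k) {
    return (x << k) | (x >> (32 - k));
}

// Same result via a 32x32 -> 64-bit multiply by 2^k (0 < k < 32).
// The low half of the product equals x << k and the high half equals
// x >> (32 - k). Their set bits never overlap, so adding the two halves
// gives the same result as OR-ing them.
inline uint32_t rotl_mul(uint32_t x, unsigned k) {
    const uint64_t wide = static_cast<uint64_t>(x) * (uint64_t{1} << k);
    return static_cast<uint32_t>(wide) + static_cast<uint32_t>(wide >> 32);
}
```

In PTX the final addition could in principle be folded into a multiply-add form such as mad.hi, but that is an assumption to verify rather than a guaranteed win, since the throughput of the multiply variants differs between architectures.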

Without specific code to look at, it is hard to tell how much potential speed-up there is to be had. With the advent of LOP3, many of the classical re-organizations of logical operations for optimization purposes (e.g. in crypto codes) have lost their importance or have become meaningless. However, careful examination of sequences of multiple LOP3 operations generated by the compiler may reveal that they are not optimal (in my experience, about 50% of the time), in which case you would likely want to code them by hand in PTX. While this is generally brittle, so far the CUDA compiler doesn’t appear to tease apart LOP3 operations specified at PTX level.
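
For readers unfamiliar with LOP3: it evaluates an arbitrary three-input bitwise function selected by an 8-bit lookup table. As described in the PTX documentation for lop3.b32, the table value can be derived by applying the desired function to the constants 0xF0, 0xCC and 0xAA. The host-side C++ below is an illustrative software model (the helper and constant names are not an NVIDIA API) showing how a single table covers what would otherwise take two or three logical operations.

```cpp
#include <cstdint>

// Software model of a three-input lookup-table operation: at every bit
// position, the bits of a, b and c form a 3-bit index (a is the most
// significant) into the 8-bit table.
inline uint32_t lop3_model(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        const unsigned idx = (((a >> i) & 1u) << 2) |
                             (((b >> i) & 1u) << 1) |
                             ((c >> i) & 1u);
        r |= static_cast<uint32_t>((lut >> idx) & 1u) << i;
    }
    return r;
}

// Table for f(a, b, c) = a ^ b ^ c, obtained by applying f to the three
// canonical constants. Evaluates to 0x96.
const uint8_t LUT_XOR3 = static_cast<uint8_t>(0xF0 ^ 0xCC ^ 0xAA);

// Table for the bitwise select f(a, b, c) = (a & b) | (~a & c).
// Evaluates to 0xCA.
const uint8_t LUT_SELECT = static_cast<uint8_t>((0xF0 & 0xCC) | (~0xF0 & 0xAA));
```

When inspecting compiler output, the same derivation can be run in reverse: decode each table value into its boolean function and check whether adjacent LOP3 operations could be merged.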

In addition to examining low-level instruction mapping, you would probably want to go back to the algorithmic level to see how you can optimize there (e.g. choose a bit-sliced implementation).
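
To make the bit-slicing idea concrete, here is a minimal host-side C++ sketch. It assumes 32 independent 4-bit values ("lanes") stored so that word i holds bit i of every lane; the type and function names are hypothetical. In this layout a rotate is just a renaming of words, while an addition turns into a ripple-carry chain of logical operations, so whether the trade pays off depends on the mix of operations in the real algorithm.

```cpp
#include <array>
#include <cstdint>

constexpr int kBits = 4;                      // width of each small integer
using Sliced = std::array<uint32_t, kBits>;   // s[i] = bit i of all 32 lanes

// Store a kBits-wide value into one lane (0..31).
inline void set_lane(Sliced& s, int lane, uint32_t value) {
    for (int i = 0; i < kBits; ++i) {
        s[i] = (s[i] & ~(1u << lane)) | (((value >> i) & 1u) << lane);
    }
}

// Read the kBits-wide value back out of one lane.
inline uint32_t get_lane(const Sliced& s, int lane) {
    uint32_t v = 0;
    for (int i = 0; i < kBits; ++i) v |= ((s[i] >> lane) & 1u) << i;
    return v;
}

// Rotate every lane left by k bits: only the word indices change.
inline Sliced sliced_rotl(const Sliced& a, int k) {
    Sliced r{};
    for (int i = 0; i < kBits; ++i) r[(i + k) % kBits] = a[i];
    return r;
}

// Add all 32 lanes at once (mod 2^kBits) with a ripple-carry full adder.
inline Sliced sliced_add(const Sliced& a, const Sliced& b) {
    Sliced sum{};
    uint32_t carry = 0;
    for (int i = 0; i < kBits; ++i) {
        sum[i] = a[i] ^ b[i] ^ carry;
        carry = (a[i] & b[i]) | (carry & (a[i] ^ b[i]));
    }
    return sum;
}
```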

SparkHu · September 27, 2022, 2:06am

Thank you. I moved some operations from the ALU to the FMU and the performance increased.

njuffa · September 27, 2022, 2:22am

That can be a good idea when (1) the range of integers that need to be handled is limited and (2) the cost of converting into and out of floating-point space doesn’t nullify the advantage of the higher throughput gained by utilizing the FP path.
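
The range condition in (1) can be sketched as follows. The thread does not say which operations were actually moved, so this host-side C++ example simply assumes an integer addition carried out in FP32; the helper name is hypothetical. A float has a 24-bit significand, so the result is exact only while the sum stays at or below 2^24.

```cpp
#include <cstdint>

// Largest range in which every integer is exactly representable as a float.
constexpr uint32_t kMaxExactFloatInt = 1u << 24;  // 16777216

// Adds two unsigned integers on the floating-point path.
// Exact only when a + b <= kMaxExactFloatInt; beyond that the sum is rounded.
inline uint32_t add_via_fp32(uint32_t a, uint32_t b) {
    const float fa = static_cast<float>(a);   // int -> float conversion
    const float fb = static_cast<float>(b);
    const float sum = fa + fb;                // the work moved to the FP path
    return static_cast<uint32_t>(sum);        // float -> int conversion
}
```

The two conversions on the way in and the one on the way out are the cost referred to in (2): the approach is more attractive when several operations can be chained in floating point between conversions.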

The original question did not suggest this was the case, but it is good to hear that you tried on your own (after all, you have 100% of the relevant information in front of you, but passed on only 5% of it here).

SparkHu · September 27, 2022, 3:00am

Yes, you are right.

SparkHu · September 30, 2022, 4:54am

I have a few more questions.

Which SASS instruction is the PTX instruction addc compiled to? IADD or something else?
The throughput of the extended-precision multiply-add operation (I think that is the PTX instruction madc) confuses me. It seems that madc is executed on the FMA unit. On sm_86 there should be 128 FMA units per SM, because the throughput of FP32 operations is 128 per cycle and they are also executed on the FMA unit. However, the throughput of madc is only 32 per cycle, so the throughput of madc on each FMA unit appears to be only 1/4 instruction per cycle. Is that correct? If so, what causes the instruction to execute so slowly? Or can perhaps only 1/4 of the FMA units execute extended-precision multiply-add instructions?

Best regards
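
For context on the first question: in PTX, add.cc writes the carry flag and addc consumes it, which is how multi-word additions are chained. The host-side C++ below only models that behavior for a 64-bit add built from 32-bit halves; the struct and function names are hypothetical, and the sketch says nothing about which SASS instruction the compiler emits.

```cpp
#include <cstdint>

// A 64-bit value split into two 32-bit words, as it would be held in
// two 32-bit registers.
struct U64Parts {
    uint32_t lo;
    uint32_t hi;
};

inline U64Parts add64_with_carry(U64Parts a, U64Parts b) {
    U64Parts r;
    // Models add.cc: add the low words and record the carry-out.
    r.lo = a.lo + b.lo;
    const uint32_t carry = (r.lo < a.lo) ? 1u : 0u;
    // Models addc: add the high words plus the carry-in.
    r.hi = a.hi + b.hi + carry;
    return r;
}
```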

njuffa · September 30, 2022, 6:55am

(1) A five-minute experiment should suffice to provide a definite answer. What do you observe when you perform this experiment?

(2) Unless you can find a description in some document provided by NVIDIA (including their patent applications), one can only speculate. Per table 3, for sm_86 the plain 32-bit IMAD has a throughput of 64 per cycle, which could be explained by using two passes through the 24-bit multiplier needed for FFMA. Why the “extended precision” variant (IMADC) has half the throughput (32 per cycle, per footnote 6) is not clear, but it may simply be a consequence of cycling through the hardware multiplier twice interacting with the pipeline structure, which creates a pipeline bubble when there is a dependency through the extent/carry bit (the addition only takes place after the second pass through the multiplier).

rs277 · September 30, 2022, 7:43am

Per SM, there are only 64 cores capable of INT32, the other 64 being FP32 only.

Ref: Figure 3 of GA102 Whitepaper and text below it.

njuffa · September 30, 2022, 7:50am

@rs277 Thanks for the pointer. In an ideal world, askers would study all available NVIDIA documentation in detail first.

SparkHu · September 30, 2022, 8:06am

Thank you for helping me gain a deeper understanding of the hardware details. I will do some experiments myself to find the answer to the first question.
Thanks a lot.

Curefab · October 1, 2022, 7:11am

Hi spraesi,
I think the ICMP instruction was only available until Maxwell (ISETP does similar things).
ISET now only exists as ISETP.
FCMP is now done by DSETP/FSET/FSETP/HSET2/HSETP2.
I also believe BFE, BFI, PSET, CSET, CSETP and XMAD are no longer available.
Also the vector instructions (?) VMAD, VADD, VMNMX, VSET, VSHL, VSHR, VSETP were removed.

I do not know GETFPFLAGS and SETFPFLAGS. Were those valid SASS instructions?

IMAD, IMUL, IMUL32I, IDP, IDP4A are on the FMA Lighter unit.
IMAD32I does not exist anymore.

Best,
Sebastian

spraesi · October 1, 2022, 10:13am

You’re right! I didn’t prune the deprecated instructions. And the last ones mentioned are on the FMA lighter unit, but what is the FMA lighter unit?

It also seems that GETFPFLAGS and SETFPFLAGS are not valid, at least not for that compute capability.
