In my application, the ALU is too busy and limits performance. How can I move some calculations from the ALU to other hardware units?
Most operations in my application are bitwise XOR, funnel shift, and integer addition.
Any advice?
Since SASS instructions are for the most part undocumented, there is no such document.
But if you have an instruction in mind, we can make an educated guess.
If it is possible to use another unit, the compiler will usually do it if it seems beneficial.
I’ve seen in some cases it uses FP16 units to implement MOV (e.g. moving a number to a register by multiplying with zero and adding the value/constant it wants to move there).
What are the most common stall reasons in your code?
It also depends on compute capability.
I think these are all the ALU instructions with compute capability 8.6 (RTX 30 series):
CS2R,ICMP,ISCADD,ISCADD32I,IMNMX,BFE,BFI,SHR,SHL,ISET,ISETP,SHF,FCMP,FMNMX,FSET,FSETP,GETFPFLAGS,SETFPFLAGS,SEL,FSEL,P2R,R2P,CSET,CSETP,PSET,PSETP,LEPC,VOTE,LEA,PRMT,VMAD,VADD,VABSDIFF,VMNMX,VSET,VSHL,VSHR,VSETP,VABSDIFF4,IDE,IADD3,IADD,IADD32I,LOP,LOP32I,LOP3,XMAD,MOV,MOV32I,MOVM,PLOP3,SGXT,BMSK,IABS,RPCMOV,IMMA,I2I,I2IP,BMMA,SCATTER,SPMETADATA,F2FP,GATHER,GENMETADATA,F2IP,I2FP,BITEXTRACT
And then of course the IMAD, IMUL, IMAD32I, IMUL32I, IDP, IDP4A - on CC 8.6 specifically these few instructions compete with some FP32 instructions for the same hardware units - I think.
If any of your integers can be made warp-uniform, you can offload some of the processing to the uniform datapath. Did you make sure to use only signed integers where possible? This can be faster.
Other than this, it can be hard to improve if your work is truly compute-bound. Try to reduce unnecessary work/instructions in that case as much as possible.
Nsight Compute reports:
- ALU is the highest-utilized pipeline (98.9%)
- Warp cycles per issued instruction (11.7)
- stall not selected (4.94)
- stall math pipe throttle (4.93)
- stall selected (1.0)
- stall wait (0.75)
As 98.9% of ALU throughput is utilized, I think the only chance to optimize is to move some of the integer operations out of the ALU.
How do I make integers warp-uniform? All my integers are unsigned 32-bit integers.
The best practices guide says use signed integers for loop counters. This may give a quite drastic speed-up for you.
It is hard for me to know how to make them warp-uniform without the source code…
Nearly all loops are unrolled because their ranges are constant at compile time.
The only loop with a dynamic trip count is the outermost loop, which runs at most 4 times.
Besides, the counter of the outermost loop is not used in any calculation inside the loop.
So I don’t think changing the loop counters from unsigned int to int will help.
Ahh okay - if it is a compile-time constant, I don’t think it makes a difference either. It matters only in the dynamic case, because unsigned overflow must wrap around, which limits what the compiler may assume about the counter.
The main options are 1. a more efficient algorithm if possible (reduce load), 2. move some work to uniform datapath (increase capacity).
Other than that I don’t really see a way to improve this. It seems the code execution is efficient, the only reason you reach the INT32 capacity is probably that this pipeline cannot issue an instruction every cycle (on CC 8.6 they can at most do it every second cycle).
This may be of interest for the original question in the thread title. It’s not an exhaustive mapping.
Funnel shifts tend to be slower / lower throughput than logical operations and simple arithmetic on most processors including GPUs. There are technical reasons for this. You would want to
(1) Look into reducing the number of funnel shifts
(2) Look into reducing the number of logical and arithmetic operations overall
(3) Look into tradeoffs between logical and arithmetic operations
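As one illustration of point (1), rotations distribute over XOR, so two rotates by the same amount followed by an XOR can be replaced by one XOR followed by one rotate. The sketch below is plain host-side C++ that models a left funnel shift and checks this identity. The function names are made up for the example, and whether the pattern occurs in the asker's code is unknown since no source was posted.

```cpp
#include <cassert>
#include <cstdint>

// Reference model of a left funnel shift: concatenate hi:lo into 64 bits,
// shift left by (shift & 31), and keep the upper 32 bits.
constexpr std::uint32_t funnel_shift_left(std::uint32_t lo, std::uint32_t hi,
                                          std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((wide << (shift & 31u)) >> 32);
}

// A 32-bit rotate is a funnel shift with both inputs equal.
constexpr std::uint32_t rotl32(std::uint32_t x, std::uint32_t n) {
    return funnel_shift_left(x, x, n);
}

// Two rotates and one XOR.
constexpr std::uint32_t mix_two_shifts(std::uint32_t a, std::uint32_t b,
                                       std::uint32_t n) {
    return rotl32(a, n) ^ rotl32(b, n);
}

// Same result with one XOR and a single rotate.
constexpr std::uint32_t mix_one_shift(std::uint32_t a, std::uint32_t b,
                                      std::uint32_t n) {
    return rotl32(a ^ b, n);
}

// Checks the identity for every rotate amount on a few sample values.
inline void check_rotate_identity() {
    const std::uint32_t samples[] = {0u, 1u, 0x80000001u, 0x12345678u, 0xFFFFFFFFu};
    for (std::uint32_t a : samples)
        for (std::uint32_t b : samples)
            for (std::uint32_t n = 0; n < 32; ++n)
                assert(mix_two_shifts(a, b, n) == mix_one_shift(a, b, n));
}
```

The identity holds because a rotate only permutes bit positions, and XOR acts on each position independently. The same reasoning applies to AND and OR, but not to addition, where carries cross bit positions.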
Without specific code to look at, it is hard to tell how much potential speed-up there is to be had. With the advent of LOP3, a lot of the classical re-organizations of logical operations for optimization purposes (e.g. in crypto codes) have lost their importance or have become meaningless, but careful examination of sequences of multiple LOP3 operations generated by the compiler may reveal that they are not optimal (based on my experience, 50% of the time), and you likely would want to code them by hand in PTX (while this is generally brittle, so far the CUDA compiler doesn’t appear to tease apart LOP3 operations specified at PTX level).
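For readers who have not hand-coded LOP3 before, here is a minimal sketch of how the 8-bit lookup-table immediate can be derived. The PTX ISA describes applying the desired Boolean function to the constants 0xF0, 0xCC and 0xAA. The code is host-side C++ that models the operation bit by bit. `lop3_reference` and the `LUT_*` names are made up for this example and are not CUDA APIs.

```cpp
#include <cassert>
#include <cstdint>

// Truth-table constants for inputs a, b and c.
constexpr std::uint32_t TA = 0xF0, TB = 0xCC, TC = 0xAA;

// Apply the desired function to the constants to obtain the immediate.
constexpr std::uint8_t LUT_XOR3 = static_cast<std::uint8_t>(TA ^ TB ^ TC);
constexpr std::uint8_t LUT_CH   = static_cast<std::uint8_t>((TA & TB) | (~TA & TC));
constexpr std::uint8_t LUT_MAJ  = static_cast<std::uint8_t>((TA & TB) | (TA & TC) | (TB & TC));

static_assert(LUT_XOR3 == 0x96, "a ^ b ^ c");
static_assert(LUT_CH   == 0xCA, "(a & b) | (~a & c)");
static_assert(LUT_MAJ  == 0xE8, "majority of a, b, c");

// Bitwise model of a three-input logic op: for each bit position, the
// three input bits form an index (a is the most significant) into the table.
constexpr std::uint32_t lop3_reference(std::uint32_t a, std::uint32_t b,
                                       std::uint32_t c, std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t index = (((a >> bit) & 1u) << 2) |
                                    (((b >> bit) & 1u) << 1) |
                                    ((c >> bit) & 1u);
        result |= ((static_cast<std::uint32_t>(lut) >> index) & 1u) << bit;
    }
    return result;
}

// Compares the table-driven model against the plain expressions.
inline void check_lop3_tables() {
    const std::uint32_t a = 0x12345678u, b = 0x9ABCDEF0u, c = 0x0F1E2D3Cu;
    assert(lop3_reference(a, b, c, LUT_XOR3) == (a ^ b ^ c));
    assert(lop3_reference(a, b, c, LUT_CH) == ((a & b) | (~a & c)));
    assert(lop3_reference(a, b, c, LUT_MAJ) == ((a & b) | (a & c) | (b & c)));
}
```

When inspecting compiler output, one way to spot a suboptimal sequence is to check whether several dependent LOP3 instructions together compute a function of only three distinct inputs. If so, a single table would suffice.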
In addition to examining low-level instruction mapping, you would probably want to go back to the algorithmic level to see how you can optimize there (e.g. choose a bit-sliced implementation).
Thank you. I moved some operations from the ALU to the FMU, and the performance increased.
That can be a good idea when (1) the range of integers that need to be handled is limited (2) the cost of converting in and out of floating-point space doesn’t nullify the advantage of higher throughput by utilizing the FP path.
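To make condition (1) concrete: a 32-bit float has a 24-bit significand, so integer addition through the FP path is exact only while operands and result stay within 2^24. The host-side C++ sketch below illustrates that limit. `add_via_float` and `float_add_is_exact` are hypothetical helpers for this example, and it says nothing about condition (2), the conversion cost on the GPU.

```cpp
#include <cassert>
#include <cstdint>

// Largest magnitude up to which every integer is exactly representable
// in a 32-bit float (24-bit significand).
constexpr std::uint32_t kExactLimit = 1u << 24;

// Adds two integers using float arithmetic. The precondition keeps the
// operands and the sum inside the exactly representable range.
inline std::uint32_t add_via_float(std::uint32_t a, std::uint32_t b) {
    assert(a <= kExactLimit && b <= kExactLimit - a);
    const float sum = static_cast<float>(a) + static_cast<float>(b);
    return static_cast<std::uint32_t>(sum);
}

// Reports whether the float computation reproduces the integer sum.
inline bool float_add_is_exact(std::uint32_t a, std::uint32_t b) {
    const float sum = static_cast<float>(a) + static_cast<float>(b);
    return static_cast<std::uint64_t>(sum) == static_cast<std::uint64_t>(a) + b;
}
```

For instance, `float_add_is_exact(16777216u, 1u)` is false because 2^24 + 1 has no float representation. Bitwise operations such as XOR and funnel shifts have no direct floating-point counterpart, so only the arithmetic part of a workload is a candidate for this approach.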
The original question did not suggest this was the case, but it is good to hear that you tried on your own (after all, you have 100% of the relevant information in front of you, but passed on only 5% of it here).
Yes, you are right.
I have a few more questions.
- Which SASS instruction is the PTX instruction addc compiled to? IADD or something else?
- The throughput of the extended-precision multiply-add operation (I think it’s the PTX instruction madc) confuses me. It seems that madc is executed on the FMA unit. On sm_86 there should be 128 FMA units per SM, because the throughput of FP32 operations is 128 per cycle and they are also executed on the FMA unit. However, the throughput of madc is only 32 per cycle, so it seems the throughput of madc on each FMA unit is only 1/4 instruction per cycle. Is that correct? If so, why does the instruction execute so slowly? Or can only 1/4 of the FMA units execute extended-precision multiply-add instructions?
Best regards
(1) A five-minute experiment should suffice to provide a definite answer. What do you observe when you perform this experiment?
(2) Unless you can find a description in some document provided by NVIDIA (including their patent applications), one can only speculate. Per table 3, for sm_86 the plain 32-bit IMAD has a throughput of 64 per cycle, which could be explained by using two passes through the 24-bit multiplier needed for FFMA. Why the “extended precision” variant (IMADC) has half the throughput (32 per cycle, per footnote 6) is not clear, but it may simply be a consequence of cycling through the hardware multiplier twice interacting with the pipeline structure, which creates a pipeline bubble when there is a dependency through the extent/carry bit (the addition only takes place after the second pass through the multiplier).
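To show the carry dependency being discussed, here is a host-side C++ model of a two-instruction extended-precision multiply-add chain (low half with carry-out, high half with carry-in), following the PTX semantics of mad.lo.cc and madc.hi as I understand them from the PTX ISA. The function and struct names are invented for the example. It models arithmetic results only and implies nothing about the SASS mapping or throughput.

```cpp
#include <cassert>
#include <cstdint>

struct WithCarry {
    std::uint32_t value;  // 32-bit result
    std::uint32_t carry;  // carry-out, 0 or 1
};

// Models mad.lo.cc.u32: low 32 bits of a*b, plus c, recording the carry-out.
constexpr WithCarry mad_lo_cc(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    const std::uint32_t lo =
        static_cast<std::uint32_t>(static_cast<std::uint64_t>(a) * b);
    const std::uint32_t sum = lo + c;
    return {sum, sum < lo ? 1u : 0u};
}

// Models madc.hi.u32: high 32 bits of a*b, plus c, plus the carry-in.
constexpr std::uint32_t madc_hi(std::uint32_t a, std::uint32_t b,
                                std::uint32_t c, std::uint32_t carry_in) {
    const std::uint32_t hi =
        static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> 32);
    return hi + c + carry_in;
}

// Computes (a * b + c) mod 2^64 from 32-bit pieces. The high half cannot
// be completed before the low half has produced its carry.
constexpr std::uint64_t mul_add_wide(std::uint32_t a, std::uint32_t b,
                                     std::uint64_t c) {
    const WithCarry low = mad_lo_cc(a, b, static_cast<std::uint32_t>(c));
    const std::uint32_t high =
        madc_hi(a, b, static_cast<std::uint32_t>(c >> 32), low.carry);
    return (static_cast<std::uint64_t>(high) << 32) | low.value;
}

// Compares the pieced-together result with native 64-bit arithmetic.
inline void check_mul_add_wide() {
    const std::uint32_t a = 0xFFFFFFFFu, b = 0xFFFFFFFFu;
    const std::uint64_t c = 0xFFFFFFFFFFFFFFFFull;
    assert(mul_add_wide(a, b, c) == static_cast<std::uint64_t>(a) * b + c);
    assert(mul_add_wide(3u, 5u, 7u) == 22u);
}
```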
Per SM, there are only 64 cores capable of INT32, the other 64 being FP32 only.
Ref: Figure 3 of GA102 Whitepaper and text below it.
@rs277 Thanks for the pointer. In an ideal world, askers would study all available NVIDIA documentation in detail first.
Thank you for helping me gain a deeper understanding of the hardware details. I will run some experiments myself to find the answer to the first question.
Thanks a lot.
Hi spraesi,
I think the ICMP instruction was only available until Maxwell (ISETP does similar things).
ISET now only exists as ISETP.
FCMP is now done by DSETP/FSET/FSETP/HSET2/HSETP2.
Also I believe, BFE, BFI, PSET, CSET, CSETP and XMAD are not available anymore.
Also the vector instructions (?) VMAD, VADD, VMNMX, VSET, VSHL, VSHR, VSETP were removed.
I do not know GETFPFLAGS and SETFPFLAGS. Were those valid SASS instructions?
IMAD, IMUL, IMUL32I, IDP, IDP4A are on the FMA Lighter unit.
IMAD32I does not exist anymore.
Best,
Sebastian
You’re right! I didn’t prune the deprecated instructions. And the last ones mentioned were on FMA lighter - but what is the FMA lighter unit??
And it does not seem that GETFPFLAGS and SETFPFLAGS are valid, at least not for that compute capability.