What limits the IPC in CUDA? Or: how can I decrease the average execution dependency cycles?

For nearly six months now, I have been developing a special-purpose ray caster based on CUDA. It works quite well except for its GFLOPS utilization of the GPU. In the Nsight 2.2 VS2010 profiler, all of the usually critical values, such as branch efficiency, warp issue efficiency, and replay overhead, seem to be fine (see below).

According to the profiler’s measurements, the ray caster achieves only 17 % of the GPU’s peak performance (216 vs 1267 GFLOPS).

My thought was that the relatively low executed IPC [1.92] (the maximum for compute capability 2.1 is 4) is responsible for this result. As far as I know, this number depends on

the occupancy [0.77] (and with it the number of eligible warps [3.79])
the instruction serialization [0.04] as well as
the execution dependency cycles [16.42] and
the instruction mix [0.65 FMA].

Given these values, I cannot see any factor that limits the ray caster as much as the high average number of execution dependency cycles. As the execution dependency [0.49] is also the main stall reason, my questions are:

How to decrease these execution dependency cycles?

Is there any other profiler value that could explain the weak GFLOPS performance?

What fraction of peak performance is realistically achievable anyway?

For completeness:

I use only single-precision (float) operations.
The global load cache hit rates are not very good because the ray caster copies the triangles from the acceleration structure into a special shared-memory cache only once and afterwards leaves them and global memory untouched (except for local memory accesses, of course).
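
Roughly, the caching works like the following sketch (names, layout, and sizes are made up for illustration; the real code differs):

// Simplified sketch of the shared-memory triangle cache described above.
// triCache, loadTriangleCache(), and the float4 layout are illustrative only.
extern __shared__ float4 triCache[];   // dynamic shared memory, sized at launch

__device__ void loadTriangleCache(const float4* __restrict__ gTriangles,
                                  int numFloat4s)
{
    // every thread of the block copies a slice of the node's triangle data;
    // this happens once, afterwards only the shared copy is read
    int tid     = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int threads = blockDim.x * blockDim.y * blockDim.z;
    for (int i = tid; i < numFloat4s; i += threads)
        triCache[i] = gTriangles[i];
    __syncthreads();   // make the cached triangles visible to the whole block
}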

Thank you for your time,

Tratos


Profiler Measurements

General

GPU: GeForce 560 Ti
compute capability: 2.1
code generation: compute_20,sm_21
number of CUDA cores: 384
peak performance: 1267 GFLOPS

Setting

registers per thread: 20
dynamic shared memory: 6,748 bytes
local memory: 25,952,256 bytes
grid dimension: {68,68,1}
block dimension: {32,2,4}

Occupancy

achieved occupancy: 0.77
theoretical occupancy: 1.00

Instruction Statistic

GPU issued IPC: 2.00
GPU executed IPC: 1.92
GPU SM activity: 1.00
GPU serialization: 0.04

Branch Statistic

branch efficiency: 0.92
divergence: 0.08
control flow efficiency: 0.92

Issue Efficiency

active warps per active cycle: 36.73
eligible warps per active cycle: 3.79

execution dependency cycles (short): 16.42
execution dependency cycles (long): 5.31
max dependency utilization: 0.92

warp issue efficiency (no eligible): 0.07
warp issue efficiency (one eligible): 0.15

instruction fetch stall reason: 0.35
execution dependency stall reason: 0.49
data request stall reason: 0.08
synchronization stall reason: 0.02
other stall reason: 0.06

Memory Statistic

global replay overhead: 0.01
local replay overhead: 0.06
shared replay overhead: 0.01
bank conflicts/shared requests: 0.03

global transactions/requests (load): 2.44
global transactions/requests (store): 2.00
local transactions/requests (load): 1.04
local transactions/requests (store): 1.00
shared transactions/requests (load): 1.03
shared transactions/requests (store): 2.83

global L1 cache hit rate (load): 0.42
local L1 cache hit rate (load): 0.25
local L1 cache hit rate (store): 0.36
L2 cache hit rate (load): 0.67

FLOPS and Operations

FMA operations percentage: 0.65
MUL operations percentage: 0.24
ADD operations percentage: 0.10
special operations percentage: 0.01

single-precision FLOP count: 6,873,872,383
runtime: 32 ms
single GFLOPS: 216.63
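
As a cross-check, these values are consistent with each other (taking the 1.65 GHz shader clock used further down and counting one FMA as two floating-point operations):

peak:        384 CUDA cores * 2 FLOP (FMA) * 1.65 GHz ≈ 1267 GFLOPS
achieved:    6,873,872,383 FLOP / ~32 ms              ≈  216 GFLOPS
utilization: 216 / 1267                               ≈  0.17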

The low GFLOPS value may simply be a function of the low density of floating-point instructions in the code. Even code that one would think of as “floating-point intensive” often has a surprisingly low percentage of floating-point instructions. I recall looking at this in some detail for a solver code with low GFLOPS, and it turned out only about 20% of the instructions were actual floating-point operations; the rest were loads and stores, integer arithmetic for array indexing, control transfer instructions, and synchronization. You can check the generated machine code by running cuobjdump --dump-sass on your executable.

From what I understand (I am not an expert here, as my work focuses on Tesla, and sm_21 is not used in that line of GPUs), the sm_21 architecture makes it difficult to reach the theoretical maximum IPC: 48 execution units are available per SM, but you need just the right mix of instructions to utilize the “extra” units. Given that, the IPC looks reasonable to me; the occupancy is very good and should easily allow covering all basic latencies. I am not familiar with the “execution dependency cycles” metric and whether it really matters when the occupancy is high. If there is evidence that this is really an issue (at the moment I doubt it), my guess would be that it has to do with the code having fairly small basic blocks that prevent the compiler from scheduling dependent instructions further apart. Overall, the statistics indicate a highly efficient program to me, but it would be interesting to hear from other sm_21 users who have comparison values for their codes.

@njuffa Thank you very much for your fast and interesting answer! In the following, I want to comment on some of your thoughts:


  1. I checked the machine code as you suggested and yes, you were right: I have only about 22% floating-point instructions (see below).

Integer 0.28
Float 0.22
Load/Store 0.17
Control 0.15
Movement 0.10
Misc 0.04
Conversion 0.04
Predicate 0.01
Texture 0.00
Surface 0.00

Concerning the throughput of the arithmetic instructions that occur most frequently,

MOV 0.09
IADD 0.09
BRA 0.09
FFMA 0.08
LDS 0.06
STS 0.05
FSETP 0.05
ISETP 0.04
SHL 0.04
LOP 0.04
FADD 0.04
FMUL 0.04
IMAD 0.03,

I wonder whether the low throughput of IMAD and SHL (only 16 instructions/cycle/SM) could be a major problem.

Furthermore, it is interesting to see that using an integer shift instead of a multiplication by a power of two has no positive influence on the performance at all.


  2. Considering that a special mix of instructions is needed to utilize all 48 CUDA cores per SM (sm_21), the achievable performance in practice melts down to 8 SM * 32 cores * 1.65 GHz = 422 GFLOPS. But even against this value, the ray caster reaches only about 51%. Do you think this gap is caused solely by point 1?

  3. The term “Execution Dependency Cycles” is defined as follows:

The Execution Dependency Cycles provide an estimate on how many cycles on average each instruction needs to wait, until all dependencies are fully resolved. The distribution between short and long dependency cycles provide further insight as to how often a warp was stalled for more than 32 cycles. These long dependencies are typically caused by dependencies on global memory instructions or texture operations, or stalls at barriers.


  4. What do you mean by “fairly small basic blocks”? As far as I know, the GPU’s limits are 8 blocks per SM [me: 6], 1024 threads per block [me: 256], and 48 warps per SM [me: 48].

Speaking somewhat simplistically (but this is still fairly accurate for sm_2x): in each cycle, a thread can start either a single-precision floating-point instruction or a non-FP instruction (such as an integer add). So if only 22% of all instructions are floating-point instructions, you cannot achieve more than 22% of the theoretical maximum FLOPS. Furthermore, the theoretical peak assumes maximum FMA throughput, where one FMA counts as two floating-point operations. Since not all of your floating-point instructions are FMAs, the 17% of peak FP throughput you achieve seems entirely plausible and as expected. The low percentage of IMAD and shift instructions implies that they are not an issue.
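
As a rough back-of-the-envelope check (treating the static SASS percentages as a proxy for the dynamically executed mix, and taking the profiler's 65% FMA share as the fraction of floating-point instructions that are FMAs; both are only approximations):

flops per executed instruction     ≈ 0.22 * (0.65 * 2 + 0.35 * 1) ≈ 0.36
flops per instruction at peak (FMA) = 2
expected fraction of peak           ≈ 0.36 / 2 ≈ 0.18

which is in the same ballpark as the 17% you measured.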

If I understand the description of “execution dependency cycles” correctly, it is the latency seen on a per-warp basis. The basic idea of the GPU as a throughput device is that many active warps cover the latency of each individual warp. In other words, if the number of active warps per SM is greater than or equal to the average stall cycles per warp, the machine is kept 100% busy. I can’t tell whether the numbers given by the profiler are determined per SM or per chip, but from my limited understanding there does not seem to be an issue.

Another mechanism for covering instruction latency (used on both CPUs and GPUs) is to schedule dependent instructions further apart. A “basic block” is, loosely speaking, the straight-line code between consecutive control transfer instructions. In general, it is easier to move instructions around within a basic block than across basic blocks, e.g. across branches or function calls. Long basic blocks therefore often lend themselves to better scheduling for latency tolerance (this can be limited by dependency chains, of course). If-conversion, which turns branches into predicated execution, is one mechanism the compiler uses to lengthen basic blocks. Note that scheduling dependent instructions far apart to increase latency tolerance can drive up register pressure due to longer overlapping live ranges, so there are trade-offs for the compiler to make that are typically addressed by heuristics.
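
To illustrate the idea with a generic sketch (hypothetical code, not taken from your ray caster): splitting one serial accumulation into two independent chains per loop iteration gives the compiler independent instructions to interleave, so dependent instructions end up further apart:

// hypothetical example: two independent FMA dependency chains per iteration
__device__ float dotChunk(const float* __restrict__ a,
                          const float* __restrict__ b, int n)
{
    float s0 = 0.0f, s1 = 0.0f;              // two independent accumulators
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 = fmaf(a[i],     b[i],     s0);   // chain A
        s1 = fmaf(a[i + 1], b[i + 1], s1);   // chain B, independent of A
    }
    if (i < n)
        s0 = fmaf(a[i], b[i], s0);           // leftover element for odd n
    return s0 + s1;                          // combine the two chains
}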

In practical terms, you may want to review your code to make sure it makes appropriate use of “const” and “restrict” (spelled __restrict__ in CUDA C) on pointer arguments, to allow the compiler maximum freedom in scheduling long-latency load instructions. See the Best Practices Guide. Note that this may provide no speed-up for a particular piece of code, but I have also seen real-life cases where it provided an instant 10% performance boost.
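
For example (a hypothetical kernel signature; your parameter list will of course differ):

// read-only inputs marked const ... __restrict__ tell the compiler that loads
// from them cannot alias the output stores, so it may hoist and reorder them
__global__ void castRays(const float4* __restrict__ triangles,
                         const float4* __restrict__ rays,
                         float4*       __restrict__ hitResults)
{
    // kernel body unchanged
}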

I have a side question: how did you get the instruction statistics (like the ones above)? I ran cuobjdump --dump-sass on my executable and it just dumps the whole assembly to the terminal. This info sounds useful to know.

thanks,
Nitin

@nitin.life I used the command

C:\users\user_name>cuobjdump --dump-sass --sort-functions “C:\users\user_name.…\CUDA_Release_x64\CudaCaster.cu.obj” > out.txt

to generate the assembly output file out.txt in C:\users\user_name. Afterwards, I imported this file into Excel (using tab delimiters). If you want, you can use my Excel sheet (attached): the only thing you have to do is replace my example assembly code with yours in column D of the first table. Column A then lists the instructions by type, and the second table summarizes the ratio of each type as well as all instructions sorted by their frequency of occurrence.
cuobjdump.xlsx (19.5 KB)
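
If you prefer a small program over Excel, a rough host-side tally like the one below should give the same per-opcode counts (an untested sketch; depending on the cuobjdump version, the SASS line format may need slight adjustments):

// reads a SASS dump ("cuobjdump --dump-sass app > sass.txt") and counts opcodes
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main(int argc, char** argv)
{
    std::ifstream in(argc > 1 ? argv[1] : "sass.txt");
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(in, line)) {
        // instruction lines look roughly like:  /*0018*/  FFMA R0, R2, R3, R0;
        std::string::size_type p = line.find("*/");
        if (p == std::string::npos) continue;          // skip non-instruction lines
        std::istringstream rest(line.substr(p + 2));
        std::string op;
        if (!(rest >> op)) continue;
        if (op[0] == '@') rest >> op;                   // skip predicates like @P0
        counts[op.substr(0, op.find('.'))]++;           // drop modifiers like .FTZ
    }
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
        std::cout << it->first << "\t" << it->second << "\n";
    return 0;
}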

@njuffa

“If only 22% of all instructions are floating-point instructions, you can’t achieve more than 22% of the theoretical maximum FLOPS.”

Is it really possible to map the ratio of the assembly instructions 1:1 onto the number of instructions executed during a real kernel launch? In my case, I have an inner float-intensive loop that computes the intersection tests between the rays and the triangles and is responsible for approximately 50% of the runtime.


“In other words, if the number of active warps per SM is greater or equal to the average stall cycles per warp, the machine is kept 100% busy.”

May I quote some statements from the Profiler Documentation:

“Derived from the dependency cycles and the active warps, the Max Dependency IPC defines an upper bound for the kernel’s achievable instruction throughput.”

I think you are almost right on this point: as I have on average 36.73 Active Warps per Active Cycle per SM and on average 16.42 Execution Dependency Cycles per SM, it seems logical to me that the Max Dependency IPC lies around 2.1, which it does in my case. On the other hand, considering the Profiler Documentation once more:

“For devices of compute capability 2.x, a multiprocessor has two warp schedulers. Each warp scheduler manages at most 24 warps, for a total of 48 warps per multiprocessor. The kernel execution configuration may reduce the runtime limit. The first scheduler is in charge of the warps with an odd ID, and the second scheduler is in charge of warps with an even ID. At every instruction issue time, each scheduler will pick an eligible warp from its list of active warps and issue an instruction. At every instruction issue time, each scheduler issues 2 instructions for devices of compute capability 2.1 for some warp that is ready to execute (if any). A warp scheduler can issue an instruction to only half of the CUDA cores”

If I understand this correctly, with my avg 3.79 Eligible Warps per Active Cycle per SM, I should be able to nearly utilize the two warp schedulers - with 4 IPC (two instructions per scheduler) - if I can reduce the Execution Dependency Cycles per SM so that

(Active Warps per Active Cycle per SM) / (Execution Dependency Cycles per SM)

becomes close to 4. Since the Execution Dependency Stall Reason amounts to 49%, I think your suggestion of scheduling dependent instructions further apart may help to solve that problem.
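
In numbers: 36.73 active warps / 16.42 dependency cycles ≈ 2.2, which is close to the Max Dependency IPC the profiler reports; to leave room for 4 IPC, the average dependency cycles would have to drop to roughly 36.73 / 4 ≈ 9 cycles.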

Considering the last sentence of the quoted Profiler Documentation, I wonder whether this means that instructions are no longer executed in parallel on all CUDA cores of an SM, but only on half of them. Otherwise, the second scheduler would have to pick a warp that behaves exactly like the first scheduler's warp in order to utilize all 48 CUDA cores. Furthermore, how can a scheduler issue 2 instructions to its 24 cores at all if the maximum instruction throughput is limited to 1 operation per core per cycle?


“Another mechanism for covering instruction latency (used on CPUs and GPUs) is to schedule dependent instructions apart.”

I understand the idea you suggested, thank you very much!