What can be learned from IPC (via nvprof)?

After profiling a set of kernels that run concurrently on two GPUs, I took notice of the IPC count (computed through nvprof).



Sample output:

NOTE: need to scroll right to see the relevant values.

==5924== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
        Kernel: compress_y_half(float const *, __half2*, int)
          1                                       ipc                              Executed IPC    1.243167    1.243167    1.243167
        Kernel: sum_buffers_512(float4 const *, float4*)
          1                                       ipc                              Executed IPC    0.137906    0.137906    0.137906
        Kernel: simple_back_512(float const *, __half2 const *, float2*, float, float, int, int)
         64                                       ipc                              Executed IPC    2.879317    3.224608    3.049561
Device "GeForce GTX TITAN X (1)"
        Kernel: compress_y_half(float const *, __half2*, int)
          1                                       ipc                              Executed IPC    1.162051    1.162051    1.162051
        Kernel: simple_back_512(float const *, __half2 const *, float2*, float, float, int, int)
         60                                       ipc                              Executed IPC    2.807879    3.147826    2.962747

What operations are counted towards that value? Is it just floating point operations or do other integer operations also count?

So is this a truly relevant metric of performance?
If so, are such values for my primary workhorse kernel as good as they seem (3.04 average for the faster of the two GTX Titan X GPUs), or is there some other caveat to such simple conclusions?

I don’t know how GPUs count the “I” in IPC, but typically IPC is computed across all executed machine instructions, although the next question is typically whether this includes instructions that were predicated off :-) If the NVIDIA documentation doesn’t specify this in detail, it may be difficult to find out.

There is no direct connection between IPC (instructions per cycle) and performance. The reason is that a shorter dynamic instruction sequence is not necessarily the faster one, depending on the latency of individual instructions, internal resource contention, scheduling constraints, etc. Note that the cycles will also be influenced by stalls in the memory hierarchy for example. That said, significant differences in IPC could be indicative of somewhat correlated performance differences.

IPC is most useful to processor designers for measuring architectural efficiency by measuring the IPC of fixed sequences of machine instructions, which traditionally used to be recorded in the form of instruction traces, although that has been largely replaced by execution-driven simulation. Since GPUs do not provide binary compatibility (e.g. integer multiplication is a single instruction in sm_3x but a multi-instruction emulation sequence in sm_5x), the utility of IPC is probably limited to examining architectural variants within the same basic architecture generation.

I use it quite a bit, as my kernels are typically designed for full utilization. This means I should get an IPC of 4 (one instruction per scheduler per clock) plus however many additional instructions are dual issued. If I know the exact ratio of dual-issued to single-issued instructions, I can get a pretty good indication of how close I am to full utilization. sm_efficiency should also give me similar information, but it’s not clear to me how it’s calculated, so I prefer to look more at IPC.
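As a rough back-of-the-envelope check, that relationship between a measured IPC and the per-SM issue ceiling can be sketched as follows. The scheduler count of 4 matches Maxwell-class SMs as described above; the dual-issue fraction is a made-up placeholder you would substitute from your own instruction mix:

```python
# Sketch: relating a measured executed IPC to a theoretical per-SM ceiling.
# Assumes 4 warp schedulers per SM, each issuing one instruction per clock,
# plus a (hypothetical) fraction of cycles with a second dual-issued
# instruction. Numbers are illustrative, not measured.

def max_ipc(schedulers=4, dual_issue_fraction=0.2):
    """Theoretical executed-IPC ceiling per SM.

    dual_issue_fraction: fraction of issue slots in which a second
    instruction is dual-issued alongside the first (assumed value).
    """
    return schedulers * (1.0 + dual_issue_fraction)

def utilization(measured_ipc, schedulers=4, dual_issue_fraction=0.2):
    """Measured IPC as a fraction of the assumed ceiling."""
    return measured_ipc / max_ipc(schedulers, dual_issue_fraction)

print(max_ipc())          # ceiling under the assumed 20% dual issue
print(utilization(3.05))  # how close a kernel at ~3.05 IPC would be
```

With these assumed numbers, a measured IPC around 3 would sit noticeably below the ceiling, which is why knowing the actual dual-issue ratio matters before drawing conclusions.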

As far as I can tell predicated instructions behave like normal instructions in every way except that their access to the register bank is gated. So they tend to consume less power.

It used to be the case on NVIDIA GPUs that instructions that are predicated off went through the entire execution pipeline but were killed before register write-back in the case of register-to-register instructions, and killed after address computation but before memory access in the case of load/store instructions (so an instruction that is predicated off with an invalid address won’t raise an out-of-bounds exception). Based on what you say, that still seems to be the case. Since the instructions get killed very late in the pipeline when predicated off, I would anticipate the power savings to be modest.

Have you specifically looked into whether instructions that get predicated off get counted for IPC? One reasonable hypothesis would be that IPC is computed based on retired instructions, rather than dispatched instructions, so instructions dispatched but killed in the pipeline would not count. But the alternative design, i.e. base IPC on dispatched instructions, is also possible. Performance counters are often grabbed where it is most convenient to extract them.

Except in narrowly confined circumstances (which may apply to hand-coded Maxwell code), utilization is not directly correlated to performance. This is frequently observed with complex processors, be they CPUs or GPUs. In my GPU optimization work I have quite frequently encountered the following scenario: I reduced the number of instructions by clever recoding but achieved no decrease in execution time, that is, IPC and utilization were reduced but performance remained constant.

The reason for lack of speedup was that while I reduced the number of instructions to accomplish a particular task, my code was still limited by some internal bottleneck (quite possibly the same one), and identifying that bottleneck in the absence of a detailed microarchitecture description was difficult enough that I usually just kept recoding the same stretch of code until I identified the “winning combination”.

Just set up a simple microbenchmark with all-false predicated instructions, and they do in fact count towards IPC. warp_execution_efficiency/warp_nonpred_execution_efficiency is what you want to look at if your concern is with predicates.
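For what it’s worth, the distinction between those two metrics can be sketched with hypothetical per-instruction thread counts. The formulas below follow my reading of the usual definitions (average active, respectively non-predicated-off, threads per executed warp instruction, as a fraction of warp size), not official documentation:

```python
# Sketch of warp_execution_efficiency vs. warp_nonpred_execution_efficiency.
# The per-instruction thread counts are hypothetical sample data.

WARP_SIZE = 32

# For each executed warp instruction: (active threads, threads among the
# active ones whose predicate was true). Made-up example mix.
instructions = [(32, 32), (32, 8), (16, 16), (32, 0)]

def warp_execution_efficiency(insts):
    """Captures divergence only: average active threads / warp size."""
    return sum(active for active, _ in insts) / (len(insts) * WARP_SIZE)

def warp_nonpred_execution_efficiency(insts):
    """Captures divergence plus predication: only threads not predicated off."""
    return sum(nonpred for _, nonpred in insts) / (len(insts) * WARP_SIZE)

print(warp_execution_efficiency(instructions))
print(warp_nonpred_execution_efficiency(instructions))
```

The gap between the two numbers is exactly the work lost to false predicates, which is what plain IPC hides.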

Predicated memory accesses seem to enter the pipeline just like normal instructions, except that if the predicate is false and warp-uniform, their latency is much reduced (around 40 clocks instead of 200 for a global load).

One thing I’m still not clear on is the difference between executed and issued IPC. If predicates are not the difference there then what is? The differences I see in the two values are typically tiny.

Hard to tell. On CPUs, one typically distinguishes between “decoded”, “dispatched”, “retired” depending on how early in the pipeline the instructions are counted, and the counts are “decoded” >= “dispatched” >= “retired”.

Leaving predication aside, instructions may be killed due to branch mispredictions or exceptions, for example. So one might find that some code, while not yet limited by execution resources, is maxed out in decode, the difference being due to branch mispredictions.

To save hardware, GPUs do not perform branch prediction; they assume straight-line execution. That means that a loop-closing backwards branch will introduce a fetch bubble. Whether that has an impact on either of the two GPU counts you mention might be distinguishable by constructing loops with ever larger loop bodies.

You got it. That’s exactly the difference. The bigger the loop the smaller the gap. And I tend to spend most of my kernel time in large loops so I mostly see only tiny differences between issued and executed IPC.
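A toy model of that effect, with made-up cycle costs, just to show the trend described above (one fixed fetch bubble per backward branch, amortized over a growing loop body):

```python
# Toy model of the issued-vs-executed IPC gap caused by a loop's fetch
# bubble. The bubble cost is a hypothetical constant per iteration; all
# cycle numbers are illustrative only.

def ipc_gap_fraction(body_cycles, bubble_cycles=4):
    """Fraction of issue opportunities lost to the fetch bubble
    in one loop iteration."""
    return bubble_cycles / (body_cycles + bubble_cycles)

for body in (8, 64, 512):
    print(body, ipc_gap_fraction(body))
```

The fraction shrinks as the body grows, matching the observation that kernels dominated by large loops show only a tiny difference between issued and executed IPC.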

Executed IPC is the count of instructions executed (guaranteed to retire) per active cycle. The SM is considered active if at least one warp is allocated on the SM. Fully predicated off instructions count as executed instructions.

Issued instructions >= executed instructions. On Kepler, issued instructions were often much greater than executed instructions because L1 divergent accesses, L1 vector accesses (128-bit), L1 load misses, and shared memory bank conflicts were handled by replaying the instruction. On Maxwell, replays are handled further down the pipe by the texture unit and shared memory unit.

Replays can occur for architectural reasons. The reasons are different between Kepler and Maxwell and are not easily actionable by a developer.

I’ve definitely noticed the lack of replay overhead on Maxwell. This makes the vector instructions a lot more useful.


So, as Pascal has 4 schedulers per SM, when we get something like 3 as the IPC, does it mean 3 instructions were issued by the SM in a cycle?
Also, does nvprof take an average of the IPC over all SMs for that kernel and report it?
If one SM has an IPC of 3, then a GPU with 2 SMs should give us an IPC of 6, right?
Also, I am working with a simulator called GPGPU-Sim, which reports IPC in a higher range (80-120).
I assume that they are calculating IPC per core, but I am not sure.
Can someone please help me with this?
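One way to make the per-SM vs. whole-GPU distinction concrete: nvprof’s ipc metric appears to be a per-SM average, while GPGPU-Sim’s headline IPC is typically aggregated over all shader cores. Treat both of those readings as my assumptions rather than documented fact, but under them the arithmetic works out like this:

```python
# Sketch: converting a per-SM IPC (as nvprof appears to report it) into a
# GPU-wide aggregate IPC (as a simulator summing over all cores might
# report it). The SM count is an example, not taken from the thread.

def gpu_wide_ipc(per_sm_ipc, num_sms):
    """Aggregate IPC if every SM sustained per_sm_ipc simultaneously."""
    return per_sm_ipc * num_sms

# A per-SM IPC of 3 on a hypothetical 28-SM Pascal part would correspond
# to a GPU-wide IPC of 84, i.e. within the 80-120 range the simulator
# reports.
print(gpu_wide_ipc(3.0, 28))
```

So a per-SM IPC of 3 and a simulator IPC of ~84-120 are not in conflict; they are most likely the same quantity counted over different scopes.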