A deep dive into instruction-level parallelism

  1. Does instruction mix take a toll on performance or not? Most articles suggest that unless there is a register dependency, the scheduler can schedule 2 instructions from the same warp. However, there is a dissertation published at UC Berkeley (http://digitalassets.lib.berkeley.edu/etd/ucb/text/Volkov_berkeley_0028E_16465.pdf) that very briefly touches on instruction mix (middle of page 24). To quote,

“Another factor responsible for mapping between warps and instructions is instruction mix. It
refers to the proportion in which different instruction types are present in the code. When several
instruction types are present in the same code, which is the common case, executing an instruction
of one type in a warp typically implies not executing instructions of other types in the same warp
at the same time. In result, the number of concurrently executed instructions of any particular type
may be smaller than the number of concurrently executed warps.”

However, the author does not go into detail about what it means to mix instructions, or what the toll is. Any more detail on instruction mix, its performance cost, and how to avoid it would be greatly appreciated!

  2. When we say a warp scheduler can schedule 2 “instructions” from a warp in a single cycle, does “instruction” mean assembly-level instructions or CUDA C level instructions?

  3. I am working to improve the ILP of a program. How do you decompose a single line of code that may consist of multiple instructions? For example, in the code below, the last line should decompose into at least 2 instructions, even though it is a single line. That much is fairly obvious since we are dealing with shared memory (and it would be just as obvious for global memory), but it is less obvious for register reads and writes (for example, how many instructions does the second-to-last line compile to?). Is there any literature on that?

// some boilerplate definition of a device kernel
__shared__ float i[1];
float y = 1.0f;   // register initialization
y = y + 2.0f;     // likely a single FADD (or folded away at compile time)
i[0] = 0.1f + y;  // at least two instructions: an FADD plus an STS (shared-memory store)
  4. If you take a look at this presentation, https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf, on pages 35 and 36, instruction-level parallelism increases as the number of consecutive independent memory accesses grows from 2 to 4 and 8. It makes sense that ILP would improve going from 1 to 2 consecutive memory accesses, since a warp scheduler can schedule 2 instructions at once. So shouldn't anything beyond 2 show no further difference? Yet according to the author it makes a significant difference. How can this be explained?

Thanks

“instruction” means machine instruction. The ones you see when you use cuobjdump --dump-sass. You seem to be doing a deep-dive on the details of internal GPU execution mechanics. Many of those change between architecture generations and/or are undocumented. I don’t think this is a very fruitful way to tackle performance issues in CUDA programs. Due to the many GPU hardware changes since Mr. Volkov made his presentation, I would claim it is of limited usefulness today.
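For reference, dumping the SASS of a kernel looks like this, assuming the CUDA toolkit is installed; kernel.cu and sm_61 (GTX 1080 class) are placeholders for your source file and target architecture:

```shell
# Compile to a cubin for a specific architecture (sm_61 = GTX 1080),
# then disassemble the machine instructions; "kernel.cu" is a placeholder.
nvcc -arch=sm_61 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin
```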

I see content related to occupancy on slides 35 and 36 of the linked presentation, which is not ILP in the classical sense. In general, when programming GPUs with CUDA

(1) It is moderately useful to worry about occupancy. But given that you have already found Volkov’s work, you should shortly come across his research demonstrating that occupancy is only weakly correlated with kernel performance (same presentation a few slides further in).

(2) It is almost never useful to worry about instruction-level parallelism. Exceptions may exist for ninja-level programmers.

On modern GPUs, many real-world use cases have become memory bound, and some are struggling to expose sufficient parallelism for the ever-increasing number of execution elements. So to first order, the key to good performance is:

(1) Get lots of threads going (tens of thousands)
(2) Optimize data access patterns, eliminate memory accesses where possible

Before you dive into machine-specific optimizations, think about algorithmic optimizations. No compiler and no profiler is going to help you with those. Beyond that, let your optimization efforts be guided by the CUDA profiler. It will identify bottlenecks for you. Start with the biggest bottleneck reported by the profiler. What does it say it is?

Have you read the Best Practices Guide? It contains many ideas on how to improve performance (many of which may not be applicable to any one specific application, of course).

Hi njuffa.

I really appreciate your helpful words!

I have used nvprof to identify the bottleneck for my kernel. I have made sure all my global memory accesses are coalesced and aligned, and I am currently hitting 99.5% global read and write efficiency. I launch about 512 threads per SM, but the average number of eligible warps per clock cycle is only 1 (so the bottleneck is the lack of eligible warps). I have further profiled and found that 25% of stalls occur because of register dependencies. My occupancy is at 25%, but this cannot go any higher because each thread needs at least 128 registers. So the only way for me to increase the number of eligible warps is to increase ILP (or do you have any other suggestions for me?).

Although this may not be the best practice for generalizability purposes, it would be great if you could share anything you know regarding the questions I had :)

While one shouldn’t obsess about occupancy, 25% occupancy would give me reason for concern if this were my code. 128 registers per thread is unusually high register pressure. Have you looked into what is causing that? Smaller thread blocks can often lead to better utilization of hardware resources, which is why I advocate block configurations of between 128 threads and 256 threads as a starting point.

Have you looked into the use of __launch_bounds() to reduce register usage somewhat? The expected response from the compiler is that it first reduces optimizations with the potential to drive up register pressure, and, if the bounds become even tighter, starts spilling registers to memory. It usually does so somewhat intelligently, such as in outer rather than inner loops, or by vectorizing spills and fills. The cache structure of GPUs typically allows four to six registers per thread to be spilled without impact on performance. If spilling becomes more severe, there will be a noticeable negative impact on performance.
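As a minimal sketch (CUDA C++ compiled with nvcc; the kernel name and body are placeholders), the second argument of __launch_bounds() states how many resident blocks you want, which implicitly caps register use:

```cuda
// Sketch only; kernel name and body are placeholders. With 128 threads per
// block and a minimum of 4 resident blocks per SM, the compiler must keep
// register use at or below 65536 / (4 * 128) = 128 registers per thread on
// a 64K-register SM; raising the second argument tightens the cap further.
__global__ void __launch_bounds__(128, 4)
scale_kernel(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}
```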

Yes, I have tried decreasing registers to 64 and 32 using the -maxrregcount flag, and unfortunately it took a huge toll on performance.

This is interesting. Could you share how this is possible, or point me to any literature on it?

Thanks!

When experimenting with __launch_bounds() (or -maxrregcount, if the code can tolerate this being imposed on the entire compilation unit), you would want to vary the target in much smaller steps. In this case, where the starting point is 128 registers per thread, I might try reducing by eight registers at a time to see what happens.

Small reductions in register count per thread are usually more effective when thread block size is small (due to finite granularity in the allocation of hardware resources).

You might consider that this is plausible by dividing cache size by number of threads to derive the number of bytes a thread could keep in the cache. As for the four to six spilled registers per thread often being harmless, that is an observation from experimenting with __launch_bounds(). Note: “often”, not “always”.

In practical terms, the key to getting better performance out of CUDA code is not so much literature (as NVIDIA keeps much of the architectural details a secret, and they tend to differ from architecture generation to architecture generation), but experimentation. The GPU execution model has more configurable parameters than the typical CPU execution model, e.g. you don’t get to pick the number of registers to be used when running with a CPU. That’s both a blessing and a curse. Consider adding auto-tuning to your software to find the best configuration parameters for a given GPU.

Jumping from 128 to 64 is a really big jump. Most GPUs nowadays have 64K 32-bit registers per SM. At 128 registers per thread, the maximum occupancy is 512 threads. Getting down to 64 would potentially double the occupancy, of course, but is probably an unrealistic goal (while preserving/improving performance) if the only tool is __launch_bounds().

Using launch bounds, I would try a number of steps that would allow improved occupancy, to see if any give a perf benefit.

Below 128 registers/thread, switch to 128 threads/block, and consider trying register limits of 102 (for 5 resident blocks) and 86 (for 6 resident blocks) registers/thread. You could do a finer-grained experiment at 64 threads/block, or even 32 threads/block.

Thanks a lot njuffa, as always. I will experiment with __launch_bounds() and try to decrease registers in smaller steps.

Thank you! I’ll give it a shot.

I tried 8-way ILP (i.e. 8 consecutive independent memory/arithmetic instructions) and the throughput increased to 2.7 times the original (with the same occupancy)!

I profiled with nvprof and found that instructions per cycle increased from 1.16 to 1.42.

The author did write that ILP should help on all CUDA GPUs going forward, and that seems to have been the case, at least on my GTX 1080 :)

I have no idea what this means. Could you show a comparison of what the code looked like before and after?

Each block in my code used to produce a sound sample, one sample per iteration. I changed this such that the number of samples produced per iteration can be given as an input, C++-template style. Peak performance was found at 8 samples per iteration (i.e. instead of producing 1 sample per iteration, it produces 8). I hypothesize there were two major reasons for the performance gain: 1) the number of times the weights had to be loaded decreased to one-eighth, and 2) 8-way ILP allowed a higher number of instructions per cycle.

This is the piece of code that stores the produced value to global memory.

Xt[(num_layers+1)*R + (layer+1)*R + row] = accum;

I changed it to

#pragma unroll
for (int u = 0; u < I_UNROLL; u++) {
    Xt[u*(num_layers+1)*R + (layer+1)*R + row] = accum[u];
}

Notice that “accum”, which used to be a single register, is now an array of registers holding the 8 values produced by a given layer of the generative model. The number of independent memory transactions increased from 1 to 8, which seems to help hide latency and increase throughput even though occupancy stayed the same.

More likely than not (you should be able to confirm with the help of the CUDA profiler), this change

(1) reduced dynamic instruction count by allowing all but the first address to be computed by simple addition
(2) allowed the compiler to schedule loads earlier and batch them, increasing latency tolerance and improving the efficiency of memory accesses

I would not classify this as an ILP-related technique (which is why I asked), but I guess one could argue about the exact definition of that. To first order, a GPU is a scalar processor which can schedule one instruction per thread per cycle for execution, i.e. the question of ILP doesn’t even come up. There are various exceptions to this first-order description which vary by GPU architecture, but I am not aware of any GPU that shipped since CUDA came into existence that issues more than two instructions per thread per cycle under any conditions.

If someone has better information, I encourage comments on this. There have been too many different architectures to keep all details in my memory.

I have no idea why more than 2 consecutive independent instructions would improve performance, for the same reason you mentioned. However, I followed the methods the author suggested would increase ILP, and the results followed the trend he described, so I was (perhaps dangerously) assuming he was right and that some architectural detail I did not know about allowed the improvement (which is why I asked about this above). While I can't endorse the author's claims, I feel I can't simply brush them off either.

  1. How does reducing instruction count increase performance?
  2. What does it mean to schedule load “earlier”?
  3. Why would the above two reasons be a bigger factor than increased ILP? I understand that since the warp scheduler can only schedule 2 instructions from each warp per cycle, we shouldn’t see any performance increase for more than 2 consecutive independent instructions, but couldn’t there be some hardware detail that we are overlooking?

Thanks again for your insight :)

I’m not aware of any CUDA warp scheduler that can exceed dual-issue.
One of the canonical ways to tackle latency bound codes when increasing occupancy is not an option is to increase the work per thread. This seems to fall into that category.

Regarding at least question 2 above, this answer may be of interest:

https://stackoverflow.com/questions/43832429/is-starting-1-thread-per-element-always-optimal-for-data-independent-problems-on/43833050#43833050

I don’t think the work per thread increased. The number of instructions per warp is roughly the same. Where each thread used to iterate 8 times, it now iterates once, but does the work of 8 iterations in that single iteration, so the net work each thread does stays the same.

(1) If a system has a given throughput in terms of number of instructions per unit of time, reducing the number of instructions that need to be executed to accomplish a given task will tend to reduce run time. Let’s assume in your case each load requires twenty instructions for address computation. So eight loads require 8 * (20 + 1) = 168 instructions to be executed. If you batch them up as shown, after the initial address computation, all other addresses can be accomplished by addition, taking only 2 instructions. Now we need 7 * (2 + 1) + 1 * (20 + 1) = 42 instructions to accomplish the same task.

(2) Within a block of machine instructions, instructions can be moved around without affecting the result as long as data dependencies and control dependencies don’t preclude that. In order to increase latency tolerance, the compiler will attempt to move long-latency instructions to the start of a block. Load instructions are known to the compiler to have long latency, so it is eager to place those early in the block. The longer the block of instructions, and the more independent instructions there are, the easier this is. Furthermore, mixing loads and stores can create inefficiencies with DRAM storage. Batching loads with other loads can lead to better utilization of memory access path (from the load/store unit down to the actual DRAM interface). On the other hand, stores are often “fire-and-forget”.

(3) To the best of my knowledge, on those GPU architectures that provide limited dual-issue capability, it can be exploited in only a relatively small number of cases. In other words, it can be quite difficult to construct code that tries to purposefully exploit these capabilities. It is for this reason that I maintain that worrying about ILP is not a very fruitful endeavor on GPUs. It is not one of the main performance-relevant kinds of parallelism on GPUs.

That was really helpful. Thank you.