Once again: register spills, performance and nvcc magic

Okay.
I have a pretty big CUDA app here (thousands of lines).
After months of development, I have noticed that the performance of the app has gone down (up to 30% slower).

The nvcc output shows that there are register spills (one function has a 2000-byte load spill, and many functions have up to 200 bytes of load/store spills).

Is there a way, a paper, or a tool that can tell me which lines of my code are causing that?
Even though I have read several GPGPU & CUDA books, I still have absolutely no idea how to deal with this problem. I need to understand it better so that I can find the issue and know how to fix it.

Let's say that I have tried almost everything I found on the internet: maxregcount, launch_bounds, inline and noinline, volatile, restrict, bitfields instead of bools/ints, C-style flags, all of that in every configuration possible.

When the app was smaller, that seemed to help, but not anymore.

Yet these methods are pure magic. Does one just have to try them in every possible configuration, compare the results, and see what works best?

Any help will be greatly appreciated.

Thanks,
Bye.

You really, really want to avoid register spills, especially if they’re occurring in a performance-sensitive part of your kernel.

One workaround is to move to an sm_35 device, since it supports up to 255 registers per thread.

The approach I use to solving unexpectedly high register counts is, unfortunately, inspection of the code and then generous application of hacks.

Some hacks that I’ve used are:

  • remove registers that are holding values that can be easily recalculated on-demand (e.g. a function of the thread index)
  • compact two or more small values into a short2 or char4 vector type. Indices into shared arrays and small counters are good candidates (see the sketch after this list).
  • save expensive scalar calculations to shared memory and perform a broadcast read when the value is needed
  • make sure you're giving the C preprocessor and compiler the opportunity to aggressively expand/collapse/fold away constants and code
  • avoid arrays with local extent — they're the worst! If the compiler can't resolve the indices to constants then you're going to be using local memory (spilling) even if there are registers available.
  • redesign your algorithm
  • (anybody else have some hacks?)
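
Below is a minimal sketch of two of these hacks in a purely hypothetical kernel (the kernel name, sizes and the work it does are all made up): four small counters share a single char4 register, and a cheap thread-index-derived value is computed where it is needed instead of being kept live across the loop.

// Hypothetical sketch: pack small values, recompute cheap values on demand.
__global__ void pack_and_recompute(const float* __restrict__ in,
                                   float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Hack: four small counters packed into one 32-bit char4
    // instead of occupying four separate registers.
    char4 c = make_char4(0, 0, 0, 0);

    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        float v = in[(tid + i) % n];
        acc += v;
        if (v > 0.0f)  c.x++;   // positive samples
        if (v < -1.0f) c.y++;   // outliers
    }

    // Hack: this value is a pure function of the thread index, so compute it
    // here, right where it is used, rather than before the loop where it
    // would have had to stay live the whole time.
    float scale = 1.0f / (float)((tid & 31) + 1);

    out[tid] = acc * scale + (float)c.x - (float)c.y;
}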

A few years ago there was a tool called CUDAvis that visualized your kernel’s register pressure.

I don’t know if that tool is still available or if there are alternatives, but I wish it were part of Nsight for Visual Studio.

A few months ago I filed an RFE (285288) to enhance “nvdisasm” to emit a running total of SASS registers in use. I’ve attached an illustration of how to calculate the running total from SASS output.

Someone could probably implement this with a script. :)
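
As a rough illustration of that idea (this is only a sketch, not the behavior requested in the RFE), a small host-side program could scan cuobjdump/nvdisasm SASS output for R<number> operands and print a running upper bound on the register index seen so far. It ignores liveness, predicate registers and register pairs, so treat the number only as an upper bound.

// Pipe `cuobjdump -sass app` or `nvdisasm` output through this program; it
// prints, per line, the highest general-purpose register index seen so far.
#include <cstdio>
#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string line;
    int maxReg = -1;
    while (std::getline(std::cin, line)) {
        for (size_t i = 0; i + 1 < line.size(); ++i) {
            // Look for operands of the form R<number> (RZ has no digit).
            if (line[i] == 'R' && std::isdigit((unsigned char)line[i + 1])) {
                int r = 0;
                size_t j = i + 1;
                while (j < line.size() && std::isdigit((unsigned char)line[j]))
                    r = r * 10 + (line[j++] - '0');
                if (r > maxReg) maxReg = r;
            }
        }
        std::printf("%4d | %s\n", maxReg + 1, line.c_str());
    }
    return 0;
}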

Consider splitting larger algorithms into multiple smaller kernels, to be executed in sequence. There is added overhead for storing intermediate state to global memory and reading it back, but the smaller kernels may execute so much faster that there is a net gain in speed.
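
For illustration only, here is a sketch of that pattern with two made-up stages that communicate through a scratch buffer in global memory.

// Hypothetical split of one register-heavy kernel into two smaller kernels
// that pass intermediate state through global memory.
__global__ void stage1(const float* __restrict__ in, float* __restrict__ scratch, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        scratch[tid] = in[tid] * in[tid];          // first half of the work
}

__global__ void stage2(const float* __restrict__ scratch, float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = sqrtf(scratch[tid]) + 1.0f;     // second half of the work
}

void run(const float* d_in, float* d_scratch, float* d_out, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    // Extra global-memory round trip, but each kernel needs far fewer
    // registers, so the net effect can still be a speedup.
    stage1<<<grid, block>>>(d_in, d_scratch, n);
    stage2<<<grid, block>>>(d_scratch, d_out, n);
}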

Thanks for the replies!

I have found a .pdf paper about CUDAvis, but I didn’t manage to find the tool itself. Does anybody have it somewhere? (A Google search didn’t help much.)

If the maxregcount is >63, then there are no spills, but the performance is terrible (even on high-end 3.5 cc GPUs). Go figure.

I tried that too. Store spills were reduced, but load spills got much worse …

The nature of the application does not allow using shared memory (nothing is really shared).
But I do use float2/3/4, etc., whenever possible.

I’m not sure how to give them that opportunity … I would be glad if you could explain in more detail …

I had exactly that problem. I have a relatively big constant array (marked as such). Indexing it with a variable results in massive spills. I tried almost everything (including using it from global memory), but nothing seemed to help. Any idea whether it is possible to use such an array and not have spills?

It is a whole application (a physics simulation) rather than a single algorithm. I don’t think a redesign is possible. It was designed from scratch to be GPU friendly, and I don’t see how starting over would result in something different from what it is today. For procedural noise, for example, I need big constant arrays. There is just no other reasonably fast way to do it …

It was written like that in the first place. Then again, the compiler can’t optimize across kernels, and the performance was several times slower.

Do you mean you have a constant array or a locally declared array? Are all lanes in the warp indexing the same array address?

It is hard to make recommendations based on just a brief description and without hands-on assessment. One issue that I have seen in the past is that if there is a lengthy block of straight-line computation pulling data from many different locations, the compiler will try to cover the load latencies better by hoisting most or all of the loads to the front of that block of code. Generally a good idea, except when it eats up too many registers …

Could the constant data be replaced by literal constants, with templated functions taking the place of indexing? Could the constant data instead be placed in a texture? On a Kepler platform, you could also try just using global memory for storage and use LDG to load the data through the texture path (doesn’t require any explicit binding of textures).
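
As a sketch of the last suggestion (the table layout and index math below are made up), the table can live in plain global memory and be read through the read-only data cache with __ldg() on sm_35 and later, with no texture binding.

// Hypothetical kernel reading a 256-entry table through the read-only cache.
__global__ void noise_kernel(const float* __restrict__ table,   // table in global memory
                             float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int idx = (tid * 131071) & 255;        // data-dependent index, differs per lane
#if __CUDA_ARCH__ >= 350
    float v = __ldg(&table[idx]);          // load via the read-only/texture path
#else
    float v = table[idx];                  // plain global load on older parts
#endif
    out[tid] = v;
}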

Large constant tables can easily exceed the size of the constant cache, causing a lot of memory traffic. Have you tried an alternative that relies on computation (noise generation → PRNG ?) instead? FLOPS are almost “too cheap to meter” these days, memory access remains relatively expensive even on a high-bandwidth device like the GPU.
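
One possible sketch of the compute-instead-of-look-up idea, using the well-known Wang integer hash purely as an example; whether its quality is sufficient for your noise is something you would have to verify.

// Cheap integer hash turned into a float in [0,1), replacing a table lookup.
__device__ __forceinline__ unsigned int wang_hash(unsigned int seed)
{
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

__device__ __forceinline__ float hash_to_unit_float(unsigned int h)
{
    return (h & 0x00ffffffu) * (1.0f / 16777216.0f);   // top is 24 bits -> [0,1)
}

// Instead of:  float r = noise_table[idx];
// you would use:  float r = hash_to_unit_float(wang_hash(idx));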

To get an idea of which lines of code are causing register spills, what about hacking the PTX code?

Up to CUDA 4.0, it was possible to get annotated PTX code from the compiler output. Correct me if I’m wrong, but with the new LLVM backend this is not possible anymore. Nevertheless, you can manually annotate the PTX by inserting inline PTX comments with a syntax like

asm volatile ("// code at this line is doing this and this ...");

Concerning the mentioned use of maxregcount, my understanding is that it will increase register spilling rather than reduce it; it is used to hide latencies by allowing more threads per multiprocessor.

Here’s how maxregcount works. It tells the compiler to use no more than this many registers. This doesn’t necessarily mean that it will cause more spilling. This is non-intuitive, but the main reason is that the compiler can do things other than spilling to reduce the number of registers used (for example, trading ILP for fewer values with overlapping live ranges).

Normally the compiler will try to pick a good value for a register target, but this is based on a heuristic, and you can use maxregcount to override the heuristic. You can also exhaustively explore the space by trying multiple values for maxregcount and picking the value that gives you the best performance.
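
The in-source, per-kernel counterpart is the __launch_bounds__ qualifier the original poster already mentioned: the compiler derives a register budget for that kernel from the maximum block size and the requested minimum number of resident blocks per multiprocessor. The numbers below are purely illustrative.

// Per-kernel register-target hint; 256 threads/block, at least 4 blocks/SM.
__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 4 /* minBlocksPerMultiprocessor */)
bounded_kernel(float* data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] *= 2.0f;
}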

Yes, it is constant. And no, they are not indexing the same values.

Templates are an option, but since this code runs on OpenCL too, I hadn’t considered them (though I can ifdef that, of course). The PRNG sounds like a good idea, thanks. Yet it remains magic how to find the most register-hungry lines of code. I’m still not very good at reading .ptx.

Thanks for the help, guys! I really appreciate it.
But still, I strongly believe that NVIDIA should provide more and better tools for developers if they want CUDA to be used for more complex applications (in terms of both algorithms and lines of code).

Just be aware that constant memory throughput is only one word per clock. So if all your lanes access the same constant then it’s one clock but the worst case is 32 clocks if each lane accesses a different address in the .const space. The worst case is quite bad.

Using a texture or shared memory might make more sense.

Alternatively, if the size of your randomly accessed constant array is 32 elements or less you could even use the sm_35 shuffle operation to select a constant from a dedicated register.

This is also useful if your constant array elements are multiples of 32 bits in size.

You can extend this technique beyond 32 elements by using SELP or PRMT ops before performing the SHFL.
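
Here is a sketch of the register-resident table idea, written with the newer __shfl_sync spelling of the intrinsic; the table contents and index math are made up, and the block size is assumed to be a multiple of 32 so the full-mask shuffle is safe.

// Each of the 32 lanes keeps one table element in a register; a shuffle then
// selects the element for a per-lane, data-dependent index.
__global__ void shuffle_table_kernel(const float* __restrict__ table32,  // 32 entries
                                     float* __restrict__ out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    // After this load the whole 32-entry table lives in the warp's registers
    // instead of constant/local memory.
    float myElem = table32[lane];

    int idx = (tid * 7) & 31;                         // per-lane index
    float v = __shfl_sync(0xffffffffu, myElem, idx);  // effectively table32[idx]

    if (tid < n)
        out[tid] = v;
}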