Running One Thread per CUDA Core Or at least allow for highly divergent, register-intensive kernels


I’m in a situation where I would like to use the parallel nature of a GPU to perform high-fidelity propagations on several items at the same time. Unfortunately, “high fidelity” means very high register counts (>50) and divergent threads due to the application of different regimes when certain criteria are met (for example, going from a laminar to a turbulent fluid flow model when Reynolds number is greater than 5e5).

I’ve been coding CUDA for a few months now, and usually I’ll find ways around this by either implementing intricate 3D texture lookups for data that can be pre-processed, using shared memory for somewhat constant parameters, or by splitting up the calculation steps so that I launch kernels that perform a single step one after another, etc. However, in this case, that’s not possible.

One approach I have used is simply forcing each stream processor (ie each CUDA core) to process a single thread by having every thread that is not a multiple of 32 return immediately. This method results in only one active thread in each warp, so there is no serialization within the warp (serialization is BAD). However, this only resolves the issue of divergent code, and doesn’t alleviate the register usage issue.

The way I see it, if there was a way to configure the kernel launch such that only one thread is run per stream processor, then everything would be fixed for me, since for Compute Capability 2.1 this would amount to ~660 registers per thread (32000 registers per multiprocessor / 48 threads per multiprocessor) and the threads can be as divergent as they need to be since there is a separate processing unit for each thread.

So, my question is, for times when high register counts are unavoidable and there is going to be divergent code, how can I implement this in CUDA and get the performance I know and love?


EDIT: I only need to propagate a few hundred items, not thousands.

You are aware that by doing this you get only 1/32th of the GPU’s computational power? I.e., performance is going to be abysmal compared to a multicore CPU implementation. Also you seem to have a slight confusion about stream processors / CUDA cores and streaming multiprocessors (SMs). A streaming multiprocessor consists of 8, 32 of 48 stream processors / CUDA cores.

Running just one thread per warp does not “solve” the divergence issue. Sure, the divergence goes away, but even fully divergent code with 32 threads/warp will still be at least as fast as your single thread per warp solution.

The best you could do, I think, is to dynamically rearrange the threads (move the data to new threads) to have consecutive threads take the same branches, which would minimize divergence. I believe Nvidia is doing this in their demo raytracer. Googling for “cuda persistent threads” might bring up some info about this.

Thanks for your response. I’ll have to look into the raytracing material a little more; I’m unfamiliar with raytracing as a whole so I’ll need to brush up on that to understand what’s going on and what the problem is that they’re trying to solve.

I’m not sure that this is necessarily true; even if there is only one active thread per warp, that by definition means you have one active thread per CUDA core. I am developing with a Quadro 2000, so in my case there are 48 SPs per SM and 4 SMs, so under the scheme of one active thread per warp I can run 192 propagations at once, which is much more than a CPU despite the CPU’s higher clock speeds. That, coupled with things like texture memory which does 1D, 2D, and 3D interpolations faster than the CPU, solidifies the GPU’s superiority to the CPU in this case.

In my case I’m not doing thousands of propagations, so I don’t need to leverage my GPU’s ability to crunch 6144 threads at a time. I’m basically trying to trade the ability to process lots small computations (ie lots of non-divergent threads with low register count) for processing fewer and larger computations (ie less threads with divergence and high register count).

This is not the way that CUDA devices function. Although CUDA uses multicore CPU-souding terminology, like “threads” and “cores,” the hardware actually executes in a SIMD fashion (like a supercharged version of SSE). The warp of 32 contiguous threads is the indivisible unit of execution. When divergence happens, the warp must be run multiple times (once through each branch) with the appropriate thread slots masked out on each execution. If some threads in the warp actually terminate, then those slots in the warp are masked out permanently, but the warp continues to run (as you would expect). If all the threads in a warp terminate, then the warp is taken out of the scheduling pool.

Your Quadro 2000 is a compute capability 2.1 device, so the 48 SPs per SM are organized into 3 groups of 16. Each group of 16 SPs is designed to complete the execution of a warp instruction (except for a few instructions that take longer) every 2 clock cycles. This gives you a peak throughput of 3 warps every 2 clocks (although the scheduling details of compute capability 2.1 mean that you usually get closer to 2 warps every 2 clocks on average). If a warp only has 1 active thread, then the other 31 slots in the warp are unused and just bubble through the pipeline doing no work. The hardware cannot fuse single threads from different blocks into a single warp, or do any other trickery to dynamically remap threads into full warps. (Although, it could be really neat if future hardware did this!) As a result, with one thread per warp, your per-thread throughput is 1/32 of the peak throughput.

Another way to think about this is that with one thread per warp active, your GPU can only do 12 single precision multiply-add instructions per clock cycle (and probably more like 8). Given that your GPU has roughly half the clock rate of your CPU, that is unlikely to be competitive with CPU code. (However, short, inefficient CUDA calculations can be useful as part of a larger processing chain to avoid the overhead of GPU-to-CPU memory transfer.)

Thank you Seibert for the clarification. I suppose my (faulty) understanding on how this works stems not only from the similar terminology, but also from how the math seems to substantiate it: for my Quadro 2000 there is a max of 1536 threads that can be run on each SM. There are 48 SPs in each SM, so dividing 1536 threads by 48 SPs yields 32 threads per SP, which just so happens to be the warp size. Since warp size seems to be constant for all compute capabilities, I just supposed that there was some sort of phenomenon that I was unaware of that empowered the SP to do 32 calculations all at once…the partitioning of the SPs so that they do a half warp each clock cycle makes a lot more sense!

In light of how the GPU actually handles the processing of warps, do you have any ideas on how to go about laying out a kernel with unpredictable divergence and high register counts? As I’ve said before, I’ve tried to do things like split up the computation, pre-calculate where possible, etc but I still have the divergence and high register count. I would like to avoid using launch_bounds to limit register usage because in my experience it seems to hurt more than help, but perhaps it can be used in another way to actually make things better.

EDIT: I guess it seeems somewhat silly to me to have all of these independent SPs (well, I assume they’re independent) without the ability to configure kernels to run such that each thread is run on a single SP, allowing the equal distribution of registers to each SP. While this may not be ideal for the majority of applications, I think this would be the best configuration for kernels that must have high levels of divergence and register usage.

The SPs aren’t independent. All SPs of one SM (streaming multiprocessor) share the same control logic, which forces all threads of a warp to execute the same instruction at a time.

Conditionals are handled by sequentially executing both branches of [font=“Courier New”]if () … else …[/font] clause and discarding results on the non-applying branch (unless all 32 threads of the warp take the same branch, in which case the untaken branch is not executed). In the worst case (all 32 threads of a warp take different branches), this means that 31 results are discarded and just one is kept. However, at least the code to evaluate the condition is the same for all threads and executed in parallel. So even in the worst case of fully divergent warps this is still faster than your solution where with just one thread per warp, where 31 threads just run empty (always discard their result).

If registers is your only problem, you could look at AMD hardware too.
They have 16K vector-registers per CU. Each vector-register can hold 4 floats together…
So, thats like having 64K Scalar Registers capable of holding a “Float”…

But that said, their shared memory is very very slow. About 7x slower than Register Reads.

That is an impressive number of registers, but considering my environment is already set up for CUDA, in addition to the fact that I have coded many other functions in CUDA, I’m going to stick with the setup I’ve got.

tera, I would think that there is some degree of independence between SPs; if this wasn’t the case then you’d have to worry about divergent code between warps instead of only within a warp. I suppose, however, that the scheduling entity would issue instructions to 16 SPs at a time (in accordance with seibert’s post), so if that is the case then there would only be a semblance of independence between each group of 16 SPs.

But that’s all beside the point; the main question I had was how to go about running divergent and register-hogging kernels. I was hoping that, despite such kernels not being very good practice in the world of CUDA under the prevailing paradigm, there would be a way to do this. My initial idea was to micromanage each SP such that it would dedicate itself to executing one thread at a time from start to finish. If this was possible, you’d have a gratuitous amount of registers for each thread, and the fact that there is a separate processing entity executing each thread means that divergence would not be an issue. But, it seems like the instruction scheduling/issuing entity might be to blame for the infeasibility of this idea.

What if

__launch_bounds__(32, 1)

was used? For CC 2.1 and assuming adequate shared memory usage, the occupancy would be bounded by 8 resident blocks per SM, allowing a whopping 128 registers per thread. I was also considering

__launch_bounds__(16, 1)

in an attempt to effectively reduce the warp size but I think the card would still default to 32 threads/warp and so you’d have idle SPs every other clock cycle. I guess you could still use that if your code is super divergent and you wanted to further reduce serialization in exchange for less threads being run at a time.

That scheduling entity indeed exists. It consists of 32 threads and is called a warp.

Whether the 32 threads execute on 16 SPs that are ‘double-pumped’ with regard to the instruction dispatch logic (or on 8 SPs that run 4× as fast as instruction dispatch) or not is an invisible implementation detail. Fact is that there is only enough hardware to decode one instruction per 32 threads. Think of SMs as ultra-wide SSE (or MMX or Altivec or SIMD) units with a few additional quirks to temporarily disable some of the scalar elements so that the illusion of a conventional programming model can be maintained (unless you are concerned about performance).

The main advantage of GPUs over CPUs comes from the fact that GPUs save on the control logic (and caches) and spend their transistor budget on additional arithmetic units instead. (Another advantage comes from the ‘hyperthreading’ on the GPU being much more advanced than what Intel ever managed to do).

How do you intend to ‘micromanage’ the individual SPs? I think you are mistaken here by Nvidia renaming the SPs, which really are just ALUs/FPUs, to ‘cores’. And I fully uphold my rant from another thread that this was done by Nvidia’s marketing department to intentionally mislead customers.

[font=“Courier New”]launch_bounds()[/font] is just a hint to the compiler how you intend to launch the kernel and how many registers it may thus use per thread. It is indeed the way to go if excessive register use limits the occupancy of your kernel, but it does not affect the problem of divergence.

The blocksize in the [font=“Courier New”]launch_bounds()[/font] directive does not change the execution configuration, that is still governed by the value inside [font=“Courier New”]<<< >>>[/font] at kernel launch (if you use the runtime API).

Even divergent code is still more efficient than not using the remaining SPs at all.

My approach to dealing with divergent warps:

  1. Avoid divergence if it is easy. Sometimes a simple reorganization of the assignment of data to threads is sufficient to avoid the problem.

  2. If it isn’t easy, just embrace the divergence and benchmark the result. CUDA degrades pretty gracefully with divergence, so you really want to see how slow it is before you expend a lot of effort avoiding it. I have plenty of suboptimal CUDA code that is “good enough” in the context of the program it runs in.

  3. If the divergent code really is a significant bottleneck in your performance, go back to 1 and think harder. You may have to change up your algorithm completely rather than try to patch your existing one. If there is no good solution, then I just give up and stick with CPU code.

This isn’t really earth-shattering, but a lot of people want to skip directly from #1 to #3 without measuring how severe the problem is for their application first.

slugwarz, i think your approach(using only threads with threadIdx.x%32==0) might improve performance for the corner case you describe(highly divergent, many registers, only hundreds of items).
Mainly because the SPs are severly underutilized anyhow(lots of idling because only 100ths of items, you need multiple items per SPs to hide alu and memory latencies).

However, there is no way to get access to registers of the unused threads(afaik), and i doubt this will change anytime soon since your problem is very specific(not really suited for GPGPU maybe?).
You could use shared memory to manually spill registers. Not as fast as registers of course, but better than automatic spilling to local memory in this case(since the local memory access pattern is optimized for all threads in warp accessing the local memory in a burst).

Two alternative approaches i can think of:
1.) Reorder temporary data based on branching outcome(to 2 shared memory arrays maybe), then process BranchA with DataA, BranchB with DataB to avoid warp divergence.
2.) KernelA writes temporary data to mapped host mem, handle serial cases by CPU in mapped host mem. KernelB uses the CPU processed data again.
Both alternative approaches might only work when little temporary data is needed.

Thank you everyone for your help. I think I can make something work with what I’ve gleaned from this dialog. Also thanks to tera for clearing up the terminology misunderstandings over the use of the word “core” when “ALU” is more appropriate.