GPU Code Distribution

How does the GPU distribute machine code instructions to all of its threads? Every thread has to know what it needs to run, so what feeds it these instructions? How and when do they get sent from RAM to the GPU, and where are they initially stored?

In general, I would like to understand how an assembly instruction goes from a single chunk of memory to being executed by multiple warps, each with multiple threads. I want this understanding in order to answer questions like these:

  • Does every thread get its own copy of the constant values embedded in assembly instructions?
  • OR does every warp have a single program counter that handles all of its child threads?
  • Is there significant per-instruction overhead for distributing the machine code?

I have tried looking up some resources but most fall short of really explaining the execution distribution process. If anyone has a good link or any good resource, please do share! Please note, while interesting, my main focus is not on the compilation of CUDA code but rather what the GPU does with the final result of CUDA compilation (machine code).

Thanks for your time!

NVIDIA does not document this level of detail, but the first thing to appreciate is that the terms “thread” and “core” are used in CUDA differently than they are in the traditional pthreads-style programming model.

I find it much more useful to think of the GPU hardware as a multicore SIMD machine, where the “streaming multiprocessor” plays the role of the “core” you would find on a traditional CPU architecture, and each “CUDA core” is simply a pipelined ALU for one lane of a SIMD word. Each multiprocessor has multiple instruction decoders and can have several instructions in flight at once.
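To make that mapping concrete, here is a minimal sketch (my own example, not from the original discussion) that queries the runtime for the numbers behind the "multicore SIMD" picture; the fields used are standard members of cudaDeviceProp:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Query device 0 and print the figures that correspond to the
        // "multicore SIMD" view: SM count ~ core count, warp size ~ SIMD width.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("SMs (the 'cores'):        %d\n", prop.multiProcessorCount);
        printf("Warp size (SIMD width):   %d\n", prop.warpSize);
        printf("Max threads per SM:       %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Registers per block:      %d\n", prop.regsPerBlock);
        return 0;
    }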

The CUDA programming model lets you program this SIMD hardware with a thread-like software abstraction, which makes for much more readable code and control structures, but the abstraction is kind of leaky. Code with high levels of branch divergence within a warp will perform very poorly because of the underlying SIMD nature of the hardware.
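As a rough illustration (my own sketch, assuming the usual 32-thread warp), compare a kernel whose branch condition differs between lanes of the same warp with one whose condition is uniform across each warp:

    // Odd and even lanes of the same warp take different branches, so the
    // hardware executes both paths one after the other with lanes masked off.
    __global__ void divergent_kernel(float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            out[i] = sinf(out[i]);
        else
            out[i] = cosf(out[i]);
    }

    // Here the condition is constant within a warp (threadIdx.x / 32 does not
    // change across a warp's 32 lanes), so each warp follows a single path.
    __global__ void uniform_kernel(float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)
            out[i] = sinf(out[i]);
        else
            out[i] = cosf(out[i]);
    }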

So, to answer your specific questions: the best I have been able to deduce from the manuals, whitepapers, forum posts, etc., is that the machine code for a kernel lives in device memory and is fetched by the multiprocessors through several layers of cache at the GPU and multiprocessor level. Each warp has its own program counter (not each thread), and each warp is assigned a bank of SIMD registers from the register file when the kernel starts (the size depends on the specific kernel). I assume the caching hides the effect of memory latency and bandwidth on instruction fetch quite effectively, mostly because I have never seen anyone run into instruction bandwidth as a specific bottleneck. I think that would require a very unusual kernel: extremely long, with no loops, and made up purely of simple arithmetic instructions.
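If you want to see the per-kernel register footprint mentioned above, one way (a sketch under my own assumptions; the exact output and numbers vary by toolkit version and architecture) is to compile with ptxas verbose output enabled:

    // saxpy.cu -- compile with:  nvcc -Xptxas -v saxpy.cu
    // ptxas then reports something like "Used N registers" for the kernel,
    // i.e. the per-thread slice of the register file granted to each warp.
    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // 'a' comes from the kernel parameter
                                      // space, not a per-thread copy
    }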

If you want to learn more about the hardware model for CUDA, it is useful to read through the PTX manual. PTX is not the assembly language of the GPU hardware itself, but rather assembly language for a virtual GPU that is translated by the GPU driver (or PTX assembler) into machine code for your specific GPU architecture. Nevertheless, PTX is designed to mimic the structure of the underlying hardware, so it gives some insight into how things are organized.
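A quick way to connect the manual to something concrete (my own sketch; the file names are arbitrary) is to have nvcc emit the PTX for a trivial kernel and read it alongside the manual:

    // scale.cu -- generate PTX with:  nvcc -ptx scale.cu -o scale.ptx
    // The output is virtual-ISA code: you should see things like a
    // .visible .entry directive for the kernel, .reg register declarations,
    // and ld.global / st.global memory instructions.
    __global__ void scale(float* data, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= factor;
    }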

Each SM (streaming multiprocessor) is a superscalar vector processor, somewhat similar to the old Cray supercomputers.

Thank you so much seibert for the detailed reply and perfect answers. I have already begun looking up more info on PTX!

Also, thank you pasoleatis for steering me towards more info!

Once you have some familiarity with PTX, you might also find the cuobjdump tool interesting. It disassembles the GPU machine code from a compiled kernel, showing you the actual instructions produced by the PTX translation step.
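Typical usage looks something like this (my own example; a.out is just a placeholder for whatever binary your kernel was compiled into):

    cuobjdump -sass a.out    (disassemble the embedded GPU machine code, i.e. SASS)
    cuobjdump -ptx a.out     (dump the embedded PTX, if it was compiled in)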