Questions on Thread-level resource management

I am using CUDA Toolkit 10.0 and targeting SM_61.

  1. Why is a warp-wide memory fence necessary (assuming a non-divergent warp) if warps are truly SIMT?

  2. What is the “SIMT unit” mentioned in the CUDA Toolkit Documentation, PTX ISA section 3.1? Is it a warp scheduler?

  3. The same section of the documentation states that shortages of registers and shared memory are what prevent multiple blocks from residing in a single SM. Are there any other resources that would prevent multiple blocks from occupying the same SM, or are those two the only ones?

  4. Are compute cores (CUDA cores, fp64 cores, …) also distributed to individual threads within an SM?

  5. When we specify the maximum number of registers per thread at compile time, are special registers that hold predefined values (such as threadIdx) also included?


  1. A warp-wide memory fence may be necessary when threads in the warp are communicating with each other through shared memory or global memory. At the C source code level, the need for this is not obvious, because one thread will appear to have written to a location at point A in the execution sequence, and another thread will appear to read from that location later, at point B in the execution sequence. However, the compiler is generally free to “optimize out” such loads and stores, and preserve those values (at least temporarily) in registers, which are not visible to other threads. Therefore a memory fence enforces “ordering”: it forces one thread to actually write out the written value to memory at point A, and forces the other thread to actually read in the value from memory at point B, thus making the code work “correctly”. For this to work (i.e. be guaranteed to work correctly), a memory barrier (or other memory ordering mechanism, such as volatile) of some sort would be needed between points A and B.
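A minimal sketch of the situation described above (hypothetical kernel; names are mine, not from any real code). Without `volatile` or a fence, the compiler could keep `data` in a register, so the store at point A would never be observed at point B:

```cuda
// Two lanes of the same warp communicate through shared memory.
// "volatile" forces actual loads/stores to shared memory instead of
// allowing the compiler to cache the value in a register.
__global__ void warp_exchange(int *out)
{
    __shared__ volatile int data;

    if (threadIdx.x == 0)
        data = 42;                  // point A: lane 0 writes

    __syncwarp();                   // order the warp's accesses (CUDA 9+);
                                    // on sm_61, lockstep execution made
                                    // this largely implicit

    if (threadIdx.x == 1)
        out[0] = data;              // point B: lane 1 reads the written value
}
```

This is a sketch, not a tuned pattern; the point is only that either `volatile` or an explicit synchronization/fence is what guarantees the value actually travels through memory.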

  2. In the future, provide a link for such questions, OK? It makes it easier for me to just click on the link, rather than to have to navigate myself to that section. I think the SIMT unit should be thought of here as everything in the SM that has an administrative function. So that would include the warp scheduler plus anything else that is involved in organizing execution, apart from actual instructions and execution resources (functional units, registers, etc.) themselves.

  3. The SM has various limits. One limit is (for most architectures) 64 warps per SM. This is independent of any other consideration. Another limit is 2048 threads per SM. Again, independent of any other consideration. I don’t know all the limits that may impact the ability of a block to become resident on an SM, but there are more than 2. Registers and shared memory are the primary “resource” limits, but it should be fairly clear that there is some other “resource” underpinning the other 2 hardware limits that I mentioned. It’s just that these limits are not spelled out in detail. The design of an SM is quite complex at a logic level. Most of it is not specified. Enough is, or should be, specified in order for the programmer to have an effective, coherent programming model.
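Several of those per-SM limits can be inspected at run time. A small host-side sketch using the CUDA runtime API (device 0 assumed; field names are the real `cudaDeviceProp` members available in CUDA 10):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Fixed per-SM thread limit (2048 on most architectures),
    // plus the register and shared-memory "resource" limits.
    printf("max threads per SM   : %d\n",        prop.maxThreadsPerMultiProcessor);
    printf("registers per SM     : %d\n",        prop.regsPerMultiprocessor);
    printf("shared memory per SM : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```

Dividing these per-SM totals by a kernel's per-block usage gives a rough upper bound on how many blocks can be co-resident, which is essentially what the occupancy calculator automates.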

  4. A thread is an ephemeral software concept. Functional units within the SM are not assigned to a particular thread for more than a single instruction cycle (“issue slot”) at a time. There is no formal partitioning of functional units to threads. The specific design/organization of the SM may in fact mean that a particular warp lane will have a set of “resources” that it typically uses, but this is not specified, and may vary from one GPU architecture to the next. The programming model does not standardize this.

  5. No, special registers are not included in this limit.
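For reference, the per-thread register budget mentioned in question 5 can be constrained either file-wide on the nvcc command line or per kernel; a sketch (kernel name is hypothetical):

```cuda
// File-wide cap for every kernel in the translation unit:
//   nvcc -maxrregcount=32 kernel.cu
//
// Per-kernel alternative: ask the compiler to fit the kernel so that
// 256 threads/block with at least 4 resident blocks/SM is possible,
// which indirectly bounds registers per thread.
__global__ void __launch_bounds__(256, 4) my_kernel(float *x)
{
    // Special registers such as %tid (read via threadIdx) are not
    // counted against this budget.
    x[threadIdx.x] *= 2.0f;
}
```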

Oh sorry for that. Yes, will do in the future.

Thanks again for the great answers!

Upon processing your answer more carefully, I had a few follow-up questions:

  1. From a book on CUDA ( “Professional CUDA C Programming” ), I’ve read that CUDA uses a weakly-ordered memory model, which the book defines as a memory model where “memory accesses are not necessarily executed in the order in which they appear in the program”. There is no further explanation in the book about what that means. Do you think your answer #1 is a more detailed explanation of the weakly-ordered memory model, or are they different?

  2. My knowledge of a memory fence is: the calling thread stalls until the memory write is visible to all other threads. The scope of “other threads” depends on the scope of the fence used. What does it mean to “make a memory write visible”? Is it what you have mentioned in #1?

What was described in (1) above deals with the re-ordering of instructions that are independent of each other, as expressed by the code itself. The compiler has no notion of run-time configuration and examines the code assuming it is executed by a single thread. If there is no data or code dependency expressed for particular operations in the code, a C++ compiler is free to re-order these operations including loads and stores.

Reductions in particular often involve cross-thread data dependencies that are not expressible by C++ code alone. Without the addition of explicit fencing or synchronization of some kind, the sequence of loads and stores necessary for proper operation of the reduction code cannot be guaranteed. The use of “volatile” to achieve the desired effect by inhibiting certain compiler optimizations on loads and stores is, in my book, a dirty trick and an abuse of this storage modifier as intended by the C++ designers. It is in common use, however. The cleaner way, IMHO, is to use appropriate fencing and / or synchronization primitives to enforce the required order of reads and writes.
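As a concrete illustration of the two approaches just described, here is roughly what the tail of a warp-level reduction looks like in each style (a sketch, not a tuned implementation; `sdata` is assumed to be a shared-memory array of at least 64 floats):

```cuda
// The classic "volatile" trick: volatile inhibits the compiler from
// caching sdata[] in registers between the dependent steps, so each
// lane actually sees its neighbor's freshly written partial sum.
// Relies on implicit warp-synchronous execution (pre-Volta behavior).
__device__ void warp_reduce_volatile(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// The cleaner way: express the cross-lane dependency explicitly with
// warp shuffles (CUDA 9+). No shared memory, no volatile needed.
__device__ float warp_reduce_shfl(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 holds the warp's sum
}
```

The shuffle version makes the communication visible to the compiler as a dependency, which is exactly the "appropriate synchronization primitives" alternative to the volatile trick.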

Beyond this, the description of “weakly ordered” in the book is likely a reference to how the underlying GPU hardware deals with memory, and its description is correct in my view. Naturally, only the authors of the book can provide an authoritative answer as to what exactly they meant here.

My guess as to why no detail was added is because a weakly ordered memory model gives hardware a lot of latitude in terms of actual behavior, and the details can change from one chip to the next. The Wikipedia article on memory ordering shows that all kinds of different design choices are possible. For example, the memory controller in the GPU could change the sequence ld1, st1, ld2, st2 into ld1, ld2, st1, st2 to improve memory throughput, since grouping loads with loads and stores with stores reduces read-write turnaround when accessing physical DRAM.
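To tie this back to the "make a memory write visible" question above: in practice it looks like a flag-protected handoff. A hypothetical producer/consumer sketch between two blocks (assumes both blocks are co-resident on the device, otherwise the spin loop could deadlock):

```cuda
__device__ int   flag = 0;
__device__ float payload;

__global__ void producer_consumer(float *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 3.14f;            // write the data...
        __threadfence();            // ...and make it visible device-wide
                                    // before the flag is raised
        atomicExch(&flag, 1);       // publish
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (atomicAdd(&flag, 0) == 0) { }  // spin until published
        __threadfence();            // order the flag read before the data read
        out[0] = payload;           // now guaranteed to observe the write
    }
}
```

Without the fences, a weakly ordered machine would be free to let the consumer observe the flag update before the payload write, which is exactly the reordering latitude described above.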

What njuffa said. Also:

The best description of the CUDA memory model is contained in this section in the PTX manual:

It is obviously non-trivial both to describe and apply rigorously.

I’d also encourage you to read the relevant section(s) of the programming guide:

as I believe at least your second question is pretty much answered there.

Thanks a lot njuffa and Robert_Crovella! There seems to be a lot to process, so I’ll make sure to take my time and be thorough.