Clarification on concept to hardware mapping

Hi, sorry for the newbie questions, but none of these were addressed in the FAQ or the programming guide. I’ve done plenty of multithreaded programming, so I’m comfortable with code being executed concurrently and aware of the problems that can arise from that, but I’m still having trouble understanding some of the basic CUDA concepts. It seems to me that the SDK is trying to abstract away the hardware, yet there are so many limitations in place that you have to know the hardware to use the SDK. Anyway, here goes:

  • Why is the warp size 32? It seems to me that since each multiprocessor has 8 processors, 8 would have been a better warp size. Is 32 threads per warp a universal truth? Should I code with that assumption? Will it change? Where did 32 come from?

  • If a multiprocessor has 8 processors, I can understand why the max number of active blocks is 8 (each processor processing a block), but how can the max number of active warps be 24 and the max number of active threads be 768, as stated in appendix A.1 of the programming guide? Is each processor really capable of handling 96 operations all at once? If so, is there something lower than the processor that isn’t mentioned in the programming guide, like each processor having 16 ALUs capable of handling 4 floats each? From my understanding it goes: a card has 1-16 MPs, each MP has 8 processors, each processor can do XXX operations in parallel. I understand that with Intel SIMD you can do 4 adds/muls/etc. per op, but 96?

  • Does each multiprocessor have a scheduler? Why is the max threads per block 512 if there is a scheduler? Where does 512 come from if, 11 lines down in appendix A.1, it says an MP can handle 768 threads concurrently? Or does “active” not mean “concurrent”?

Any help would be greatly appreciated. We’re trying to get some computer vision algorithms onto the GPU and really want to use this technology.

And I thought register combiners were a pain…



It’s conceivable that the warp size could change; it is a runtime-queryable property of the GPU device you’re attached to (the warpSize field of cudaDeviceProp, returned by cudaGetDeviceProperties). I’d like a smaller warp size personally…

As I recall, the reason it’s 32 rather than 8 has to do with the hardware being pipelined. I had thought this was in the programming guide at some point, but if not, it’s certainly mentioned in the UIUC ECE498 class notes somewhere. You may find the UIUC notes interesting if you’re curious about such things.

One wants to schedule more than one thread per SP for the purposes of latency hiding…

Not all of the threads on an SM are executing at once, but they are “on” the hardware, ready to run when given the opportunity (e.g. while another thread is doing a global memory read).

The word “active” means that the thread state is resident on the SM (in registers, etc.), but that does not necessarily mean all of those threads are executing concurrently; some are waiting. As you read more you’ll understand these distinctions better. Perhaps the best way to think about it when writing code is that one warp of threads is running at any instant, while the other threads are on the SM, waiting their turn to run.

Multiple blocks can be co-scheduled onto a single SM, so even though the max number of threads in a block is 512, you could co-schedule up to 768 threads by mapping 3 blocks of 256 threads each to that SM. The percentage of this “full SM” capacity in use is what the profiler calls “occupancy.” You don’t necessarily need high occupancy to achieve high performance, but it is helpful for kernels that can benefit from overlapping computation with global memory operations.



Wonderful, thanks for the info and the link to the course pages.