Clarification on concept-to-hardware mapping

Hi, sorry for the newbie questions, but none of these were addressed in the FAQ or the programming guide. I’ve done plenty of multithreaded programming, so I’m comfortable with code being executed concurrently and aware of the problems that can arise from that, but I’m still having trouble understanding some of the basic CUDA concepts. To me it seems that the SDK is trying to abstract away the hardware, yet there are so many limitations in place that you have to know the hardware to use the SDK. Anyway, here goes:

  • why is the warp size 32? To me it seems that since each multiprocessor has 8 processors, 8 would have been a better warp size. Is 32 threads per warp a universal truth? Should I code with that in mind, will it change, and where did 32 come from?

  • if a multiprocessor has 8 processors, I can understand why the max number of active blocks is 8 (each processor is processing a block), but how can the max number of active warps be 24 and the max number of active threads be 768, as stated in appendix A.1 of the programming guide? Is each processor really capable of handling 96 operations all at once? If so, is there something lower than the processor that isn’t mentioned in the programming guide, like each processor having 16 ALUs each capable of handling 4 floats or something? From my understanding it goes: a card has 1-16 MPs, each MP has 8 processors, and each processor can do XXX operations in parallel. I understand that with Intel SIMD you can do 4 adds/muls/etc. per op, but 96?

  • does each multiprocessor have a scheduler? Why is the max number of threads per block 512 if there is a scheduler? Where does 512 come from if, 11 lines down in appendix A.1, it says an MP can handle 768 threads concurrently? Or does active not mean concurrent?

Any help would be greatly appreciated. We’re trying to get some computer vision algorithms onto the GPU and really want to use this technology.

And I thought register combiners were a pain…

Thanks,

Tom

It’s conceivable that the warp size could change; it’s a runtime-queryable property of the GPU device you’re attached to. I’d like a smaller warp size personally…
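
For example, something along these lines (just a quick sketch, not tested) queries the warp size at runtime instead of hard-coding 32:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        // Read the warp size from the device instead of assuming it is 32.
        printf("warp size:             %d\n", prop.warpSize);
        printf("multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }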

As I recall, the reason it’s 32 rather than 8 has to do with the hardware being pipelined. I had thought this was in the programming guide at some point, but if not, it’s certainly mentioned in the UIUC ECE498 class notes somewhere. You may find the UIUC notes interesting if you’re curious about such things.

One wants to schedule more than one thread per SP for the purposes of latency hiding…

Not all of the threads on an SM are executing at once, but they are “on” the hardware, ready to run when given the opportunity (e.g. while another thread is waiting on a global memory read).
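
To make that concrete, here’s a toy kernel (the names and the scaling operation are just made up for illustration). While one warp is stalled waiting for its load of in[i], the SM can issue instructions from the other resident warps:

    // Each thread reads one element from global memory and scales it.
    // A warp waits hundreds of cycles for its load of in[i]; during that time
    // the SM runs instructions from other active warps, which is why you want
    // many threads resident per SM.
    __global__ void scale(float *out, const float *in, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * in[i];
    }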

The word “active” means that the thread state is on the SM, in registers and so on, but that does not necessarily mean all of those threads are executing concurrently; some are waiting. As you read more you’ll understand these distinctions better. Perhaps the best way to think about it when writing code is that one warp of threads is running at once; the other threads are on the SM, waiting their turn to run.

Multiple blocks can be co-scheduled onto a single SM, so even though the max number of threads in a block is 512, you could co-schedule up to 768 threads by having 3 blocks of 256 threads each mapped to that SM. The percentage of this “full SM” capacity in use is referred to as “occupancy” in the profiler. You don’t necessarily have to have high occupancy in order to achieve high performance, but it helps for kernels that can benefit from overlapping computation with global memory operations.
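
So, concretely, a launch along these lines (just a sketch; the scale kernel is the toy one from above and the problem size is made up) uses 256 threads per block, so up to three such blocks can be resident on an SM whose limit is 768 active threads, provided their register and shared memory usage also fits:

    int n = 1 << 20;                              // example problem size
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));

    int threadsPerBlock = 256;                    // 3 x 256 = 768 = one full SM's worth
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_out, d_in, 2.0f, n);

    cudaFree(d_in);
    cudaFree(d_out);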

Cheers,

John

Wonderful, thanks for the info and the link to the course pages.