CUDA WARPS: Conceptual question regarding warps


I have a question regarding warps in CUDA.

My understanding is that the GPU device has many multiprocessors (say N), and each multiprocessor has several processors (say M).

So, if we load a kernel onto the device, it is executed as a grid of thread blocks. These blocks are scheduled for execution on the multiprocessors.

The active blocks are then split into groups of threads called warps which are executed simultaneously.

Since the maximum number of processors on a single multiprocessor is M, shouldn’t the warp size always be M?

Can anyone please clarify this?

No, it should be a multiple of M. Think of it as an M-lane vector processor with a vector length equal to the warp size.

I was just browsing the CUDA Programming Guide (Appendix A), and it mentions that the warp size is 32 threads…

So, if the warp size is ‘fixed’ at 32 threads, does it mean that the number of processors in each multiprocessor (M) must be a factor of 32?

Does this mean that the value of M is restricted?


Yes, the number of multiprocessors is 8 in all current CUDA cards. The warp size is also fixed at 32 for all current cards, although you can query the warp size at runtime. Future devices may have different values.

8 cores per multiprocessor, you mean. One of my cards has 12 multiprocessors, another 16.

This has been said before, but just as a starter:

Each GPU has up to 16 multiprocessors on the top models. Each MP currently has 8 SIMD processors.

The warp size of 32 is easily explained if you look at the details of the instruction set.

The most common instructions take 4 clock cycles. Since a warp can issue 8 threads at a time (remember the 8 SIMD processors), it takes 4 steps to issue all 32 threads. At that point the pipeline has finished the first 8 threads.
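
The 8-lanes-times-4-steps arithmetic above can be sketched on the host. This is only a toy model in plain C++ (not GPU code, and not a description of how the hardware scheduler is actually built); `issue_schedule`, `warp_size`, and `lanes` are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// Toy host-side model (assumption: threads of a warp are issued in
// consecutive groups of `lanes`). Returns, for each issue step, the
// indices of the threads within one warp that are issued in that step.
std::vector<std::vector<int>> issue_schedule(int warp_size, int lanes) {
    assert(warp_size >= 0 && lanes > 0);
    std::vector<std::vector<int>> steps;
    for (int first = 0; first < warp_size; first += lanes) {
        std::vector<int> step;
        // Fill one step with up to `lanes` consecutive thread indices.
        for (int t = first; t < first + lanes && t < warp_size; ++t) {
            step.push_back(t);
        }
        steps.push_back(step);
    }
    return steps;
}
```

Calling `issue_schedule(32, 8)` yields 4 steps of 8 thread indices each, which matches the 4 issue steps described above for a 32-thread warp on 8 SIMD processors.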

You can ensure scaling by splitting your problem into as many blocks as possible. As long as you do this and keep at least 64 threads per block (== 2 warps), your algorithm should be independent of the number of multiprocessors.
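
As a rough illustration of that sizing advice, here is a host-side C++ sketch of the block and warp arithmetic. The names `LaunchShape` and `launch_shape` are hypothetical, and the default warp size of 32 is simply the value quoted earlier in this thread, not something queried from a device.

```cpp
#include <cassert>

// Hypothetical helper type: how a problem of n elements is split up.
struct LaunchShape {
    int blocks;           // number of thread blocks needed to cover n elements
    int warps_per_block;  // warps each block is split into
};

// Assumes one thread per element; warp_size defaults to 32 as stated
// in the thread for current cards.
LaunchShape launch_shape(int n, int threads_per_block, int warp_size = 32) {
    assert(n >= 0 && threads_per_block > 0 && warp_size > 0);
    LaunchShape shape;
    // Round up so a partially filled last block still covers the tail.
    shape.blocks = (n + threads_per_block - 1) / threads_per_block;
    // Round up because a partially filled warp still occupies a full warp.
    shape.warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return shape;
}
```

For example, `launch_shape(1000, 64)` gives 16 blocks of 2 warps each. The block count depends only on the problem size and the block size, so the same split works whether the card has 12 or 16 multiprocessors.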

In my opinion the number of SIMD procs per MP will not increase too much, as SIMD constraints then become a problem for scientific computing.


Ah, nuts. Yeah, I mean 8 processors per multiprocessor. Sorry to add to the confusion.