Question about warps, blocks and threads

I don’t understand some of the CUDA specifications very well.
1. What is a warp? Is it a group of parallel threads in a block?

2. We have this information:
“the maximum number of active blocks per multiprocessor is 8”
“the maximum number of active warps per multiprocessor is 24”
“the maximum number of active threads per multiprocessor is 768”
“the warp size is 32 threads”.

So, if we have 40 threads in the same block but the warp size is 32 threads, do we need 2 cycles? Does that mean real parallelism doesn’t exist??

thank you

I thought a warp consisted of 32 threads handled at a time? So if you have 40 threads in a block, it will indeed take 2 cycles to handle them…

Why wouldn’t parallelism exist this way? 32 threads are handled in one cycle…

EDIT: This happens on every streaming multiprocessor, so the more blocks you have, the more parallelism… So if you use 2 blocks to process your 40 threads, it’ll be faster than using just one…
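A minimal sketch of the two launch shapes being compared (the fill kernel and its names are mine, not from this thread):

```cuda
#include <cstdio>

// Hypothetical kernel used only to illustrate launch configurations;
// each thread writes its global index.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 40;
    int *d_out = 0;
    cudaMalloc(&d_out, n * sizeof(int));

    // One block of 40 threads: the block is split into two warps
    // (32 threads + 8 threads) that run on a single SM.
    fill<<<1, 40>>>(d_out, n);

    // Two blocks of 20 threads: each block fits in one (partially
    // filled) warp, and the two blocks can land on different SMs.
    fill<<<2, 20>>>(d_out, n);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```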

Correct me if I’m wrong, but I think it’s like that…

OK. I agree with you.

The answers to all of these are important… it’s all covered in the Programming Guide. Understanding the computation model really is key to getting your software running at full performance.

A warp is a bundle of 32 threads which work together in lockstep. If one of the threads is adding two numbers, the other 31 threads are doing the same, or temporarily doing nothing. Much of the raw power of GPU programming comes from this ability to run big chunks of 32 computations at once instead of one at a time.
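Here’s a minimal sketch of that lockstep behaviour; the kernel and its names are hypothetical:

```cuda
// The branch below splits each warp: while the even lanes execute the
// addition, the odd lanes are "temporarily doing nothing", and then
// the roles swap for the subtraction.
__global__ void divergent(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    if (i % 2 == 0)
        out[i] = a[i] + b[i];   // even lanes run this while odd lanes idle
    else
        out[i] = a[i] - b[i];   // odd lanes run this while even lanes idle
}
```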

Blocks do indeed give you even more parallelism, since every SM on the GPU can be running at once. For example, a GTX 280 card has 30 SMs, and each one is independent and runs its own blocks (with their own warps).
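If you want to check the SM count on your own card, here’s a small sketch using the runtime API’s standard device query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On a GTX 280 this reports 30; each SM independently runs its
    // own resident blocks (and their warps) at the same time.
    printf("SM count: %d\n", prop.multiProcessorCount);
    return 0;
}
```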

But you need to read the Programming Guide for all the details. For example, one SM usually runs more than one block at once by interleaving the execution of the warps in the (multiple) blocks. This interleaving happens at a very low level and lets the GPU hide much of the memory latency that limits most algorithms. As you study the Programming Guide, you’ll start peeling back the abstractions and learn even subtler points, like the fact that there are 8 SPs inside each SM, not 32, but they run with a 4-clock scheduler, so one warp instruction is issued across the 8 SPs over 4 clocks.
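To make the arithmetic behind the limits quoted at the top of the thread concrete, here’s a back-of-the-envelope sketch (the occupancy reasoning is my own reading of those numbers, so double-check it against the guide):

```cuda
#include <cstdio>

int main(void)
{
    // Limits quoted at the top of the thread (compute capability 1.0/1.1).
    const int max_threads_per_sm = 768;
    const int max_blocks_per_sm  = 8;
    const int warp_size          = 32;

    // 768 / 32 = 24, matching the quoted "24 active warps" limit.
    printf("max warps per SM: %d\n", max_threads_per_sm / warp_size);

    // With at most 8 resident blocks, keeping all 24 warps active
    // needs at least 24 / 8 = 3 warps (96 threads) per block.
    printf("min block size for full occupancy: %d threads\n",
           max_threads_per_sm / max_blocks_per_sm);

    // 8 SPs each running the same instruction for 4 clocks covers
    // 8 * 4 = 32 lanes: exactly one warp per instruction issue.
    printf("lanes per warp issue: %d\n", 8 * 4);
    return 0;
}
```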

Thank you, SPWorley, for your time and information.