Thread Scheduling Concept

Dear All,

I am new to GPU computing and, to start, I am studying the “Programming Massively Parallel Processors” book. I was going smoothly, but now I am stuck on the concept of thread scheduling. As far as I have studied, an SM has 8 SPs; my device is a GeForce 310. After checking its device query, I found that it has 2 multiprocessors x 8 CUDA cores. I read in some other posts that CUDA cores are SPs (I am not sure about this, please reply on this also). Then I read that the warp size is 32 threads. What do they mean by "warp"?
In the book they give the example of the GT200, which can have only 1024 threads per SM. In my device's case, 512 threads can be accommodated in a single block, so does it mean that in my device's SM a total of 512 (threads) x 8 (blocks) = 4096 threads can be accommodated, out of which 32 (warp size) threads will execute a single instruction?

Please respond to this question so that I can proceed further.
Waiting for a favorable response.

Some clarifications, but you had better read the NVIDIA CUDA documentation extensively, take some notes, and re-read it, including the appendices, because the devil is in the details :)

    CUDA cores are also called SPs.

    A warp is a group of 32 threads that follow the same execution path (you have to understand warps to understand divergence!).

    You could launch a huge number of threads on your device, but 384 threads (4 x the CUDA core count x 6 for latencies) is the right number to begin with.

There is no per-thread scheduling in CUDA, but instead warp (or even half-warp) scheduling, in the same way that you can parallelize processing using MMX, SSE or AVX: there is just one PC (program counter) per warp, and thus the 1 to 32 threads that are in the warp all execute the same instruction (with the exception of conditionals, which use an execution mask).

The 32 threads of a warp begin pipelined execution over 4 consecutive cycles on the 8 SPs: 8 SPs x 4 consecutive cycles = 32 identical instructions in parallel.

After each group of 4 cycles in which a full warp begins execution of one instruction, another warp (or the same warp) may be elected for the next 4 cycles. The scheduling is not documented in depth and “may change at any point”, but it seems to me it is roughly round-robin.

You may need more than one warp (32 threads) to optimally use the 8 SPs, because of instruction latency as well as register write-read latency (where the next instruction reads a register that is the destination of the previous instruction), so the basic rule is to have at least 6 warps running on the 8 SPs (1 SM). That is 192 threads for 8 SPs, or 24 threads per SP.


As I have similar doubts, I am replying to you.

My hardware has 6 SMs x 8 cores = 48 cores in total. I launch a kernel with 48 blocks. Now how do the blocks execute?

Will it schedule one block per SM, or one block per core?

It will schedule one block per SM. The concept of a core is not useful in CUDA programming.