Thread Scheduling Concept

Dear All,

I am new to GPU computing and, to get started, I am studying the book “Programming Massively Parallel Processors”. I was going smoothly, but now I am stuck on the concept of thread scheduling. As far as I have studied, an SM has 8 SPs; my device is a GeForce 310. After checking its device query I found that it has 2 multiprocessors x 8 CUDA cores. I read in some other posts that CUDA cores are SPs (not sure about this, please reply to this as well). Then I read that the warp size is 32 threads. What do they mean by a warp?
In the book they give the example of the GT200, which can have only 1024 threads per SM. In my device’s case, 512 threads can be accommodated in a single block, so does that mean an SM of my device can accommodate a total of 512 (threads) x 8 (blocks) = 4096 threads, out of which 32 (the warp size) threads will execute a single instruction?

Please respond to this question so that I can proceed further.
Waiting for a favorable response.
Thanks

Some clarifications, but you’d better read the NVIDIA CUDA documentation extensively: take some notes and re-read it, including the appendices, because the devil is in the details :)

    [*]CUDA cores are also called SPs

    [*]A warp is a group of 32 threads that follow the same execution path (you have to understand warps to understand divergence!)

    [*]You could launch a huge number of threads on your device, but 384 threads (4 x the number of CUDA cores x 6 to cover latencies) is the right number to begin with; see the deviceQuery-style sketch just after this list to check these figures on your own card
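
If you want to check these figures yourself, a minimal deviceQuery-style program using the runtime API looks roughly like this (field names come from cudaDeviceProp; note that the number of CUDA cores per SM is not exposed directly, so the 8-SPs-per-SM figure for a GeForce 310 comes from the hardware documentation):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int devCount = 0;
        cudaGetDeviceCount(&devCount);

        for (int dev = 0; dev < devCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);

            // The figures discussed above: SM count, warp size, block limits.
            printf("Device %d: %s\n", dev, prop.name);
            printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
            printf("  Warp size:               %d\n", prop.warpSize);
            printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
            printf("  Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
        }
        return 0;
    }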

There’s no thread scheduling in CUDA, but instead warp (or even half-warp) scheduling, in the same way that you could parallelize processing using MMX, SSE or AVX: there’s just one PC (Program Counter) per warp, and thus the 1 to 32 threads that are in the warp all execute the same instruction (with the exception of conditionals, which use an execution mask).
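
To make the divergence point concrete, here is a minimal, purely illustrative kernel (the name and the branch condition are mine) where the two halves of every warp take different branches; since there is only one PC per warp, the hardware runs the two paths one after the other with the inactive lanes masked off:

    __global__ void divergent(int *out) {
        int tid = threadIdx.x;

        // Lanes 0..15 and 16..31 of each warp take different branches.
        // One program counter drives the whole warp, so the two paths are
        // serialized, with the inactive half masked off on each of them.
        if (tid % 32 < 16) {
            out[tid] = tid * 2;      // first half-warp active
        } else {
            out[tid] = tid * tid;    // second half-warp active
        }
    }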

The 32 threads of a warp begin their pipelined execution over 4 consecutive cycles on the 8 SPs: 8 SPs x 4 consecutive cycles = 32 identical instructions in parallel.

After each group of 4 cycles in which a full warp begins execution of one instruction, another warp (or the same warp) may be elected to go for the next 4 cycles. Scheduling is not documented in depth and “may change at any point”, but it seems to me it’s kind of round-robin.

You may need more than one warp (32 threads) to optimally use the 8 SPs, because of instruction latency as well as register write-read latency (where the next instruction reads a register that is the destination of the previous instruction). So the basic rule is to have at least 6 warps running on the 8 SPs (1 SM). That is 192 threads for 8 SPs, or 24 threads per SP.
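
As an illustration of that rule (the kernel and sizes here are my own example, not anything from the book), a launch configuration giving 6 warps per block, i.e. 192 threads, could look like this; in practice you would launch many more blocks than SMs, so the scheduler always has warps ready while others wait on latencies:

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int threadsPerBlock = 6 * 32;   // 6 warps = 192 threads per block
        const int n = 1 << 20;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        // (data left uninitialized; this sketch only shows the launch shape)

        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }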

Hi,

As I have similar doubts, I am replying to you.

My hardware has 6 SMs x 8 cores = 48 cores in total. If I launch a kernel with 48 blocks, how do the blocks execute?

Will it schedule one block per SM, or one block per core?

It will schedule blocks onto SMs, not cores (and an SM can host more than one resident block at a time if resources allow). The concept of a core is not useful in CUDA programming.
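
If you want to observe the mapping yourself, one debugging trick (not something to rely on, since the assignment is entirely up to the hardware scheduler) is to read the %smid PTX special register from inside the kernel; note that device-side printf needs a card of compute capability 2.0 or later:

    #include <cstdio>

    __global__ void whereAmI() {
        unsigned int smid;
        // %smid holds the ID of the SM this thread is currently running on.
        asm("mov.u32 %0, %%smid;" : "=r"(smid));

        if (threadIdx.x == 0) {
            printf("block %d runs on SM %d\n", blockIdx.x, smid);
        }
    }

    int main() {
        // 48 blocks on a 6-SM device: the scheduler spreads the blocks over
        // the SMs, and each SM typically hosts several of them, concurrently
        // or one after another depending on resources.
        whereAmI<<<48, 64>>>();
        cudaDeviceSynchronize();
        return 0;
    }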