How many parallel threads?

This should be simple to answer for somebody who really knows CUDA well, I think. I have a Tesla C1060. How many threads can be run concurrently?

More specifically, say I have blocks of 16 x 16 threads. This means 4 blocks can fit on each multiprocessor. Does every thread in these 4 blocks run in parallel (or close to it)? If that is the case, then 30 x 1024 = 30720. Is this number the answer to my first question? Also, say there are 1 million threads in a grid. At what level does this code run sequentially, i.e. do I run 30 x 4 blocks in parallel at a time, and when those are finished, move to the next 120 blocks?
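For concreteness, here is roughly the kind of launch I have in mind (the kernel name and array size are placeholders I made up for illustration):

#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element of a 1024x1024 array.
__global__ void scaleKernel(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;
}

int main()
{
    const int width = 1024, height = 1024;      // ~1 million elements
    float *d_data;
    cudaMalloc(&d_data, width * height * sizeof(float));

    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,   // 64 x 64 = 4096 blocks
              (height + block.y - 1) / block.y);
    scaleKernel<<<grid, block>>>(d_data, width, height);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}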

The number of threads that run concurrently is equal to the number of CUDA cores in your graphics card. With 30 blocks of size 16x16 you would have 240 (?) threads executing concurrently (30 MPs with 8 CUDA cores each?). The remaining threads will be “on the fly” and ready to execute thanks to the scheduling. I'm not sure how many blocks and warps can be scheduled on your Tesla, so maybe someone else who knows its architecture better can help you there. But with Fermi you can e.g. schedule 48 warps per MP, which are “sent” to the CUDA cores by the warp schedulers, each executing one warp at a time on the CUDA cores. Scheduling more threads than can be executed concurrently helps hide latency, e.g. from memory accesses. So you will want to have enough warps and blocks to keep your schedulers busy and hide latency.
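If it helps, the MP count (and from it a rough core count) can be queried at runtime; a minimal sketch, assuming the cores-per-MP mapping in the comment (8 for compute 1.x, 32 for Fermi-class parts - it varies by architecture):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Assumed mapping for this sketch: 8 cores/MP on compute 1.x,
    // 32 on compute 2.x (Fermi). Other architectures differ.
    int coresPerMP = (prop.major < 2) ? 8 : 32;
    printf("%s: %d MPs x %d cores = %d CUDA cores\n",
           prop.name, prop.multiProcessorCount, coresPerMP,
           prop.multiProcessorCount * coresPerMP);
    return 0;
}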

Ok, so the number of threads that can be run in parallel is equal to the number of processors, 240 for my Tesla C1060. I'm still a little confused, because I thought each warp had to execute exactly in sync. If I have 7 warps, that is only 224 threads. So either only half-warps are required to be in sync, or the processors can never be filled.

Here is my understanding of the scheduling now: the first warp of the first 7 (depending on the answer to my previous question, maybe 15?) blocks will be run, then the second warp of the first 7 blocks, etc., until the first group of blocks is finished. It then moves on to the next group of 7 (or 15) blocks and executes the warps of these blocks in the sequence just described.

Is my understanding correct? I have read both the programming guide and the book by Kirk and Hwu, and have not seen this explained in enough detail for me to follow.

Sorry, but what do you mean by “have to execute in sync”?
As for your second question, I'm not sure if I hit the point, but maybe this helps:
Blocks are mapped to MPs. You have 30 MPs, afaik, each with 8 CUDA cores. So let's say you have 30 blocks with 32 threads each (not ideal, just an example). Each MP has one warp scheduler and gets one block to run. Each warp scheduler runs one warp in 4 cycles. First, it runs 8 threads of the warp on the 8 CUDA cores of the MP, then the next 8, and so on until all 32 threads of this warp and block have been executed. The same happens on the other MPs. So to make use of all your CUDA cores and run the maximum amount of threads simultaneously, you need 1 block per MP. And the 32 threads of a single warp (always within one block) are executed by one single MP - 8 at a time (times 4, because you always have 32 threads per warp even if you specify a block size of less than 32 - some of the threads don't do anything in that case). Warps of one block are run sequentially (right? ^^ So Fermi needs at least 2 blocks per MP to use all CUDA cores?).
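In launch terms (kernel and buffer names are just placeholders), that example would look like this:

__global__ void exampleKernel(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 959
    out[tid] = tid;
}

void launchExample(int *d_out)
{
    // 30 blocks of exactly one warp (32 threads) each: one block per MP
    // on a 30-MP device. As said above, not ideal, just an example.
    exampleKernel<<<30, 32>>>(d_out);
}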

Directly from Kirk and Hwu, page 98:

“The hardware executes an instruction for all threads in the same warp, before moving to the next instruction.”

This seems to directly contradict your explanation, where I think you are saying “the hardware executes an instruction for a group of 8 threads within a warp before moving to the next instruction.” You are also implying that the 4 groups of 8 threads in a warp are processed sequentially, while the Kirk and Hwu quote implies 32 threads executing exactly in parallel.

Both statements are correct. What your quoted phrase refers to is that all currently running threads of one warp execute the same instruction. This is why you get branching / code divergence if you use if-clauses in your kernels and threads of the same warp follow different execution paths.
Combining both statements:
The hardware executes one instruction for all threads of one warp by executing 8 threads at a time on the 8 CUDA cores of the MP before going on to the next instruction. This takes 4 clock cycles per instruction and warp.
E.g.:
// The hardware executes an instruction for all threads in the same warp
Cycle 1: threads 0-7 do instruction 1
Cycle 2: threads 8-15 do instruction 1
Cycle 3: threads 16-23 do instruction 1
Cycle 4: threads 24-31 do instruction 1
// before moving to the next instruction
Cycle 5: threads 0-7 do instruction 2
Cycle 6: threads 8-15 do instruction 2
Cycle 7: threads 16-23 do instruction 2
Cycle 8: threads 24-31 do instruction 2

So what I said is that the hardware executes one instruction for the first 8 threads of a warp on the CUDA cores of one MP, then the same instruction for the next 8 threads of this warp on the CUDA cores of the same MP, and so on. This is why you won't see branching when threads of different warps follow different execution paths while all threads in one warp follow the same path.
The Kirk/Hwu quote says that the hardware executes instruction after instruction for all threads in a warp without mixing them up (not: instr1-instr2-instr1-instr1-…). It does not say that the first instruction is completed for all threads at the same time. It just says that it is completed for all threads before the next instruction is done.
I hope my explanation is ok and helps you. Just feel free to keep asking.
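To make the branching point concrete, here is a tiny made-up kernel (names are mine, just for illustration) where threads of the same warp take different paths and are therefore serialized:

__global__ void divergeKernel(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // The first and second half of every warp take different branches,
    // so each warp has to execute both paths one after the other.
    if (threadIdx.x % 32 < 16)
        out[tid] = 1;
    else
        out[tid] = 2;
    // If the condition depended only on the warp index (threadIdx.x / 32),
    // whole warps would branch uniformly and there would be no divergence.
}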

Ahhh I see now, thank you very much for the explicit and well thought out explanation!

To add a little more… although a warp has 32 threads while an SM has only 8 cores, execution is pipelined, and the pipeline has four stages. This means that the following occurs:
clock 0: Threads 0-7 start the instruction
clock 1: Threads 8-15 start the instruction
clock 2: Threads 16-23 start the instruction
clock 3: Threads 24-31 start the instruction
clock 4: Threads 0-7 complete the instruction and store results
This is why all threads in a warp appear to execute simultaneously - all the threads in the warp have begun the pipeline (and hence ‘internalised’ their inputs) before the first threads in the warp complete.

In reality, it’s even more complicated, but hopefully this is roughly correct.

Hmm, the rule of thumb I use is to have at least 20 times as many threads as stream processors. This would come to 4800 for a 240 SP GPU.

Generally there seems to be little downside to having too many threads, but a big performance penalty for having too few.

Bill

Hm. This rule doesn't include the number of blocks or the warp size in the calculation. Considering that a 240 SP GPU has 30 MPs, you would need at least 30 blocks to give workload to each MP and SP. A better rule of thumb is to create a number of blocks greater than or equal to the number of MPs, and at least double that amount in order to hide latency and allow for better scheduling. I normally saw the best performance when specifying a block size of at least 256, which you should also take into account, I think.
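A sketch of that heuristic (the factor of 2 and the block size of 256 are just the rules of thumb from above, not hard requirements; the kernel is assumed to bounds-check or use a grid-stride loop):

#include <cuda_runtime.h>

// Pick a launch configuration: block size of 256 and at least twice as many
// blocks as there are MPs, or more if the data needs it.
void chooseLaunchConfig(int n, int &blocks, int &threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    threadsPerBlock   = 256;
    int blocksForData = (n + threadsPerBlock - 1) / threadsPerBlock;
    int minBlocks     = 2 * prop.multiProcessorCount;   // keep every MP busy
    blocks = (blocksForData > minBlocks) ? blocksForData : minBlocks;
}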

With the presence of pipelining on these MPs (and I believe the pipeline depth is much bigger than 4 stages), I find it not useful to define concurrency in terms of the number of CUDA cores. Instead I parameterize the behavior of an MP in terms of:

“concurrency”: # of threads using MP resources (i.e. “active” threads). Because CPUs actively swap threads in and out, defining concurrency as “running right now” and “using CPU registers” are basically the same thing, even if you include hyperthreading in the mix. CUDA is like massive hyperthreading where nothing is ever swapped out, so I tend to think of all the threads holding registers as running concurrently.

“throughput”: # of thread-instructions completed per second. This is really what the # of CUDA cores controls at the moment.

The generally accepted formula (and even recommended by Nvidia if you check the Programming Guide) is to have at least 6 active warps (or 192 threads) per SM to fully hide the 24-cycle instruction latency. Given that one SM has 8 cores (pre-Fermi), that amounts to 192/8 = 24 threads per stream processor if you don't want to waste cycles.
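Spelled out as arithmetic (all figures are the pre-Fermi ones quoted above):

// Pre-Fermi latency hiding, using the numbers from the post above:
//   instruction latency:           24 cycles
//   issue time per warp:            4 cycles (32 threads / 8 cores)
//   warps needed to hide latency:  24 / 4  = 6 warps per SM
//   threads per SM:                 6 * 32 = 192 threads
//   threads per SP:               192 / 8  = 24 threads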

Uhm, sorry, but I just can't remember what SM stands for (in this context ^^). SPs or CUDA cores run threads - or execute them, which is the right term, I think. And MPs are the things with the 8 / 32 SPs / CUDA cores and the registers and shared memory etc. Now you mentioned SMs where I thought it should be about MPs. I'm a bit confused, so what exactly is the difference between SMs and MPs? Maybe I forgot something.

Sorry, I was tired when I wrote that. I meant “SM” as in “streaming multiprocessor”. :)

Architecturally, CUDA cores are like fancy ALUs. Warps are the minimal unit of execution, so you can think of the bank of SPs as a SIMD unit. (Basically, the instruction decoder is at the SM level.)

Ah ok. Actually my brain mixed up SM with MP.

I have two questions

say I have a kernel with 96 threads per block, and there is something that needs to be done that was too hard to parallelize, so I opted to just have thread 0 do it

e.g.

if ( threadIdx.x == 0 )
{
    … about 50 instructions
}

  1. I believe that only warp 0 will run during those 50 cycles (i.e. warps 1 and 2 will be skipped)

  2. but thinking about YDD's clock cycles above: as only thread 0 needs to do anything, will “threads 0-7 complete” happen at clock 1 instead of clock 4?

(my expectation is that that would be tricky for the pipelining, so it will stay at clock 4)

Thanks

If only thread 0 has to do anything, then thread 0 will need, say, 50 cycles, while the other threads of this warp (which is the smallest group of threads launched at once) will each run but return immediately, still needing the minimum of 4 clock cycles (4 times 8 threads, 1 cycle each) per warp. If those other threads also have work to do, then the instructions will be pipelined as posted before. So if thread 0 has x separate instructions, you have a total of at least 4 + x clock cycles. During the time this warp runs, no other warps will be executed by this SM on pre-Fermi. Fermi has 2 warp schedulers and would run your “long” warp on 16 CUDA cores and different warps on the other 16 CUDA cores of the SM. Because 8 threads of one warp (or 16 with Fermi) are run one after the other, threads 0-7 should complete when your 50-instruction thread 0 is done, I think. The next warp will start as soon as the long one has finished.
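For completeness, a common shape for that pattern (a made-up kernel; the serial section is just a stand-in for your ~50 instructions), including the __syncthreads() you usually want if the other threads depend on the result:

__global__ void mostlyParallelKernel(float *data, int n)
{
    __shared__ float s_scale;

    // Serial part: only thread 0 of each block does this work. The rest of
    // warp 0 is masked out, and warps 1 and 2 simply fall through the branch.
    if (threadIdx.x == 0)
        s_scale = 2.0f;            // stand-in for the ~50 serial instructions

    // Make the result visible to the whole block before anyone uses it.
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Parallel part: all 96 threads per block participate.
    if (tid < n)
        data[tid] *= s_scale;
}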

Hello everyone, I have a GTX 970 and I have the same question about “true” parallelism of threads. Can I still apply this logic?

If I understood it well, in my case (13 Multiprocessors, 128 CUDA Cores/MP), supposing that I have 13 blocks with 5 warps each, I would have something like this on every MP:
Cycle 1: threads 0-127 do instruction 1
Cycle 2: threads 128-159 do instruction 1
Cycle 3: threads 0-127 do instruction 2

Is that correct?