Warps - Number of threads running concurrently


I am not sure I understand the whole warp mechanics. Suppose I have a GPU with 1 MP and 8 SPs, and 1024 threads can reside on it at the same time, for a total of 1024/32 = 32 warps. Does each SP take a warp (whichever one is ready to be executed), execute one instruction, then take another warp (possibly one belonging to another block), and so on, so that a single core doesn't execute an entire block by itself?
So in my case, would I have a total of 8 * 32 = 256 threads running concurrently?

Also, to best hide the latency from memory accesses, should we only care about maximizing the number of resident warps?


If I'm not wrong, having only 8 SPs means you can have only 8 threads executing concurrently. The warp scheduler takes a warp and issues its next instruction over 4 clocks, so instructions for a quarter-warp are issued in each clock. SPs work in groups, and SPs in the same group always deal with threads from the same warp. In your case, your SPs are probably grouped in 8, and there is only one group. In the GTX 4** series, each MP has 2 or 3 (GF104) groups of ALUs (SPs), each consisting of 16 cores.

Though, if you consider the length of the pipeline (the arithmetic latency), it's possible that you have 256 threads in flight at the same time.

What do you mean by "SPs work in groups"? Is a group of SPs a multiprocessor (MP)?

The name “CUDA core” or the older name “streaming processor (SP)” is kind of misleading. A CUDA core is not like a CPU core at all, since it has no separate registers or instruction scheduler. Rather, a CUDA core is more like a fancy pipelined arithmetic unit.

The multiprocessor is the entity that does instruction scheduling and manages the shared memory, cache, and register file. You should think of the multiprocessor like a CPU core that executes vector instructions of length 32 (i.e. the warp size). The number of CUDA cores per multiprocessor gives you an idea of the rate at which the multiprocessor can execute these warp instructions.

With only 8 CUDA cores per MP, the compute capability 1.x devices could finish a warp instruction every 4 clocks. With compute capability 2.0, the 32 CUDA cores per MP were split into two groups of 16. Each of these groups can complete a warp instruction every 2 clock cycles, and two instruction scheduler units are used to feed these two groups of 16 CUDA cores. Double precision instructions required all 32 cores to work together on one warp.

With the release of compute capability 2.1, each MP gets 3 groups of 16 CUDA cores, but no increase in the number of instruction schedulers. Instead, a scheduler can decide to issue a second independent instruction from one warp into the third group of 16 CUDA cores. This is not always possible, hence the compute throughput per MP of compute capability 2.1 is usually somewhere between 1.0 and 1.5 times that of capability 2.0.

“Threads” in CUDA are nothing like CPU threads, so it is not useful (in my opinion) to talk about thread concurrency beyond the definition used for occupancy: how many warps are actively available to the instruction schedulers on the MP. More warps means more opportunities to hide global memory latency, but may also affect your memory access patterns, so you have to benchmark many options if you really want to be sure you are getting maximum throughput. (I usually stick with 256 threads per block unless I get to the point where squeezing every last bit of speed matters.)

Take a look at this: GF104

Pretty good explanation of the "groups".