Warp Size Question

I know that the warp size indicates the number of threads that will be executed with the same set of instructions.
“Each multiprocessor processes threads in groups called warps.”
My question is: if the warp size is 32, does that mean all threads in the warp are actively executing at a given time? For example, my device has only 16 cores per multiprocessor. So if all threads have to be active, does each core run 2 threads?

Only 16 cores per SM? Usually it’s 8, so each core has to execute 4 threads, and an instruction thus needs 4 clock cycles to complete. Here is some more info:

If your device has 16 cores per SM, I guess you would need only 2 cycles per instruction, but I thought that what could change is the number of SMs, not the number of cores per SM, which is fixed at 8. In any case, I’m just beginning with this, so I may be mistaken.

Depends on the architecture. For 1.x, your device would have 8 cores per SM and would need 4 clock cycles to execute all 32 threads in a warp for integer and single-precision float arithmetic instructions (32 cycles for double precision, since there is only one DP unit). For 2.0, there are 32 cores per SM and 2 warp schedulers. For integer or floating-point instructions, each scheduler issues an instruction from a different warp, meaning that you can execute 2 warps in 2 clock cycles (for DP instructions, only one of the schedulers can be active per instruction, meaning that you can execute 1 warp in 2 clock cycles). I have read that this is because the DP units are built from two SP units plus some auxiliary logic, but perhaps someone can enlighten me on this issue.
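The cycle counts above can be sketched with a back-of-the-envelope model (my own simplification, not an official NVIDIA formula): with N cores available to a warp, issuing one instruction for all 32 threads takes 32 / N clock cycles.

```python
# Simplified model of warp issue cost: with N cores available to a
# warp, one instruction for all 32 threads takes 32 / N clock cycles.
WARP_SIZE = 32

def issue_cycles(cores_available):
    """Clock cycles to issue one instruction across a full warp."""
    return WARP_SIZE // cores_available

# Compute 1.x SM: 8 cores -> 4 cycles per warp instruction.
print(issue_cycles(8))   # 4
# Fermi: each scheduler drives 16 of the 32 cores -> 2 cycles per warp.
print(issue_cycles(16))  # 2
```

This matches the figures in the thread: 4 cycles per warp on 8-core 1.x SMs, and 2 cycles per warp per scheduler on Fermi.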

Thanks for clarifying. As for what you say about DP units: in the link I posted above, at the end of the section “Dual Issue”, it says that the double-precision unit and the single-precision units share logic and as a result cannot be active simultaneously.

Aah! I think I have interpreted the deviceQuery output wrong. It says Number of multiprocessors: 2 and Number of cores: 16.
This probably means I have 8 cores per multiprocessor, for a total of 16 cores! As stated, I should have 8 cores per SM for compute capability 1.1. Is this true?
Also, the other question is: given this hardware, how many threads can be active at a given time in total? Will it be 32 (warp size) * 16 (number of cores)? Or just 16 (number of cores), since each core executes only 1 instruction per clock per thread? Is it correct to say that each core “simultaneously executes 4 threads”?

At the core level, it is easiest to think of the architecture as scalar: one instruction from one thread is retired per clock. But the logical “SIMD width” is 32 (i.e. the warp size), and every instruction is effectively run 4 times over 4 successive clock cycles before it is retired. So 32 threads are running on an MP at any given time, 8 on any given clock cycle. To further complicate things, the hardware can manage/schedule/pipeline a maximum of 768 threads per MP on your generation of hardware. So you can give an answer anywhere between 16 and 1536 and not be wrong…
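The “anywhere between 16 and 1536” range can be worked out explicitly for a 2-SM compute 1.1 device, using the figures quoted in this thread (these are assumptions taken from the discussion, not values queried from hardware):

```python
# Numbers quoted in the thread for a 2-SM compute 1.1 device.
cores_per_sm = 8
num_sms = 2
warp_size = 32
max_threads_per_sm = 768  # scheduling limit on this hardware generation

threads_retired_per_clock = cores_per_sm * num_sms   # one thread per core per clock
threads_running_per_sm = warp_size                   # one warp "running" on an SM
max_resident_threads = max_threads_per_sm * num_sms  # device-wide scheduling limit

print(threads_retired_per_clock)  # 16
print(max_resident_threads)       # 1536
```

Depending on which of these three quantities you call “active”, all of the answers in the thread are defensible.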

Is it true that one can have as many thread blocks scheduled on an MP as there are cores available? So for 1.x devices there are 8 cores per multiprocessor, and for 2.0 devices, 32. How does the MP handle the scheduling of blocks if there are multiple blocks available? I guess what I’m wondering is whether the GPU picks one block out of the stack of blocks and makes it active on some MP, or grabs as many as it has resources for, allocates them on the MP, and starts scheduling warps from one of them.

I can’t find all the nitty-gritty details at the moment, so I thought you might know, avidday. I would appreciate your help.

No. The limit is just 8 blocks per MP on all current architectures AFAIK.

None of this is documented, but the consensus was that on pre-Fermi hardware it was pretty simple: MPs get “filled” with as many blocks as occupancy allows per scheduling event, and then not again until all those blocks have retired and the MP is idle. None of this is part of the programming model or documented anywhere, so everything has been deduced by experimentation. Fermi is clearly much more complex/sophisticated than the older hardware, but I don’t think enough micro-benchmarking has yet been done by people much more talented than I to illuminate what it does in detail.

I see what you’re saying. I just don’t understand the role of the thread block on each MP.

For pre-Fermi, I know that once a set of blocks is allocated to an MP, based on restrictions such as max threads, registers, shared memory, etc., the MP waits until all of those blocks have finished before the next set of blocks is allocated. For Fermi, there are 2 schedulers, and that might make it better somehow. When the blocks are on the MP, for pre-Fermi cards, does the MP execute a warp until it is finished, or does it execute a single instruction from a warp and then possibly schedule a similar instruction from a different warp?

I’m trying to understand how having 1 thread block allocated on an MP is worse than having 8 blocks scheduled. To me, supposing negligible latency for scheduling new thread blocks, it seems like these 2 cases shouldn’t affect performance too much:

  1. 1 thread block on the MP. MP grabs a warp, starts executing instructions in 4 cycles for SP and 32 cycles for DP. With branching, some execution is serialized, but it is unavoidable.

  2. 8 thread blocks on the MP. MP grabs a warp from some block, starts executing instructions in 4 cycles for SP and 32 cycles for DP… When finished with this block, starts working on another block.

Do you see what I’m wondering about? It seems to me that the only difference is that more resources are taken up in Case 2, and there might be more latency associated with scheduling multiple small batches of blocks to the MP, but that they execute instructions on blocks in the same manner and on basically the same timeframe. What’s the difference?

What about a sample case in which all of the threads from the same block need to fetch data from the global memory to the shared memory with synchronization? With multiple thread blocks allocated to the same SM, the warp scheduler can assign warps from another block to execute while waiting for the memory fetch from the original block.

Okay, you are right. I can see how having multiple blocks would help when there is global memory latency involved. The program I was dealing with could never utilize shared memory because the problems were so large. Also, using 512 TPB meant I could only have 1 block scheduled per MP, because each thread used 32 registers (= 16,384 registers, the maximum allowed). With 256 TPB, I could have 2 blocks scheduled on the MP, but the performance was worse than in the 512 case. Also, going from 512 to 256 TPB effectively doubled the number of blocks in the grid (from 11,740 to 23,480).
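The register-limited occupancy arithmetic described here can be written out as a quick sketch. This is a simplification: real occupancy also depends on shared memory, per-MP block limits, and register allocation granularity.

```python
# Register-limited occupancy sketch, using the numbers from this post.
regs_per_sm = 16384    # register file size on this device (max allowed)
regs_per_thread = 32   # the kernel's reported register usage

def blocks_per_sm(threads_per_block):
    """How many blocks fit in the register file (register limit only)."""
    return regs_per_sm // (regs_per_thread * threads_per_block)

print(blocks_per_sm(512))  # 1 block resident: 512 * 32 = 16384 registers
print(blocks_per_sm(256))  # 2 blocks resident: 256 * 32 = 8192 registers each
```

So at 512 TPB one block exactly fills the register file, and halving the block size frees room for a second resident block.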

I think that, given my algorithm, it was more important to have fewer blocks than to have 2 blocks scheduled per MP. On the smaller data sets, though, 256 TPB performed better than 512 TPB. You want to have enough blocks to keep everybody busy all the time, but too many blocks take too long.

So, would there be any other reasons why having 8 blocks assigned to an MP is better than only 1?

Yes, because 1 thread instruction per cycle per core is only the throughput, not the latency. Actual execution of 1 simple instruction takes about 22…24 cycles. If the next instruction wants to use the result of the previous one, it has to wait for all of those cycles. Independent instructions from other warps can, however, execute in the meantime.

This is why you should at least have 24/4=6 warps assigned to each MP.

Yet another reason is __syncthreads(). Until the last warp in the block reaches the synchronization point, all other warps of the block have to wait. If you have a second block available, its warps can still run, though.

Usually tiling techniques (dividing the data set into smaller data subsets) are used to address that issue, though it’s not always trivial.

Well, I guess that your card assigned both blocks to the same SM, and since what really matters inside an SM is the number of warps, which was 16 in both cases, the performance was not really improved. I think that if each block had been assigned to a different SM, the performance should have increased (though of course the specific program you are running also has something to say, I guess).

I don’t really know if this is even possible, I mean, whether two blocks can be assigned to a single SM while another SM is free. Does somebody know whether the SMs get completely filled one after another, or whether blocks are distributed uniformly among all SMs (in the initial assignment, I mean)?


I understand what you’re saying, tera, about __syncthreads(). I know some people talk about GPU cycles vs. GPU “fast” cycles, and that certainly confuses me. Is that why there is a discrepancy between your statement that a simple instruction takes about 22-24 cycles and the Programming Guide’s figure of 4?

Also, I don’t know where you are getting the numbers for the 24/4 = 6 warps assigned to each MP. Does this mean you need at least 6 warps from 6 different blocks, or possibly 6 warps from 1 block, or something in between?

The “fast” vs. “slow” clock thing is something different, I’m only talking about “fast” clock cycles.

Each core can start execution of one instruction of one thread per (fast) clock cycle. As a multiprocessor has only 8 cores, this takes 4 cycles for the 32 threads of a warp. So after 4 cycles, the warp can start execution of another instruction. This is the 4-cycle throughput number given in the Programming Guide.

However, the result of the operation is only known after execution of the instruction finishes, after about 22…24 (fast) clock cycles. So if the next operation wants to use this result (“depends on it”), it has to wait for 24 (instead of 4) clock cycles. Now if you have 6 warps running in round-robin fashion, each warp’s instruction results will be available just in time for the next instruction of that warp to run. This is called latency hiding, as it now appears that latency is irrelevant and only throughput determines the execution speed.
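The round-robin argument can be checked with a few lines of arithmetic, using the throughput and latency figures from this thread:

```python
# Minimal model of latency hiding: issue throughput of 4 cycles per
# warp instruction, dependent-result latency of 24 cycles.
issue_cycles = 4      # cycles to issue one instruction for a 32-thread warp
result_latency = 24   # cycles until the result can be consumed

warps_needed = result_latency // issue_cycles
print(warps_needed)  # 6

# With 6 warps issuing round-robin, warp 0's next turn comes at
# cycle 6 * 4 = 24, exactly when its previous result is ready.
next_turn = warps_needed * issue_cycles
assert next_turn == result_latency
```

With fewer than 6 warps, warp 0’s turn would come back around before cycle 24 and it would have to stall waiting for its own result.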

For this it does not matter whether the 6 warps are from the same thread block or different ones. However, warps from the same block tend to stall (waiting for memory or at a __syncthreads()) at the same time, as they execute the same instruction stream and synchronize at each __syncthreads(), so there is a slight advantage if some of the warps are from a different block.

There is another form of “latency hiding” in the Programming Guide. Since the latency numbers are mostly irrelevant once you have at least 6 warps per SM, the Programming Guide just gives the throughput numbers and the rule to have at least 6 warps per SM. Latency is mostly hidden in the Guide as well. :)

That is a very nice explanation, by the way. I really do appreciate it! I’m making a presentation to some of my coworkers, and I want to know as much as I can in case they have any tough questions. So, thanks again.

This is a minor detail that confused me initially, but I had let it go. The NVIDIA guide states that the register latency (subsequent read after write) is 24 clock cycles and that this is why one needs 6 warps to hide it. However, if the latency is 24 clock cycles (which I believe can be read as extra clock cycles), shouldn’t you need 7 warps in total to hide it?

I guess this just depends on how you define latency. In my definition (which I believe to be the generally accepted one) latency is the total number of clock cycles after which the result is available (not the number of cycles on top of the throughput number).

Yes, the definition confused me. For example, it’s weird how a latency of 4 cycles would be zero latency in the context of CUDA.