Warps and Occupancy

hjazz · April 18, 2011, 10:56am

Hi,

I have always thought that the warp scheduler executes one warp at a time, depending on which warp is ready, and that this warp can come from any of the thread blocks resident on the multiprocessor. However, one of the Nvidia webinar slides states that “Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently”. So more than one warp can run at one time? How does this work?

Thank you.

tera · April 18, 2011, 12:37pm

It depends on how you define running concurrently. So far, compute capability N.x devices issue instructions for N warps in parallel, which then remain in the pipeline for about 16…24 cycles. The Nvidia slide you are referring to, however, evidently counts all active warps on an SM as running, regardless of whether they issue an instruction in a particular cycle.
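To make the slide's definition concrete, here is a minimal sketch of the ratio it describes: active (resident) warps on an SM divided by the maximum the SM can hold. The helper name `occupancy` is hypothetical, and the limit of 48 resident warps per SM used in the example is an assumption for a compute capability 2.x device. Register and shared memory limits, which also cap how many blocks can be resident, are ignored here.

```python
WARP_SIZE = 32  # threads per warp on CUDA devices


def occupancy(threads_per_block, resident_blocks, max_warps_per_sm):
    """Hypothetical helper: active warps on an SM / maximum warps the SM can hold.

    Only counts warps; it does not model register or shared memory limits.
    """
    # A partially filled warp still occupies a full warp slot, so round up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    active_warps = warps_per_block * resident_blocks
    if active_warps > max_warps_per_sm:
        raise ValueError("more warps than the SM can hold")
    return active_warps / max_warps_per_sm


if __name__ == "__main__":
    # Assumed limit: 48 resident warps per SM (compute capability 2.x).
    # 4 blocks of 256 threads give 32 active warps out of 48.
    print(occupancy(256, 4, 48))
```

Note that none of this says how many of those warps issue an instruction in a given cycle; it only counts the warps the scheduler can choose from.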

hjazz · April 19, 2011, 1:48am

I thought only devices of compute capability 2.1 have dual warp schedulers, while devices of compute capability 2.0 and below have only a single warp scheduler?

So if we define “running” concurrently as issuing instructions and not simply waiting in the pipeline, then only 1 warp runs at one time for compute capability 2.0?

avidday · April 19, 2011, 6:40am

Compute 2.0 devices are dual-issue designs. Instructions from two warps are dual-issued (16 cores per warp), and retired over two clock cycles. Compute 2.1 takes this further by taking a different instruction from one of those warps and issuing it on the third bank of 16 cores, also retired over two clock cycles. So compute 2.1 has something close to out-of-order execution, on top of the basic dual-issue design of 2.0 cards.
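The arithmetic behind that description can be sketched as follows. The numbers (16 cores per bank, two banks on 2.0, three on 2.1) are taken from the post above; the function names are hypothetical, and this is only back-of-the-envelope peak arithmetic, not a model of the real scheduler.

```python
WARP_SIZE = 32        # threads per warp
CORES_PER_BANK = 16   # each warp instruction runs on one bank of 16 cores


def cycles_per_warp_instruction(cores_per_bank=CORES_PER_BANK):
    """Cycles a bank needs to push all 32 threads of one warp through."""
    return WARP_SIZE // cores_per_bank


def peak_thread_instructions_per_cycle(banks, cores_per_bank=CORES_PER_BANK):
    """Upper bound on thread-level instructions per cycle with every bank busy."""
    return banks * cores_per_bank


if __name__ == "__main__":
    print(cycles_per_warp_instruction())          # one warp instruction per bank
    print(peak_thread_instructions_per_cycle(2))  # compute 2.0: two banks
    print(peak_thread_instructions_per_cycle(3))  # compute 2.1: three banks
```

The third bank on 2.1 only contributes when the scheduler finds a second independent instruction in one of the two issuing warps, so the 2.1 figure is a ceiling rather than a typical rate.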

tera · April 19, 2011, 9:15am

Note that 2.0 and even 1.x GPUs can also issue a second instruction from the same warp in parallel. 1.x GPUs used this to sometimes issue a mul to the special function unit. On 2.0 devices this capability seems somewhat underused, as only 2.1 devices added the third set of cores to re-enable more than one arithmetic operation per cycle per thread. Still, there should be some instructions (moves?) that can be dual-issued on all GPUs, though I haven’t tried to identify them.
