Thread Dispatching: 2 different instructions per cycle?

From my reading of many sources, including this site and the CUDA C Programming Guide, I have come to the following understanding.

  1. One partition (Ampere, e.g. GA106) contains 3 processing data paths:
    1x FP32 engine (16 cores)
    1x FP32/INT32 engine (16 cores)
    1x SFU unit (1)
  2. The warp scheduler can execute 2 instructions in the same cycle if:
  • they are independent
  • they are processed by different data paths.

I would appreciate it if someone could answer my questions:

  1. Is it possible that a partition executes, in the same cycle:
  • 16 FP32 + 16 FP32/INT32
    or
  • 16 FP32 + 1 SFU
    or
  • 16 INT32/FP32 + 1 SFU
  2. If YES, how can instruction i of the 32 threads of the same warp be different? For example, instruction i of thread 5 could be an FP32 multiply while instruction i of thread 13 is an SFU function (such as __sinf or __cosf); a rough sketch of what I mean follows below.
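
To make question 2 concrete, here is a rough sketch (a hypothetical kernel of my own, not real code) of the situation I have in mind, where different lanes of the same warp would "want" different data paths for their i-th instruction:

```
// Hypothetical kernel: lane 5 would need the FP32 data path,
// while lane 13 would need the SFU for its fast-math sine.
__global__ void mixed_lanes(float *out, const float *in)
{
    int lane = threadIdx.x % 32;       // lane index within the warp
    float v = in[threadIdx.x];

    if (lane == 5)
        v = v * 2.0f;                  // FP32 multiply
    else if (lane == 13)
        v = __sinf(v);                 // fast sine intrinsic, executed on the SFU

    out[threadIdx.x] = v;
}
```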

CC 2.1 - CC 6.x warp schedulers support dual-issuing warp instructions per cycle. From CC 7.0 (Volta) to the present, the schedulers issue a single warp instruction per cycle.

In CC 2.1 the warp scheduler's choice to dual-issue was dynamic.
In CC 3.0 - 6.x dual-issue is determined by the compiler.

On each cycle a warp scheduler selects an eligible warp and issues that warp's instruction. The warp's lanes may then be sub-divided by the width of the execution unit (e.g. the FP32 unit is 16 lanes wide) and fed to it over multiple cycles.
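
As a concrete illustration (my own sketch, not code from the answer above): the two statements below are independent and map to different data paths (FP32 vs. INT32), so on a CC 3.0 - 6.x part the compiler can pair them for dual-issue, while on Volta and later each warp instruction is simply issued on its own cycle:

```
__global__ void independent_ops(float *f_out, int *i_out,
                                const float *f_in, const int *i_in)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // No data dependency between these two statements, and they target
    // different data paths, so they are candidates for dual-issue on
    // CC 3.0 - 6.x; on CC 7.0+ they are issued on separate cycles.
    float f = f_in[idx] * 1.5f;   // FP32 multiply
    int   i = i_in[idx] + 7;      // INT32 add

    f_out[idx] = f;
    i_out[idx] = i;
}
```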

You mean the instruction executed by thread/lane i of the warp?

All threads of a warp have to execute the same instruction or optionally be inactive during that time (e.g. in conditional/if…else blocks). It is not possible to split the work between different threads. Even if a GPU has only 16 FP32 computation units and 16 INT32 computation units, one cannot give each half-warp a different instruction and expect the GPU to run at the same speed as if one instruction had been given to the full warp. The (lower than 32) number of computation units does not lower the warp granularity for scheduling. The dual-issue @Greg mentions affects the whole warp, i.e. two independent instructions are started at the same time for all threads of a warp (for CC 2.1 - 6.x, so not on Ampere).
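
To make that concrete, here is a minimal sketch (my own example, assuming the usual handling of divergence): when lanes of a warp take different branches, the scheduler still issues one instruction at a time for the whole warp, with the non-participating lanes masked off, so the two branches are processed one after the other rather than side by side on different data paths:

```
__global__ void divergent(float *out, const float *in)
{
    int lane = threadIdx.x % 32;
    float v = in[threadIdx.x];

    if (lane < 16)
        v = v * 2.0f;        // issued for the whole warp; lanes 16..31 are inactive
    else
        v = __sinf(v);       // issued for the whole warp afterwards; lanes 0..15 are inactive

    out[threadIdx.x] = v;    // every lane stores its own result
}
```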

I am not sure whether some units with either a very low count or a non-fixed latency (e.g. units behind the MIO pipeline, such as the SFU, or FP64 on consumer GPUs) finish early if only some lanes use them.

The feeding of the execution units that @Greg mentioned does not prevent other execution units / pipelines from being scheduled in the following cycles while that feeding is still going on.