Thread Dispatching: 2 different instructions per cycle?

From my reading of many sources, including this site and the CUDA C Programming Guide, I have come to the following understanding.

  1. One partition (Ampere, e.g. GA106) contains 3 processing data paths:
    1x FP32 engine (16 cores)
    1x FP32/INT32 engine (16 cores)
    1x SFU unit (1)
  2. The warp scheduler can execute 2 instructions in the same cycle if:
  • they are independent
  • they are processed by different data paths.

I would appreciate it if someone could answer my questions:

  1. Is it possible that a partition executes, in the same cycle:
  • 16 FP32 + 16 FP32/INT32
    or
  • 16 FP32 + 1 SFU
    or
  • 16 INT32/FP32 + 1 SFU
  2. If YES, how can instruction i of the 32 threads of the same warp be different? For example, instruction i of thread 5 could be an FP32 multiply while instruction i of thread 13 is an SFU function (such as __sinf or __cosf); a rough sketch of what I mean follows below.
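
To make question 2 concrete, here is a rough sketch (a hypothetical kernel of my own, not real code) of the situation I have in mind, where different lanes of the same warp would "want" different data paths for their i-th instruction:

```
// Hypothetical kernel: lane 5 would need the FP32 data path,
// while lane 13 would need the SFU for its fast-math sine.
__global__ void mixed_lanes(float *out, const float *in)
{
    int lane = threadIdx.x % 32;       // lane index within the warp
    float v = in[threadIdx.x];

    if (lane == 5)
        v = v * 2.0f;                  // FP32 multiply
    else if (lane == 13)
        v = __sinf(v);                 // fast sine intrinsic, executed on the SFU

    out[threadIdx.x] = v;
}
```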

CC 2.1 - CC 6.x warp schedulers support dual-issuing warp instructions per cycle. From CC 7.0 (Volta) to the present, the schedulers issue a single warp instruction per cycle.

In CC 2.1 the warp scheduler's choice to dual-issue was dynamic.
In CC 3.0 - 6.x dual-issue is determined by the compiler.

On each cycle a warp scheduler selects an eligible warp and issues that warp's instruction. The warp's lanes may then be sub-divided by the width of the execution unit (e.g. the FP32 unit is 16 lanes wide) and fed to it over multiple cycles.
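
As a concrete illustration (my own sketch, not code from the answer above): the two statements below are independent and map to different data paths (FP32 vs. INT32), so on a CC 3.0 - 6.x part the compiler can pair them for dual-issue, while on Volta and later each warp instruction is simply issued on its own cycle:

```
__global__ void independent_ops(float *f_out, int *i_out,
                                const float *f_in, const int *i_in)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // No data dependency between these two statements, and they target
    // different data paths, so they are candidates for dual-issue on
    // CC 3.0 - 6.x; on CC 7.0+ they are issued on separate cycles.
    float f = f_in[idx] * 1.5f;   // FP32 multiply
    int   i = i_in[idx] + 7;      // INT32 add

    f_out[idx] = f;
    i_out[idx] = i;
}
```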

You mean the instruction executed by thread/lane i of the warp?

All threads of a warp have to execute the same instruction or optionally be inactive during that time (e.g. in conditional/if…else blocks). It is not possible to split the work between different threads. Even if a GPU has only 16 FP32 computation units and 16 INT32 computation units, one cannot give each half-warp a different instruction and expect the GPU to run at the same speed as if one instruction had been given to the full warp. The (lower than 32) number of computation units does not lower the warp granularity for scheduling. The dual-issue @Greg mentions affects the whole warp, i.e. two independent instructions are started at the same time for all threads of a warp (for CC 2.1 - 6.x, so not on Ampere).
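
To make that concrete, here is a minimal sketch (my own example, assuming the usual handling of divergence): when lanes of a warp take different branches, the scheduler still issues one instruction at a time for the whole warp, with the non-participating lanes masked off, so the two branches are processed one after the other rather than side by side on different data paths:

```
__global__ void divergent(float *out, const float *in)
{
    int lane = threadIdx.x % 32;
    float v = in[threadIdx.x];

    if (lane < 16)
        v = v * 2.0f;        // issued for the whole warp; lanes 16..31 are inactive
    else
        v = __sinf(v);       // issued for the whole warp afterwards; lanes 0..15 are inactive

    out[threadIdx.x] = v;    // every lane stores its own result
}
```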

I am not sure whether some units with either a very low count or a non-fixed latency (e.g. units behind the MIO pipeline, such as the SFU, or FP64 on consumer GPUs) finish early if only some lanes use them.

The feeding of the execution units that @Greg mentioned does not prevent other execution units / pipelines from being scheduled in the following cycles while that feeding is still going on.