Control instructions

The result of a profiling run shows

issued control-flow instructions = 881300
control-flow instructions = 23657200
executed control-flow instructions = 881300

According to the definitions:

cf_issued Number of issued control-flow instructions
inst_control Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
cf_executed Number of executed control-flow instructions

The questions are:
1- What is the purpose of such classification?
2- Control instructions are usually understood to be jumps and branches, aren't they? So what is a control-flow instruction that is not a jump, branch, etc.?
3- If I want to compare against integer or inter-thread instructions, should I use cf_executed or inst_control?


Hi @mahmood.nt, did you find the answers to your questions?

There is this other post, but the link provided in the answer is not working.

These are my own findings based on searches and documents, so further explanations are welcome.
1- For instructions, an "issued" count generally includes "replays" [1], i.e. an instruction that has to be issued again is counted again. "Executed", on the other hand, refers to completion/retirement. So executed is less than or equal to issued.
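As a quick sanity check of that relation, here is a minimal sketch using the two counters from the first post. It assumes, as suggested above, that issued = executed + replays; the helper name replay_overhead is made up for illustration and is not an nvprof API.

```python
# Counter values copied from the profiling run in the first post.
cf_issued = 881_300
cf_executed = 881_300

def replay_overhead(issued, executed):
    """Average number of replays per executed instruction,
    under the assumption that issued = executed + replays."""
    return (issued - executed) / executed

# Executed can never exceed issued if the assumption holds.
assert cf_executed <= cf_issued

print(replay_overhead(cf_issued, cf_executed))  # 0.0, i.e. no replays in this run
```

With both counters equal, this run shows no replayed control-flow instructions, which is consistent with cf_issued and cf_executed reporting the same value.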

2- The control-flow instruction count collected by "inst_control" is the sum of the SASS instructions classified as "Control Instructions" for that specific architecture [2]. You can use NVBit [3] to get the instruction histogram, and you will see that the results are similar, though small differences are to be expected.

3- For my work, I found that inst_control matched what I wanted.

Hope that helps.

[1] https://stackoverflow.com/questions/35566178/how-to-explain-instruction-replay-in-cuda
[2] https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
[3] https://github.com/NVlabs/NVBit

I tried running NVBit and nvprof on 0_Simple/matrixMul from the NVIDIA samples, but the values differ too much.

I executed only two iterations of the matrixMul on a Tesla K40, with CUDA 10.1.

inst_control 2048000
The sum of the SASS instructions from the NVBit histogram tool that are listed as control instructions in the Kepler instruction set [1] is 153600.

The nvprof value is about 13x larger (2048000 / 153600 ≈ 13.3). Do you have any clue what could cause this?

[1] https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#kepler

What are the commands for nvprof and nvbit then?

For nvprof
nvprof --metrics inst_control ./matrixMul

For nvbit
eval LD_PRELOAD=nvbit_release/tools/opcode_hist/opcode_hist.so ./matrixMul

nvprof numbers are at thread level while nvbit by default reports at warp level.
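To illustrate why the two levels give different totals, here is a minimal sketch of the two counting conventions. The active masks are hypothetical values chosen for illustration, not data from matrixMul, and the code does not model how either tool is implemented.

```python
WARP_SIZE = 32

def popcount(mask):
    """Number of set bits, i.e. threads that actually execute the instruction."""
    return bin(mask).count("1")

# Hypothetical active masks for three executions of the same branch:
# a full warp, a half warp, and a single thread.
active_masks = [0xFFFFFFFF, 0x0000FFFF, 0x00000001]

warp_level = len(active_masks)                           # one count per warp execution
thread_level = sum(popcount(m) for m in active_masks)    # one count per active thread

print(warp_level, thread_level)  # 3 49
```

So a thread-level count can be up to 32x the warp-level count for the same code, and the exact factor depends on how many threads are active each time.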

For nvprof I see

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301                              inst_control                 Control-Flow Instructions     2048000     2048000     2048000

To force nvbit to report thread-level stats, one has to set COUNT_WARP_LEVEL=0.
So, the command

COUNT_WARP_LEVEL=0 LD_PRELOAD=~/nvbit/nvbit_release/tools/opcode_hist/opcode_hist.so ./matrixMul

shows the following instruction counts for each invocation:

  BAR.SYNC = 4096000
  BRA = 2252800
  EXIT = 204800
  FFMA = 65536000
  IADD3 = 8806400
  IMAD = 1024000
  IMAD.WIDE = 4300800
  ISETP.GT.AND = 2252800
  LDG.E.SYS = 4096000
  LDS.U = 65536000
  LDS.U.128 = 16384000
  LEA = 204800
  MOV = 3072000
  NOP = 4096000
  S2R = 1433600
  SHF.L.U32 = 1228800
  SHFL.IDX = 204800
  STG.E.SYS = 204800
  STS = 4096000

For Volta, according to [1], only EXIT and BRA among the instructions above are classified as control instructions. So it is about 2.05M from nvprof vs. about 2.46M from nvbit (BRA + EXIT = 2,457,600). IMO the difference basically comes from how the nvbit and nvprof developers classify instructions.

To reassure yourself, you can check other instruction types. For example:

nvprof, inst_fp_32 is 65,536,000
nvbit (FFMA is the only FP32 instruction here) is also 65,536,000

nvprof, inst_integer is 17,817,600
nvbit
IADD3 = 8806400
IMAD = 1024000
IMAD.WIDE = 4300800
ISETP.GT.AND = 2252800
LEA = 204800
SHF.L.U32 = 1228800
SHFL.IDX = 204800
=> 18,022,400
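The same comparison can be scripted. This is a minimal sketch that only redoes the arithmetic above; the grouping of opcodes into classes follows this post and is an assumption, not nvprof's documented classification.

```python
# Per-invocation thread-level counts copied from the nvbit histogram above.
hist = {
    "BRA": 2_252_800, "EXIT": 204_800, "FFMA": 65_536_000,
    "IADD3": 8_806_400, "IMAD": 1_024_000, "IMAD.WIDE": 4_300_800,
    "ISETP.GT.AND": 2_252_800, "LEA": 204_800,
    "SHF.L.U32": 1_228_800, "SHFL.IDX": 204_800,
}

# Opcode classes as used in this post (assumed grouping).
CONTROL = {"BRA", "EXIT"}
FP32 = {"FFMA"}
INTEGER = {"IADD3", "IMAD", "IMAD.WIDE", "ISETP.GT.AND",
           "LEA", "SHF.L.U32", "SHFL.IDX"}

# Values reported by nvprof in this thread.
nvprof = {"inst_control": 2_048_000,
          "inst_fp_32": 65_536_000,
          "inst_integer": 17_817_600}

def total(opcodes):
    """Sum the histogram counts of the given opcodes."""
    return sum(hist[op] for op in opcodes)

nvbit = {"inst_control": total(CONTROL),
         "inst_fp_32": total(FP32),
         "inst_integer": total(INTEGER)}

for metric, expected in nvprof.items():
    print(metric, nvbit[metric], expected, nvbit[metric] - expected)

# Integer total with SHFL.IDX left out of the class.
integer_without_shfl = total(INTEGER - {"SHFL.IDX"})
print(integer_without_shfl)  # 17817600
```

One observation from the numbers: leaving SHFL.IDX out of the integer class gives 17,817,600, which is exactly the nvprof inst_integer value. That suggests, but does not prove, that nvprof does not count SHFL as an integer instruction here.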

You may also want to ask the nvbit developers for more precise definitions.

[1] https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#volta