instruction or operation

Regarding the following metrics:

inst_fp_32 : Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
flop_count_sp : Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

I want to know: what is the exact difference between an FP instruction and an FP operation? Such a separation sounds like you can do an FP addition with a non-FP instruction. Is that right?!

For example, in my analysis, I see the following values roughly:

inst_fp_32 = 400M
flop_count_sp = 800M

An FFMA is an FP instruction; it is what appears in the output of cuobjdump -sass, for example.

An FP operation is something like an add, subtract, or multiply.

The two are related, but not always the same. FFMA is a good example; it is a “Fused Multiply-Add”:

It is a single instruction.

It counts as 2 operations (a multiply operation and an add operation).

Both the binary utilities like cuobjdump and an overview of the machine instruction sets are covered in the documentation:

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

The basic floating-point building block of the GPU is the fused multiply-add: FMA (a, b, c) = a * b + c, where the full product enters into the addition unmodified (neither rounded nor truncated), and there is a single rounding at the end.

One FMA instruction thus comprises two floating-point operations. In your example, it seems the entire floating-point activity was composed of FMA instructions. If there were FADD or FMUL instructions in any sizeable quantity, the ratio of flop_count_sp to inst_fp_32 would be closer to 1.
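To make that concrete, here is a minimal sketch (the kernel name and setup are invented for illustration): a kernel whose body is one multiply-add, which the compiler contracts into a single FFMA by default.

__global__ void fma_kernel(const float *a, const float *b,
                           const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i] + c[i];   // contracted into one FFMA by default
}

For n active threads this should contribute roughly n to inst_fp_32 and 2n to flop_count_sp; the index arithmetic is integer and counts toward neither.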

Yes, you are right. I see:

Single precision add = 0
Single precision FMA ~ 400M
Single precision MUL ~200k

So the value of flop_count_sp is about 2 times the instruction count. Thank you.

Hi,
Although I understand the concept, I still see some strange numbers which I don’t understand. For example:

flop_count_sp_add = 0
flop_count_sp_fma = 0
flop_count_sp_mul = 0
flop_count_sp_special = 23,294,600

Therefore, flop_count_sp is correctly 0 because it is (add+2*fma+mul).
However inst_fp_32 is 351,591,800.

So, the question is: what are the rest of the instructions? I mean, there are 328,297,200 instructions that are neither ADD nor FMA nor MUL nor SPECIAL.

My magic eight ball is broken, and I flunked my tarot classes back in school. I don’t know what code you are running, and neither do I know what GPU architecture you are using.

Generally speaking, there are floating-point instructions other than FADD, FMUL, FFMA, and MUFU. For example, conversion instructions F2F, F2I and I2F, comparison instructions FSET and FSETP, a select-type instruction FCMP, minimum and maximum FMNMX. Some MUFU instructions are designed to work in combination with the range-reduction instruction RRO.
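As a source-level sketch (the kernel is invented for illustration), a simple clamp typically compiles to FMNMX instructions, which count toward inst_fp_32 but contribute nothing to flop_count_sp:

__global__ void clamp_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fminf(fmaxf(in[i], 0.0f), 1.0f);   // typically FMNMX, not FADD/FMUL/FFMA
}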

Perhaps you should start to learn how to use the binary tools, and learn about PTX, since it is at least somewhat instructive for the SASS code.

The binary tools and PTX manual are both documented at docs.nvidia.com

It’s also possible to see the SASS instruction stream from within a visual profiler like nvvp.
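For example, a minimal command-line workflow might look like this (the binary name app and the target architecture are just assumptions for the sketch):

nvcc -arch=sm_61 -o app app.cu   # sm_61 is only an example target
cuobjdump -sass app              # disassemble the embedded SASS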

OK. I have read through the manuals about the ISA [1]. There I cannot see SPECIAL floating-point instructions. There is a special register-to-register instruction, which is shown under the miscellaneous instructions.

Maybe that is explained somewhere else. I would appreciate it if you could point me to the right manual among the many docs on the website.

[1] https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#maxwell-pascal

How does the profiler itself describe flop_count_sp_special? The “special” in flop_count_sp_special presumably refers to the MUFU instructions that are implemented in the MUlti Function Unit. This provides the following “special” operations:

MUFU.RCP
MUFU.RSQ
MUFU.SQRT
MUFU.EX2
MUFU.LG2
MUFU.SIN
MUFU.COS
MUFU.RCP64H
MUFU.RSQ64H

Compare also this question on Stack Overflow:

https://stackoverflow.com/questions/44388623/nvidias-nvprof-outputs-for-flops
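As a hedged sketch (the kernel is invented; __sinf and rsqrtf are standard CUDA device intrinsics), source like this exercises the MUFU path:

__global__ void special_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(rsqrtf(in[i]));   // typically MUFU.RSQ followed by MUFU.SIN
}

Such operations show up under flop_count_sp_special but, per the metric definition quoted at the top, are excluded from flop_count_sp.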

OK. From what I understand of the documentation, inst_fp_32 contains ADD, MUL, FMA, special (MUFU), and some other general FP instructions (e.g. the FSET instruction). Moreover:

flop_count_sp = ADD + MUL + 2*FMA
inst_fp_32 = flop_count_sp + special + others

Is that correct?

Anything with “inst” in the name is likely counting instructions, not operations. So I would expect:

inst_fp_32 = # of (ADD + MUL + FMA + special + others)

Or, stated differently:

inst_fp_32 = # of (any possible kind of FP32 instruction)

I am not sure what you are trying to accomplish by doing a deep dive on these metrics. Your original question asked whether one can do floating-point additions other than through FADD. You can do so in trivial ways by replacing FADD (a, b) with FMA (a, 1, b) or FADD (a, a) with FMUL (2, a), as sketched below, but that won’t help with anything I could think of. In rare circumstances where both addends are known to be denormals, you could use integer addition to accomplish floating-point addition, but I don’t see any practical use case for this.
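For concreteness, those substitutions might look like this in device code (fmaf and __fmul_rn are standard CUDA device functions; the helper names are made up):

__device__ float add_via_fma(float a, float b)
{
    return fmaf(a, 1.0f, b);     // FFMA computing a * 1 + b == a + b
}

__device__ float double_via_mul(float a)
{
    return __fmul_rn(a, 2.0f);   // FMUL computing a * 2 == a + a
}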

I am trying to analyze and compare some CUDA codes for instruction mix.

What you said looks reasonable, and I can verify it for some of my programs. However, I see an inconsistency which is not directly related to this topic.

I profiled OptiX with one input and ran it multiple times with different metrics, since putting a large number of metrics in one run makes it pretty slow. As you can see in the picture, I see these numbers:

https://pasteboard.co/I7p7ohy.png

FP instructions (single) = 1,034,103,385
ADD = 508,776,395
FMA = 25,158,111
MUL = 533,409,380
SPECIAL = 26,585,889

Therefore, FP < ADD+FMA+MUL+SPECIAL

Sounds bizarre… Although the input file was similar, the numbers of invocations were different, and I think that causes such an inconsistency.

Not all of my analyses look like that. I mean, I can verify that FP is less than the sum of the types I mentioned.

For what purpose? What are you ultimately trying to accomplish? Could this be an XY problem?

I have programmed in CUDA for quite a while, including software optimization, and never had to dive into the details of these profiler counters.

Make sure you don’t get warnings about HW counter wrap-around. On many GPUs the HW counters are woefully underdimensioned in the number of bits. The counts in #12 don’t look like that is happening, but it is good to be aware of it.

Consider filing an enhancement request with NVIDIA to explain the counters in more detail in documentation. You can use the bug reporting form on the developer website for that.

It is part of my research on benchmarking and comparing program codes. I have previously worked with CPU profilers (Intel VTune, Pin, and AMD CodeXL) for benchmarking and analyzing CPU-based programs. Now I am working with GPU applications.

However, the profiling metrics are not very well documented. I have even seen that for some programs, putting metric1 and metric2 in one nvprof command causes the program to crash! So I have to separate them into two commands.

These are common HW profiling limitations, I would think: (1) a limited number of metrics that can be extracted in one run, and (2) restrictions on which metrics can be extracted in the same run. From my time working with VTune I recall such restrictions on CPUs; they may have become less restrictive since.

If the metrics above were reported across multiple runs, it stands to reason that they are probably not fully consistent.

OK, I came across this solution: normalizing the numbers based on the number of invocations. Therefore, for ONE invocation, I get:

fp inst = 5,170,516.925
add = 1,905,529.56
fma = 94,225.13
mul = 1,997,787.94
special = 99,572.61

Then the other FP instructions come to about 1,073,403.
These results are better than the previous ones.

So that would leave you with about 1,073,000 “other” FP32 instructions. If the code is not too complicated, you should be able to look at the generated SASS to get the approximate ratio of the various classes of FP32 instruction to see whether the above makes sense.

In terms of efficiency, you might want to examine why the number of individual FADDs and FMULs is so much higher than the number of FMAs. FMA is really the computational workhorse of the GPU.

Unlike some host compilers, the CUDA compiler will not re-associate floating-point computations on its own, with the exception of applying FMA contraction to dependent FADD/FMUL pairs (this can be turned off with -fmad=false).
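A minimal sketch of what that contraction means (the function is invented for illustration):

__device__ float axpy(float a, float x, float y)
{
    // Default (-fmad=true): one FFMA. With nvcc -fmad=false: FMUL then FADD.
    return a * x + y;
}

Either way flop_count_sp charges 2 operations per evaluation, but without contraction inst_fp_32 doubles, which is one more reason the two metrics need not track each other.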