I am trying to figure out the overhead, in cycles, of the F2I and I2F instructions in a quantized convolution kernel from cuDNN. Nsight shows the portion of all executed instructions attributed to the XU pipeline, but not the portion of cycles.

The proportions of cycles taken up by the TC (INT), ALU, and FMA pipelines add up to 96.93%. Would it be fair to say that XU takes up at most the remaining 3.07%? Or are the data-conversion instructions excluded from the cycle computation, so they cannot be estimated this way?

For existing architectures, the same metric would be used for both the left and the right chart for XU pipe utilization, since all XU instructions have the same throughput. The same is true for the ALU pipe, but there the section file has added the metric to both charts.

This is a tangential question, but how can I determine the latency of XU instructions from an Nsight trace? Your response made me realize that my understanding of the left chart was wrong, so I am no longer sure how to determine that metric, or at least how to get the total number of cycles taken up by the XU pipeline.

In the CUDA Programming Guide table "Throughput of Native Arithmetic Instructions (Number of Results per Clock Cycle per Multiprocessor)", the throughput for the row "32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm (__log2f), base-2 exponential (__exp2f), sine (__sinf), cosine (__cosf)" is 16 results per clock cycle per multiprocessor for compute capability 7.x through 9.0.
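To read that table entry concretely (this worked example is mine, not from the guide): 16 results per clock per SM means a warp-wide MUFU instruction occupies the XU issue bandwidth for

```latex
\frac{32\ \text{threads/warp}}{16\ \text{results/cycle}} = 2\ \text{cycles of XU issue per warp-wide instruction}
```

Note this is issue throughput, not latency; a dependent instruction still has to wait the full pipeline latency for the result.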

In general, the CUDA Programming Guide documents the throughput of instructions, not their latency. The only way to determine latency is to write microbenchmarks.
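A minimal sketch of such a microbenchmark for XU (MUFU) latency, assuming a single-thread launch and a dependent chain of __sinf calls so that each operation must wait for the previous result (the kernel name, chain length, and launch configuration are illustrative, not from this thread):

```cuda
#include <cstdio>

// Time a dependent chain of __sinf (MUFU.SIN) operations with clock64().
// Because each input depends on the previous output, the chain serializes
// and the per-iteration cost approximates the instruction latency.
__global__ void xu_latency(float seed, float *out, long long *cycles) {
    const int N = 256;                 // length of the dependent chain
    float x = seed;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < N; ++i)
        x = __sinf(x);                 // each __sinf waits on the previous result
    long long stop = clock64();
    *out = x;                          // keep the chain from being optimized away
    *cycles = stop - start;
}

int main() {
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    xu_latency<<<1, 1>>>(0.5f, d_out, d_cycles);   // one thread: pure latency, no overlap
    cudaDeviceSynchronize();

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent __sinf\n", (double)cycles / 256);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The raw number still includes loop and clock-read overhead; subtracting the time of an otherwise identical kernel with an empty chain, and checking the SASS to confirm the loop compiled to back-to-back MUFU instructions, gives a cleaner estimate.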