I am trying to figure out the overhead, in cycles, of the F2I and I2F instructions in a quantized convolution kernel from cuDNN. Nsight shows the portion of all executed instructions attributed to the XU pipeline, but not the portion of cycles.

The proportions of cycles taken up by the TC (INT), ALU, and FMA pipelines add up to 96.93%. Would it be fair to say that XU takes up at most the remaining 3.07%? Or are the data-conversion instructions excluded from the cycle computation, so they cannot be estimated this way?

For existing architectures, the same metric would be used for both the left and the right chart for XU pipe utilization, since all XU instructions have the same throughput. The same is true for the ALU pipe, but there the section file has added the metric to both charts.

This is a tangential question, but how can I determine the latency of XU instructions from an Nsight trace? Your response made me realize that my understanding of the left chart was wrong, so I am no longer sure how to determine that metric, or at least how to get the total number of cycles taken up by the XU pipeline.

In the CUDA Programming Guide table "Throughput of Native Arithmetic Instructions (Number of Results per Clock Cycle per Multiprocessor)", the throughput for the row "32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm (__log2f), base-2 exponential (__exp2f), sine (__sinf), cosine (__cosf)" is 16 results per clock cycle per multiprocessor for compute capability 7.x through 9.0.
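To read that table entry concretely (this worked example is mine, not from the guide): 16 results per clock per SM means a warp-wide MUFU instruction occupies the XU issue bandwidth for

```latex
\frac{32\ \text{threads/warp}}{16\ \text{results/cycle}} = 2\ \text{cycles of XU issue per warp-wide instruction}
```

Note this is issue throughput, not latency; a dependent instruction still has to wait the full pipeline latency for the result.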

In general, the CUDA Programming Guide documents the throughput of instructions, not their latency. The only way to determine latency is to write microbenchmarks.
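A minimal sketch of such a microbenchmark for XU (MUFU) latency, assuming a single-thread launch and a dependent chain of __sinf calls so that each operation must wait for the previous result (the kernel name, chain length, and launch configuration are illustrative, not from this thread):

```cuda
#include <cstdio>

// Time a dependent chain of __sinf (MUFU.SIN) operations with clock64().
// Because each input depends on the previous output, the chain serializes
// and the per-iteration cost approximates the instruction latency.
__global__ void xu_latency(float seed, float *out, long long *cycles) {
    const int N = 256;                 // length of the dependent chain
    float x = seed;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < N; ++i)
        x = __sinf(x);                 // each __sinf waits on the previous result
    long long stop = clock64();
    *out = x;                          // keep the chain from being optimized away
    *cycles = stop - start;
}

int main() {
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    xu_latency<<<1, 1>>>(0.5f, d_out, d_cycles);   // one thread: pure latency, no overlap
    cudaDeviceSynchronize();

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent __sinf\n", (double)cycles / 256);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The raw number still includes loop and clock-read overhead; subtracting the time of an otherwise identical kernel with an empty chain, and checking the SASS to confirm the loop compiled to back-to-back MUFU instructions, gives a cleaner estimate.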