I am profiling NVIDIA hardware against the AMD MI250X, which is a little difficult because the hardware counters mean different things on the two platforms.
One issue is that the AMD profilers split floating-point instructions into ADD/MUL/FMA and TRANS, where the latter counts transcendental functions such as sin, cos, and sqrt.
How does Nsight Compute count the sin/cos/sincos/sqrt intrinsics? Which metrics show calls to, e.g., __sincosf()? Or are these merged into the existing ADD/MUL/FMA counters?
Do these intrinsics contribute to load on the FPU or are they calculated on orthogonal hardware that does not reduce general floating point instruction performance?
A similar question: how many cycles do __sincosf or sqrt take compared to more straightforward operations like ADD/MUL/FMA?
These operations are executed on the xu pipeline; see the Kernel Profiling Guide in the Nsight Compute 12.6 documentation:
Transcendental and Data Type Conversion Unit. The XU pipeline is responsible for special functions such as sin, cos, and reciprocal square root. It is also responsible for int-to-float, and float-to-int type conversions.
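To see where these instructions show up in a report, a small probe kernel can help. The following is a minimal sketch, not an official sample: the names xu_probe and run_xu_probe are hypothetical, and it assumes a device that supports managed memory. Whether __sincosf() is lowered entirely to XU instructions depends on the compiler and architecture, so the generated SASS is worth checking in the Source page.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel: one fast sincos intrinsic per element.
__global__ void xu_probe(const float* in, float* s_out, float* c_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        __sincosf(in[i], &s, &c);   // fast-math intrinsic under test
        s_out[i] = s;
        c_out[i] = c;
    }
}

// Runs the probe and checks sin^2 + cos^2 ~= 1. Returns 0 on success.
int run_xu_probe()
{
    const int n = 1 << 20;
    float *in = nullptr, *s = nullptr, *c = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&s, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&c, n * sizeof(float)) != cudaSuccess)
        return 1;

    // Inputs stay within roughly [-pi, pi], where the intrinsic is most accurate.
    for (int i = 0; i < n; ++i) in[i] = -3.14f + 6.28f * i / n;

    xu_probe<<<(n + 255) / 256, 256>>>(in, s, c, n);
    int bad = (cudaDeviceSynchronize() != cudaSuccess);

    // Loose tolerance: the intrinsic trades accuracy for speed.
    for (int i = 0; i < n && !bad; ++i)
        if (fabsf(s[i] * s[i] + c[i] * c[i] - 1.0f) > 1e-3f) bad = 1;

    cudaFree(in); cudaFree(s); cudaFree(c);
    return bad;
}

int main()
{
    int rc = run_xu_probe();
    printf(rc == 0 ? "probe ok\n" : "probe failed\n");
    return rc;
}
```

Profiling the binary with something like `ncu --section ComputeWorkloadAnalysis ./a.out` should list pipeline utilization, including xu. The exact metric names vary by architecture, so `ncu --query-metrics` is the place to confirm them.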
Thank you that’s very helpful.
Since this is a separate pipeline, am I correct in assuming that its use doesn’t lock up the standard fma pipeline?
(Also I wanted to thank you @felix_dt for your response to my earlier question - I didn’t get a chance to thank you before the conversation was locked.)
Yes, these pipelines are processed independently.
In general, the HW executes a combination of fixed-latency instructions (i.e., a fixed number of cycles after issue) and variable-latency instructions (i.e., waiting for a data dependency). We do not document the number of cycles any of these operations take. You can use the warp stall sampling feature to estimate where in your kernel warps were stalled, and for which stall reason (naturally, if an instruction spends many cycles waiting on a data dependency to load, its chance of stalling the warp is relatively higher). These stall metrics are included by default in the SourceCounters.section.
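Since the cycle counts are not documented, a rough empirical comparison is possible with a dependent-chain microbenchmark. The sketch below is an assumption-laden illustration, not an NVIDIA-endorsed method: the names chain and measure are hypothetical, it times a single warp with clock64(), and __sinf() usually compiles to more than one instruction, so the numbers describe the whole sequence rather than a single XU operation. The compiler may also schedule work around the clock64() reads, so treat the results as estimates.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kIters = 4096;

// MODE 0: dependent FMA chain. MODE 1: dependent __sinf chain.
// MODE 2: both chains in the same loop, independent of each other.
template <int MODE>
__global__ void chain(float seed, float* out, long long* cycles)
{
    float a = seed + threadIdx.x;   // state of the FMA chain
    float b = seed * 0.5f;          // state of the __sinf chain
    long long t0 = clock64();
    #pragma unroll 16
    for (int i = 0; i < kIters; ++i) {
        if (MODE != 1) a = fmaf(a, 0.999f, 0.001f);
        if (MODE != 0) b = __sinf(b);
    }
    long long t1 = clock64();
    out[threadIdx.x] = a + b;       // keep both results live
    if (threadIdx.x == 0) *cycles = t1 - t0;
}

// Launches one warp and returns elapsed SM cycles, or -1 on error.
template <int MODE>
long long measure()
{
    float* out = nullptr;
    long long* cycles = nullptr;
    if (cudaMallocManaged(&out, 32 * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&cycles, sizeof(long long)) != cudaSuccess) {
        cudaFree(out);
        return -1;
    }
    *cycles = -1;
    chain<MODE><<<1, 32>>>(1.0f, out, cycles);
    long long result = (cudaDeviceSynchronize() == cudaSuccess) ? *cycles : -1;
    cudaFree(out);
    cudaFree(cycles);
    return result;
}

int main()
{
    long long fma = measure<0>(), xu = measure<1>(), both = measure<2>();
    if (fma < 0 || xu < 0 || both < 0) { printf("CUDA error\n"); return 1; }
    printf("cycles/iter: fma %.1f  __sinf %.1f  interleaved %.1f\n",
           (double)fma / kIters, (double)xu / kIters, (double)both / kIters);
    return 0;
}
```

If the interleaved figure is close to the larger of the two single-chain figures rather than their sum, that is consistent with the two pipelines overlapping. For stall attribution on a real kernel, `ncu --section SourceCounters ./a.out` collects the warp stall sampling data mentioned above.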