On the page: Programming Guide :: CUDA Toolkit Documentation
the example uses src_shared, which is never defined. I assume src_global was meant?
And on the page: Kernel Profiling Guide :: Nsight Compute Documentation
it is stated:
fp16 pipeline: […] It also contains a fast FP32-to-FP16 and FP16-to-FP32 converter. Starting with GA10x chips, this functionality is part of the FMA pipeline.
alu pipeline: […] On NVIDIA Ampere architecture chips, the ALU pipeline performs fast FP32-to-FP16 conversion.
My understanding is that GA10x chips are part of the Ampere architecture. So which pipeline performs the FP32-to-FP16 conversion: the FMA pipeline, the ALU pipeline, or the FP16 pipeline? (edit: Submitted this question to the profiler forum, here.)