CUDA Sample

Hi, I’m a beginner in CUDA programming, and I’ve tried running some of the CUDA samples provided by NVIDIA. I’m interested in hardware acceleration of FP32 floating-point functions such as 1/x, sqrt(x), 1/sqrt(x), exp2(x), and log2(x).

1. Are these functions still accelerated by a special function unit (SFU) on the latest GPUs, as they were on the Fermi architecture?

I’ve compiled the CUDA samples and generated the PTX files. Operations such as rcp.rn.f32 and sqrt.rn.f32 appear in those PTX files.

2. Is there any way to trace and log the input/output data of these operations? I found some debugger tools provided by NVIDIA, but I’m not sure whether they would help.

(1) For code analysis, always look at the machine code (SASS), not PTX. PTX is a virtual ISA and compiler intermediate representation that an optimizing compiler translates into SASS. You can look at SASS by disassembling CUDA-generated executables with cuobjdump --dump-sass.

(2) rcp.rn.f32, sqrt.rn.f32 are properly rounded (in the IEEE-754 sense) reciprocal and square root operations that have no direct hardware equivalent. You will see a longish sequence of SASS operations generated for these.

(3) FP32 special function units exist in all GPUs supported by CUDA. They are actually called MUFU (multi-function unit), and you would be looking for operations like MUFU.EX2, MUFU.RCP, etc. in SASS. In Pascal and later architectures, an approximate square root was added to the previously existing set of MUFU operations.
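
To make points (1) to (3) concrete, here is a minimal sketch of a test program one could disassemble. The file name mufu_demo.cu, the kernel name, and the sm_70 target are illustrative assumptions, not something from the posts above; the exact SASS sequence will depend on the architecture and compiler version.

```cuda
// mufu_demo.cu (hypothetical file name, used only for this illustration)
//
// Build and inspect (sm_70 is an arbitrary example target):
//   nvcc -arch=sm_70 -o mufu_demo mufu_demo.cu
//   cuobjdump --dump-sass mufu_demo        // machine code: look for MUFU.* here
//   nvcc -arch=sm_70 -ptx mufu_demo.cu     // writes mufu_demo.ptx for comparison
#include <cstdio>
#include <cuda_runtime.h>

// With default compiler flags, the division and sqrtf() below are expected
// to appear as rcp.rn.f32 and sqrt.rn.f32 in PTX, and as a multi-instruction
// sequence (typically built around a MUFU operation) in SASS.
__global__ void recip_sqrt(const float *in, float *rcp, float *root, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        rcp[i]  = 1.0f / in[i];   // correctly rounded reciprocal
        root[i] = sqrtf(in[i]);   // correctly rounded square root
    }
}

int main()
{
    const int n = 4;
    float h_in[n] = {1.0f, 2.0f, 4.0f, 16.0f};
    float h_rcp[n], h_root[n];
    float *d_in, *d_rcp, *d_root;

    cudaMalloc(&d_in,   n * sizeof(float));
    cudaMalloc(&d_rcp,  n * sizeof(float));
    cudaMalloc(&d_root, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    recip_sqrt<<<1, n>>>(d_in, d_rcp, d_root, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_rcp,  d_rcp,  n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_root, d_root, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("x = %g  1/x = %g  sqrt(x) = %g\n", h_in[i], h_rcp[i], h_root[i]);

    cudaFree(d_in);
    cudaFree(d_rcp);
    cudaFree(d_root);
    return 0;
}
```

Comparing the .ptx file with the cuobjdump output for the same kernel shows how a single PTX operation can expand into several SASS instructions.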

(4) The CUDA compiler defaults to IEEE-754 compliant basic FP32 operations and high-accuracy implementations of FP32 math functions. To get faster approximate versions of some of them, you would want to either use device function intrinsics, such as __log2f(), and/or use compiler switches such as -prec-sqrt=false, -prec-div=false, -use_fast_math. Consult the documentation, in particular the Best Practices Guide, the Programming Guide, and the nvcc documentation.
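
As a rough illustration of point (4), the sketch below calls the standard function and the intrinsic side by side so the results can be compared on the host. The kernel name and test values are made up for this example, and how large the differences are (if any) depends on the inputs and the GPU.

```cuda
// Compile without -use_fast_math so that log2f() and the division keep
// their default high-accuracy implementations:
//   nvcc -o fast_vs_accurate fast_vs_accurate.cu   (hypothetical file name)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compare(const float *in, float *log_acc, float *log_fast,
                        float *rcp_acc, float *rcp_fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        log_acc[i]  = log2f(x);             // high-accuracy library function
        log_fast[i] = __log2f(x);           // approximate intrinsic
        rcp_acc[i]  = 1.0f / x;             // IEEE-754 rounded by default
        rcp_fast[i] = __fdividef(1.0f, x);  // approximate division intrinsic
    }
}

int main()
{
    const int n = 4;
    const int cols = 4;                     // four result arrays of n floats
    float h_in[n] = {0.3f, 3.0f, 10.0f, 12345.678f};
    float h_res[cols * n];
    float *d_in, *d_res;

    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_res, cols * n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // The four outputs live back to back in one allocation.
    compare<<<1, n>>>(d_in, d_res, d_res + n, d_res + 2 * n, d_res + 3 * n, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_res, d_res, cols * n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("x = %.9g  log2f = %.9g  __log2f = %.9g  1/x = %.9g  __fdividef = %.9g\n",
               h_in[i], h_res[i], h_res[n + i], h_res[2 * n + i], h_res[3 * n + i]);

    cudaFree(d_in);
    cudaFree(d_res);
    return 0;
}
```

Rebuilding the same file with -use_fast_math should make the two columns of each pair agree, since that switch maps the standard functions onto their intrinsic counterparts.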

Thank you for the reply. I still have some questions about logging the input/output data of MUFU (rcp, sqrt, log2) instructions. I’ve read some of the documentation for the Nsight debugger (Eclipse Edition).

1. Is there detailed documentation or a tutorial for it, like there is for Nsight Visual Studio Edition?
2. Does the Eclipse Edition support GPU core dumps, as the Visual Studio Edition does?
3. Is there a better way to dump the input/output data of MUFU instructions?
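
No answer to question 3 appears in the thread. One low-tech possibility, offered only as a sketch, is to have the kernel itself record each operand and result in a device buffer and copy that buffer back to the host. The MufuRecord struct, the approx_rcp() helper, and the kernel name are hypothetical. The inline PTX instruction rcp.approx.ftz.f32 is real, but that it compiles to a single MUFU.RCP is an assumption to verify with cuobjdump --dump-sass.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical record type: one operand/result pair per thread.
struct MufuRecord {
    float in;
    float out;
};

// Approximate reciprocal through inline PTX, so the compiler has little
// freedom to substitute a different instruction sequence.
__device__ float approx_rcp(float x)
{
    float y;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}

__global__ void traced_rcp(const float *in, float *out, MufuRecord *trace, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float y = approx_rcp(x);
        trace[i].in  = x;   // log the operand
        trace[i].out = y;   // log the result
        out[i] = y;
    }
}

// Reinterpret a float's bit pattern for exact hexadecimal printing.
static unsigned int float_bits(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

int main()
{
    const int n = 4;
    float h_in[n] = {0.5f, 3.0f, 7.0f, 1000.0f};
    MufuRecord h_trace[n];
    float *d_in, *d_out;
    MufuRecord *d_trace;

    cudaMalloc(&d_in,    n * sizeof(float));
    cudaMalloc(&d_out,   n * sizeof(float));
    cudaMalloc(&d_trace, n * sizeof(MufuRecord));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    traced_rcp<<<1, n>>>(d_in, d_out, d_trace, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_trace, d_trace, n * sizeof(MufuRecord), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("in = %.9g (0x%08x)  out = %.9g (0x%08x)\n",
               h_trace[i].in,  float_bits(h_trace[i].in),
               h_trace[i].out, float_bits(h_trace[i].out));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_trace);
    return 0;
}
```

Printing the raw bit patterns alongside the decimal values makes it possible to compare the logged results bit for bit against a reference implementation. Device-side printf() inside the kernel is another option for small runs, at the cost of slower execution and a limited output buffer.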