CUDA Sample

an924104 · July 13, 2018, 12:11pm

Hi, I’m a beginner in CUDA programming, and I’ve tried running some of the CUDA samples provided by Nvidia. I’m interested in hardware acceleration of some single-precision (FP32) floating-point functions such as 1/x, sqrt(x), 1/sqrt(x), exp2(x), and log2(x).

1. On the latest GPUs, are those functions accelerated by special function units (SFUs), as they were on the Fermi architecture?

I’ve compiled the CUDA samples and generated the PTX files. The PTX contains operations such as rcp.rn.f32 and sqrt.rn.f32.

2. Is there any way to trace and log the input/output data for these operations? I found some debugger tools provided by Nvidia, but I’m not sure whether they would help.

njuffa · July 13, 2018, 1:24pm

(1) For code analysis, always look at the machine code (SASS), not PTX. PTX is a virtual ISA and compiler intermediate representation, which an optimizing compiler translates into SASS. You can look at SASS by disassembling CUDA-generated executables with cuobjdump --dump-sass.

(2) rcp.rn.f32, sqrt.rn.f32 are properly rounded (in the IEEE-754 sense) reciprocal and square root operations that have no direct hardware equivalent. You will see a longish sequence of SASS operations generated for these.

(3) FP32 special function units exist in all GPUs supported by CUDA. They are actually called MUFU (multi-function unit), and you would be looking for operations like MUFU.EX2, MUFU.RCP, etc. in SASS. In Pascal and later architectures, an approximate square root was added to the previously existing set of MUFU operations.
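As a rough illustration of point (3), the sketch below wraps the approximate PTX instructions (rcp.approx.ftz.f32, sqrt.approx.ftz.f32, and so on) in inline assembly. These are the PTX forms that typically compile to one MUFU operation or a very short sequence around one, but the exact SASS depends on the target architecture and compiler version, so treat this as something to disassemble and inspect rather than a guarantee. The helper names, kernel name, and file name are hypothetical and not from the thread.

```cuda
// approx_mufu.cu (hypothetical file name)
// Build:   nvcc -o approx_mufu approx_mufu.cu
// Inspect: cuobjdump --dump-sass approx_mufu   (then search for MUFU)
#include <cstdio>
#include <cuda_runtime.h>

// Each helper issues exactly one approximate PTX instruction.
__device__ float rcp_approx(float x) {
    float y;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
__device__ float sqrt_approx(float x) {
    float y;
    asm("sqrt.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
__device__ float rsqrt_approx(float x) {
    float y;
    asm("rsqrt.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
__device__ float ex2_approx(float x) {
    float y;
    asm("ex2.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
__device__ float lg2_approx(float x) {
    float y;
    asm("lg2.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}

// Writes five results per input: out[k * n + i] holds function k of in[i].
__global__ void approx_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    out[0 * n + i] = rcp_approx(x);
    out[1 * n + i] = sqrt_approx(x);
    out[2 * n + i] = rsqrt_approx(x);
    out[3 * n + i] = ex2_approx(x);
    out[4 * n + i] = lg2_approx(x);
}

// Host helper: h_out must hold 5 * n floats. Returns 0 on success.
int run_approx(const float* h_in, float* h_out, int n) {
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, 5 * n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    approx_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    // This blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, 5 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    const int n = 4;
    float in[n] = {0.5f, 1.0f, 4.0f, 10.0f};
    float out[5 * n];
    if (run_approx(in, out, n) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("%10s %12s %12s %12s %12s %12s\n",
           "x", "rcp", "sqrt", "rsqrt", "ex2", "lg2");
    for (int i = 0; i < n; ++i) {
        printf("%10g %12.7g %12.7g %12.7g %12.7g %12.7g\n", in[i],
               out[0 * n + i], out[1 * n + i], out[2 * n + i],
               out[3 * n + i], out[4 * n + i]);
    }
    return 0;
}
```

Because the kernel stores every input alongside its results, copying the buffers back to the host is also a simple way to record what went into and came out of each operation.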

(4) The CUDA compiler defaults to IEEE-754 compliant basic FP32 operations and high-accuracy implementations of FP32 math functions. To get faster approximate versions of some of them, you would want to use device function intrinsics, such as __log2f(), and/or compiler switches such as -prec-sqrt=false, -prec-div=false, -use_fast_math. Consult the documentation, in particular the Best Practices Guide, the Programming Guide, and the nvcc documentation.
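To make point (4) concrete, here is a minimal sketch that evaluates the default (accurate) functions and the corresponding device intrinsics on the same inputs, then reports the largest relative difference on the host. The kernel name, helper name, file name, and input values are assumptions chosen for illustration. Note that compiling with -use_fast_math maps log2f() and expf() to their intrinsic counterparts, so the reported differences for those two would be expected to collapse to zero in that case.

```cuda
// compare_fast.cu (hypothetical file name)
// Build: nvcc -o compare_fast compare_fast.cu
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Slot 0: log2, slot 1: exp, slot 2: reciprocal.
__global__ void compare_kernel(const float* in, float* accurate,
                               float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    accurate[0 * n + i] = log2f(x);             // high-accuracy default
    accurate[1 * n + i] = expf(x);
    accurate[2 * n + i] = 1.0f / x;             // IEEE-754 rounded division
    fast[0 * n + i] = __log2f(x);               // approximate intrinsics
    fast[1 * n + i] = __expf(x);
    fast[2 * n + i] = __fdividef(1.0f, x);
}

// Fills diff[k] with the maximum relative difference for slot k.
// Inputs should keep the accurate results away from zero.
// Returns 0 on success.
int max_rel_diff(const float* h_in, int n, float diff[3]) {
    float *d_in = nullptr, *d_acc = nullptr, *d_fast = nullptr;
    size_t in_bytes = n * sizeof(float);
    size_t out_bytes = 3 * n * sizeof(float);
    if (cudaMalloc(&d_in, in_bytes) != cudaSuccess ||
        cudaMalloc(&d_acc, out_bytes) != cudaSuccess ||
        cudaMalloc(&d_fast, out_bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_acc); cudaFree(d_fast);
        return 1;
    }
    cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice);
    compare_kernel<<<(n + 127) / 128, 128>>>(d_in, d_acc, d_fast, n);

    std::vector<float> acc(3 * n), fast(3 * n);
    cudaError_t e1 = cudaMemcpy(acc.data(), d_acc, out_bytes,
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(fast.data(), d_fast, out_bytes,
                                cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_acc); cudaFree(d_fast);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return 1;

    for (int k = 0; k < 3; ++k) {
        diff[k] = 0.0f;
        for (int i = 0; i < n; ++i) {
            float a = acc[k * n + i];
            float rel = fabsf(a - fast[k * n + i]) / fabsf(a);
            if (rel > diff[k]) diff[k] = rel;
        }
    }
    return 0;
}

int main() {
    const int n = 16;
    float in[n];
    // Inputs start at 2.0 so that log2f(x) is never zero.
    for (int i = 0; i < n; ++i) in[i] = 2.0f + 0.37f * i;

    float diff[3];
    if (max_rel_diff(in, n, diff) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("max relative difference, log2f vs __log2f:      %g\n", diff[0]);
    printf("max relative difference, expf  vs __expf:       %g\n", diff[1]);
    printf("max relative difference, 1.0f/x vs __fdividef:  %g\n", diff[2]);
    return 0;
}
```

Disassembling both variants with cuobjdump --dump-sass should show much shorter instruction sequences for the intrinsic versions, though the exact listing varies by architecture.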

an924104 · July 16, 2018, 7:22am

Thank you for the reply. I still have some questions about logging the input/output data for MUFU (rcp, sqrt, log2) instructions. I’ve read some of the documentation for the Nsight debugger tool (Eclipse Edition).

1. Is there any detailed documentation or tutorial for it, like there is for Nsight Visual Studio Edition?
2. Does the Eclipse Edition support GPU core dump information, as the Visual Studio Edition does?
3. Is there any better method to dump the input/output data for MUFU instructions?
