NSight 3.0 - How to find out which instructions are double precision?


Another thing on the instruction-level analysis: it seems to me that my kernel shouldn't execute many double precision instructions (by code inspection). However, the "CUDA Achieved FLOPS" experiment gives me roughly the same number of double precision instructions as single precision instructions (close to a billion). I hope my conclusion is correct that if I could eliminate these double precision instructions, my overall performance should rise (as the K10 has a single precision performance around 20 times higher than its double precision performance).

Up to now, I have used the Source View in the analysis report window (.nvreport) to inspect the PTX code. However, this window doesn't seem to show a correspondence between PTX (and/or SASS) lines and CUDA C code lines …


  1. Do you think the statistics taken from the analyzer are correct?
  2. Is my assumption correct, that if I eliminate or substitute the double precision instructions with single precision instructions my performance will rise?
  3. How are you guys inspecting the code? Do you have any hints for me?



To get code correlation between C source code, PTX, and SASS you have to tell the compiler to generate debug information or line information. To enable generation of line information follow these steps:

  1. Open Solution Explorer
  2. Right click on your .cu file or project file
  3. Execute the Properties command
  4. In the left pane select the tree node Configuration Properties | CUDA C/C++ | Device
  5. In the right pane change Generate Line Number Information to Yes

The compiler makes a best effort to maintain line information, but some optimizations make it impossible to keep the correlation, so the source-to-SASS mapping may not always be correct.

In order to find the double precision instructions you can go to the SASS view and search for the following opcodes:

  • DADD
  • DMUL
  • DFMA

The most common reasons for double precision are:

  1. Floating point constant specified without a size suffix (1.0 is a double, 1.0f is a float).
  2. Calling double precision math functions (sin() vs. sinf()).
  3. Promotion of float to double for varargs (printf("%f", 1.0f) requires upconverting 1.0f to double).

Removing use of double on a CC 3.0 architecture can improve instruction throughput and reduce register pressure.

In your specific case, running the Instruction Count experiment on a kernel built with the Debug configuration may provide the quickest means of mapping SASS instructions to C source lines.

Thanks for the suggestions. Especially looking for DADD, DMUL, DFMA pointed me to instructions where I forgot to indicate that a literal is float instead of double. These instructions were in the core of my kernel …

So now I have eliminated all double precision instructions and increased my SP performance to 238 GFLOPs on one GK104 chip. Before, I had around 30 GFLOPs for SP and DP.