Another thing on the instruction-level analysis: from inspecting the code, it seems to me that my kernel shouldn't execute many double-precision instructions. However, the “CUDA Achieved FLOPS” experiment reports roughly the same number of double-precision instructions as single-precision instructions (close to a billion). I hope my conclusion is correct that if I could eliminate these double-precision instructions, my overall performance should rise, since the K10's single-precision throughput is around 20 times higher than its double-precision throughput.
Up to now, I have used the Source View in the analysis report window (.nvreport) to inspect the PTX code. However, this window doesn't seem to map PTX (and/or SASS) lines to CUDA C source lines …
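A common way for double-precision instructions to end up in a kernel that looks single-precision on inspection is an unsuffixed floating-point literal: `0.5` has type `double`, so `x * 0.5` with a `float x` is evaluated in double precision and only then converted back. The sketch below illustrates just that typing rule. It is plain host-side C++ so it compiles without nvcc, the helper names `scale_promoted` and `scale_single` are hypothetical, and whether this is what actually happens in the kernel discussed here is an assumption that would have to be checked in the PTX.

```cpp
#include <type_traits>

// Hypothetical helpers, for illustration only. The same usual arithmetic
// conversions apply to expressions inside a CUDA C __global__ kernel.

// Unsuffixed literal: 0.5 is a double, so the multiply is done in double
// precision and the result is converted back to float on return.
float scale_promoted(float x) { return x * 0.5; }

// Suffixed literal: 0.5f is a float, so the multiply stays in single precision.
float scale_single(float x) { return x * 0.5f; }

// Compile-time check of the expression types themselves.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is evaluated as double");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");
```

Both helpers return the same value for simple inputs, which is why this kind of promotion is easy to miss when reading the source and only shows up in the instruction counts.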
- Do you think that the statistics taken from the analyser are correct?
- Is my assumption correct that performance will rise if I eliminate the double-precision instructions or replace them with single-precision ones?
- How are you guys inspecting the code? Do you have any hints for me?