I have written a fairly complex program which, as far as I can see, does not use double precision in any obvious way on the device side. Most CUDA devices are much faster using single precision. However, when compiling for compute capability 1.0 (sm_10), I get
ptxas somefile.ptx, line xyz Double is not supported. Demoting to float
I was not able to find this double. I have to get rid of it, because it forces me to use sm_10, which does not provide all CUDA features. With sm_13, performance drops by about 50% on my device. How can I find this annoying double?
In the ptx code, it appears near a function pointer. Here is a sketch:
ld.param.u32 %r12, [__cudaparm__kernelname…;
add.u32 %r13,%r12,%r11;
…
ld.global.f64 %fd1, [%r13+32]
So there is some array which comes with the function call. But I have no idea what exactly fd1 is. All floating point data used in the kernel is stored in arrays of structs, containing only floats. All register variables are float. Why is there a double? Doesn’t make sense to me.
Is there a way to use the ptx code to identify the problem? Is it a coding problem at all, or merely a compiler problem? I tried commenting out parts of the code - resulting in inconsistent behaviour with respect to the double problem above. Sometimes it appeared, sometimes not. I am grateful for any hints to identify the problem.
The usual source of unintended double precision code is floating point constants. Every constant intended to be represented in single precision should have an ‘f’ suffix; without it, the constant is a double and the expression it appears in is compiled to double precision. But it would probably be simplest if you could post the code passage, because there can be other, more subtle causes too.
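To illustrate the suffix rule, here is a minimal sketch. The kernel and helper names (halve_double_literal, halve_float_literal, halve_on_device) are made up for the example and are not from the original program. By the C/C++ promotion rules, x[i] * 0.5 is evaluated in double and then converted back to float, while x[i] * 0.5f stays in single precision. Whether f64 instructions actually end up in the PTX depends on the compiler version and optimization, so treat this as an illustration rather than a guaranteed reproduction.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: 0.5 is a double literal, so x[i] is promoted to
// double, multiplied, and the result converted back to float.
__global__ void halve_double_literal(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5;    // double-precision expression
}

// Same kernel with a float literal: the expression stays in single precision.
__global__ void halve_float_literal(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f;   // single-precision expression
}

// Hypothetical host helper: halves n floats in place on the device.
// Returns true if every CUDA call succeeded.
bool halve_on_device(float *host, int n, bool use_float_literal)
{
    float *dev = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        if (use_float_literal)
            halve_float_literal<<<blocks, threads>>>(dev, n);
        else
            halve_double_literal<<<blocks, threads>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

To locate such spots in a real program, one option is to generate the PTX with nvcc -ptx and search it for .f64; the instructions around each hit (the loads, stores and parameter names) usually point back to the source expression involved.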
Unfortunately, I do not have any constants as such. I realized, though, that I have to write all numbers as floats: e.g. write 0.5f instead of just 0.5. Otherwise the ptx code uses f64. I’ll check if that resolves the problem.
Thanks for clearing that up. I have changed the constants to floats. Now sm_13 is just as fast as sm_10. Wonderful… Also, is there some compiler setting I could use instead (except sm_10 of course)?