I have this code, which takes one value and copies it to the device, where it is assigned to another variable that is then copied back to the host and printed. However, this does not work:
Check the return codes for errors. What compute capability does your device have (does it support double-precision calculations), and how do you compile the code?
Operations on double-precision data are supported from compute capability 1.3 on. Does your GPU support capability 1.3 or better? If so, are you compiling with -arch=sm_13 (or targeting a higher compute capability)? Otherwise the compiler defaults to generating code for compute capability 1.0. You should see a compiler warning when targeting a compute capability less than 1.3: “warning : Double is not supported. Demoting to float”.
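If you are not sure which capability your GPU reports, a small query along these lines can tell you. This is only a sketch: `capabilityAtLeast` and `deviceSupportsDouble` are names made up for illustration, while `cudaGetDeviceProperties` and the `major`/`minor` fields of `cudaDeviceProp` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if capability major.minor is at least
// reqMajor.reqMinor.
bool capabilityAtLeast(int major, int minor, int reqMajor, int reqMinor)
{
    return major > reqMajor || (major == reqMajor && minor >= reqMinor);
}

// Hypothetical helper: prints the compute capability of `device` and
// reports whether it is 1.3 or higher, i.e. whether double precision
// is available in hardware.
bool deviceSupportsDouble(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "could not query device %d\n", device);
        return false;
    }
    printf("device %d (%s): compute capability %d.%d\n",
           device, prop.name, prop.major, prop.minor);
    return capabilityAtLeast(prop.major, prop.minor, 1, 3);
}
```

Calling `deviceSupportsDouble(0)` at the start of `main` would print the capability of the first GPU. If it reports 1.3 or better, the remaining step is to pass -arch=sm_13 (or a higher target) to nvcc when building.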
If you are already compiling with -arch=sm_13 or a similar option and are running on a GPU with a device capability of 1.3 or better, and the program still produces unexpected results, I would suggest adding error checking to the various CUDA API calls (this is a good idea in general).
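One common way to add that error checking is to wrap every runtime call in a macro. The sketch below is an assumption about what the program looks like, since the original code is not shown here: `CUDA_CHECK`, `copyValue` and `roundTrip` are hypothetical names, and `cudaDeviceSynchronize` assumes CUDA 4.0 or later (older toolkits used `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a CUDA
// runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the one in the question:
// assigns the input value to another variable on the device.
__global__ void copyValue(const double *in, double *out)
{
    *out = *in;
}

// Round-trips one value through the device, checking every call.
double roundTrip(double hostIn)
{
    double hostOut = 0.0;
    double *devIn = NULL;
    double *devOut = NULL;

    CUDA_CHECK(cudaMalloc((void **)&devIn, sizeof(double)));
    CUDA_CHECK(cudaMalloc((void **)&devOut, sizeof(double)));
    CUDA_CHECK(cudaMemcpy(devIn, &hostIn, sizeof(double),
                          cudaMemcpyHostToDevice));

    copyValue<<<1, 1>>>(devIn, devOut);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    CUDA_CHECK(cudaMemcpy(&hostOut, devOut, sizeof(double),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(devIn));
    CUDA_CHECK(cudaFree(devOut));
    return hostOut;
}
```

Kernel launches do not return an error code themselves, which is why the sketch checks `cudaGetLastError()` right after the launch and again synchronizes before copying the result back.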
Sorry, I forgot to refresh before sending off the previous post. Given that you are targeting a device with compute capability 1.1 your results are not surprising.
Note that variables of type float take up four bytes while variables of type double occupy eight bytes. With the double-to-float demotion on the device side the kernel now operates on four-byte quantities, while the host passes and receives the data as eight-byte quantities. Chaos ensues :-)
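To get a feel for the mismatch, here is a host-only simulation. It does not run anything on the GPU and is not a statement about what the hardware does in every case. It simply assumes a little-endian machine and a kernel that touches only the first four bytes of each eight-byte value; `simulateDemotedCopy` is a made-up name.

```cuda
#include <cstring>

// Host-only simulation (hypothetical helper, little-endian host assumed):
// mimics a kernel that was demoted to float while the host still
// transfers eight-byte doubles.
double simulateDemotedCopy(double hostIn)
{
    static_assert(sizeof(float) == 4 && sizeof(double) == 8,
                  "sketch assumes 4-byte float and 8-byte double");

    // The "kernel" reads only the first four of the eight bytes sent.
    float devIn;
    memcpy(&devIn, &hostIn, sizeof(float));

    // The assignment from the question, now done in single precision.
    float devOut = devIn;

    // The "kernel" writes four bytes; the host reads back eight.
    double hostOut = 0.0;
    memcpy(&hostOut, &devOut, sizeof(float));
    return hostOut;
}
```

Under those assumptions, `simulateDemotedCopy(1.0)` returns 0.0 rather than 1.0, because the low four bytes of the double 1.0 are all zero.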
In general it is a good idea to heed the warning about double-to-float demotion. In some instances it can be okay to let it slip, for example when the warning is triggered only by missing ‘f’ suffixes on literal floating-point constants and math function names in code that is otherwise single-precision only. However, even in those cases I would highly recommend making the code “float clean”.
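As an illustration of what “float clean” means, compare the two kernels in this sketch. All names are made up for the example. The first kernel uses double literals and the generic `sin`, which may pull double-precision arithmetic into otherwise single-precision code; the second uses ‘f’ suffixes and `sinf` throughout.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Not float clean: 0.5 and 1.0 are double literals, so the expression
// is evaluated in double precision and can trigger the demotion
// warning when targeting a capability below 1.3.
__global__ void shapeNotClean(float *x)
{
    *x = 0.5 * sin(*x) + 1.0;
}

// Float clean: 'f' suffixes on the literals and the single-precision
// math function keep everything in float.
__global__ void shapeClean(float *x)
{
    *x = 0.5f * sinf(*x) + 1.0f;
}

// Hypothetical host wrapper: runs shapeClean on a single value.
// The result stays at -1.0f if the allocation or a copy fails.
float runShapeClean(float v)
{
    float *dev = NULL;
    float result = -1.0f;
    if (cudaMalloc((void **)&dev, sizeof(float)) != cudaSuccess) {
        return result;
    }
    if (cudaMemcpy(dev, &v, sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        shapeClean<<<1, 1>>>(dev);
        cudaMemcpy(&result, dev, sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return result;
}
```

The float-clean version compiles without the demotion warning on any target, so a warning that does show up points at a real use of double.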
Thank you very much… I thought that writing a CUDA app is “platform ignorant” and the compiler makes sure that it gets compiled for whatever version one is using. If I wanted to port my app to a device of, say, compute capability 2.1 and benefit from full double precision, can it be achieved without rewriting the code?
Compute capabilities are like the layers of an onion. Each new capability version is a superset of the previous capability. The idea is that an app that runs successfully at capability N keeps running when moved to devices of capability N+1, N+2, etc. It doesn’t work the other way around.
Double-precision operations are supported starting at compute capability 1.3, and an app that works correctly on a compute capability 1.3 device should run correctly on compute capability 2.0 (Fermi) parts, for example.