I might be asking a stupid question here, but I’m curious whether it is possible to convert (cast) input data from double-precision to single-precision floats on a device with compute capability < 1.3.
Since I’m not familiar with how the different data types are stored, I could probably spend way too much time figuring this out on my own. I found an example of someone who successfully got it working on a device (ARM7) that doesn’t support double-precision data types (devmaster.net: “Convert double to float”), but to be honest, I don’t understand much of what he did, let alone how to port it to CUDA.
My project involves making CUDA available to another software package that uses double-precision as its native data type. All the data I receive is therefore cast to single-precision before being transferred to the device, which takes 20–40% of my software’s total execution time, depending on input data size. I know not to expect the full amount back if on-device conversion is possible (kernel execution and data transfers would take longer), but I think performance would still benefit.
If someone can point me in the right direction, or give me a simple ‘no’, that would be greatly appreciated. Code examples are of course always welcome ^_^
The code example you linked to does exactly what you want, but it is quite inefficient compared to doing the conversion in hardware. You can probably do a much better job by optimising the host-side code to improve double-to-float conversion throughput, e.g. by using SSE2’s CVTPD2PS instruction (via the _mm_cvtpd_ps intrinsic).
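To illustrate, here is a minimal sketch of a host-side conversion loop using that intrinsic. The function name and the unaligned-load choice are my own assumptions; the core of it is just _mm_cvtpd_ps converting two doubles per iteration, with a scalar tail for odd counts:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_cvtpd_ps */
#include <stddef.h>

/* Hypothetical helper: convert n doubles to floats, two at a time. */
void convert_double_to_float(const double *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d d = _mm_loadu_pd(src + i);     /* load two doubles (unaligned) */
        __m128  f = _mm_cvtpd_ps(d);           /* CVTPD2PS: two floats in the low lanes */
        _mm_storel_pi((__m64 *)(dst + i), f);  /* store the low two floats */
    }
    for (; i < n; ++i)                         /* scalar tail for odd n */
        dst[i] = (float)src[i];
}
```

You would run this into a pinned staging buffer and then cudaMemcpy the floats to the device; a decent compiler may auto-vectorise the plain cast loop too, so it is worth benchmarking both.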
Thanks for your replies! I’ll look into both options and let you know which one works best performance-wise. I need to read up on SSE usage and what requirements it puts on my project, but since it needs to be compiled on every system it’s used on, that probably won’t be a problem.