unsigned short to float conversion on GPU

I’m working on a problem where the data arrives as unsigned shorts (2 bytes each), which I first up-convert to floats for calculation (using cufft and some cublas routines). I’m currently doing this conversion on the CPU, but that will eventually become a bottleneck because the CPU is busy with other work, so I wonder if there’s a smart way to do the conversion on the GPU instead. (I can imagine some non-smart approaches with memcpy’s, but suspect that’s not the right approach.)

A search through the forum didn’t find anything, but perhaps I missed it, so apologies if this is duplicated elsewhere.

Copy your array of shorts to device memory, and then extend your kernel to cast each element to float just before using it? Alternatively, if your threads use multiple elements at the same time but the access is still localized, have your kernel start by loading the shorts from device memory, converting them, and storing the floats in shared memory; then sync the threads and work from shared memory. Or, as another alternative, if each of your threads accesses many elements of this array, allocate a second array of floats in device memory, write a small kernel that does nothing but the conversion, and run it before your other kernel(s). Which is best really depends on what you’re going to do with the values once they are converted to floats.
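The last option above, a separate conversion kernel, might look roughly like the sketch below. The names `ushort_to_float` and `upload_and_convert` are made up for illustration, and the block size of 256 and the cap of 1024 blocks are arbitrary assumptions rather than tuned values.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Grid-stride loop: each thread widens one or more unsigned shorts to float.
__global__ void ushort_to_float(const unsigned short* in, float* out, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = static_cast<float>(in[i]);   // exact: every 16-bit integer fits in a float
}

// Hypothetical host helper: uploads the raw shorts, converts them on the
// device, and leaves the floats in d_out (already allocated by the caller).
cudaError_t upload_and_convert(const unsigned short* h_in, float* d_out, size_t n)
{
    if (n == 0) return cudaSuccess;

    unsigned short* d_in = nullptr;
    cudaError_t err = cudaMalloc(&d_in, n * sizeof(unsigned short));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_in, h_in, n * sizeof(unsigned short), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;                        // assumed block size
        size_t needed = (n + threads - 1) / threads;
        int blocks = needed > 1024 ? 1024 : (int)needed; // assumed cap; the loop covers the rest
        ushort_to_float<<<blocks, threads>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaFree(d_in);
    return err;
}
```

After the call, `d_out` is an ordinary device pointer to floats, so it can be handed to CUFFT or CUBLAS like any other buffer. A side benefit is that only 2 bytes per element cross the bus instead of 4.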

Use textures, and you get the conversion for “free”: with the normalized-float read mode, the texture hardware returns each 16-bit integer as a float scaled into [0, 1]. I haven’t used CUBLAS much lately, though, so I don’t remember if it supports reading from textures. If not, you can read from the texture, write the values into another buffer in global memory, and use that buffer for CUBLAS/CUFFT/etc.
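A rough sketch of the texture route follows. It uses the texture object API from current CUDA toolkits rather than the texture references that were usual when this thread was written, and the names `tex_to_float` and `convert_via_texture` are hypothetical. It assumes `d_in` came from `cudaMalloc`, that `n` is positive, and that `n` is within the device's limit for textures bound to linear memory.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Each fetch returns value / 65535 as a float; multiplying by scale undoes that.
__global__ void tex_to_float(cudaTextureObject_t tex, float* out, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i) * scale;
}

// Hypothetical helper: wraps the device array of shorts in a texture object,
// reads it back as floats, and writes them to d_out in global memory.
cudaError_t convert_via_texture(const unsigned short* d_in, float* d_out, int n)
{
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<unsigned short*>(d_in);
    res.res.linear.desc = cudaCreateChannelDesc<unsigned short>();
    res.res.linear.sizeInBytes = (size_t)n * sizeof(unsigned short);

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeNormalizedFloat;   // integer -> float in [0, 1]

    cudaTextureObject_t tex = 0;
    cudaError_t err = cudaCreateTextureObject(&tex, &res, &td, nullptr);
    if (err != cudaSuccess) return err;

    const int threads = 256;                     // assumed block size
    tex_to_float<<<(n + threads - 1) / threads, threads>>>(tex, d_out, n, 65535.0f);
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    return err;
}
```

Because the values pass through a normalized float and are then rescaled, the result may differ from the original integer by a rounding error in the last bits. If exact integer values matter, a plain cast in a kernel is probably the simpler choice; pass a `scale` of 1.0f if the [0, 1] range is what you want.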

EDIT: Read section 3.2.4 of the programming guide for more info.