Does anyone have more efficient ways to copy a “double” array to texture memory from host to device?

Recently I have been confronted with the problem of how to copy a “double” array to texture memory (a “cuArray”) from host to device.

I have solved this problem with the following steps:
1. Copy the host’s “double” array to a host’s “float” array.
2. Copy the host’s “float” array to texture memory (“cuArray”) using “cudaMemcpyToArray” so that I can use the interpolation feature.
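
Roughly, my current code looks like this (a minimal sketch; the function name uploadAsFloatArray and the 1D layout are just for illustration, and cuArray is assumed to have been created with cudaMallocArray() and a float channel descriptor):

#include <cuda_runtime.h>
#include <vector>

void uploadAsFloatArray(const double *h_src, size_t n, cudaArray_t cuArray)
{
    // Step 1: down-convert on the host
    std::vector<float> h_tmp(n);
    for (size_t i = 0; i < n; ++i)
        h_tmp[i] = (float)h_src[i];

    // Step 2: copy the float buffer into the CUDA array
    cudaMemcpyToArray(cuArray, 0, 0, h_tmp.data(),
                      n * sizeof(float), cudaMemcpyHostToDevice);
}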

But I think there must be a more efficient way to solve this problem; I would appreciate it if someone could help me.

There are no copy mechanisms that can automagically convert an array of doubles to an array of floats as part of the copy process, so your current approach looks reasonable. Employing an intermediate conversion buffer on the host (as opposed to on the device) seems appropriate, as it reduces the volume of data that needs to be copied over PCIe. Given the low accuracy of the texture interpolation (much less than single precision), I am wondering why the source data is kept as double in the first place. Could you switch the upstream computation to float, avoiding the intermediate copy?

As you are aware, there is no support for double textures. If you don’t need texture interpolation, you can still utilize the texture cache by accessing double data via tex1Dfetch(). To do so, you would bind an int2 texture (call it texRef) to the array of doubles, and reinterpret the int2 returned by tex1Dfetch() using __hiloint2double():

int2 val = tex1Dfetch(texRef, index);

double r = __hiloint2double(val.y, val.x);
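
Putting it together, a minimal sketch using the texture reference API (the names texRef, fetchDoubles, and d_data are illustrative, not something prescribed by CUDA):

texture<int2, 1, cudaReadModeElementType> texRef;

__global__ void fetchDoubles(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int2 val = tex1Dfetch(texRef, i);         // read through the texture cache
        out[i] = __hiloint2double(val.y, val.x);  // reassemble the double
    }
}

// host side: bind the texture reference to a device buffer of n doubles
// cudaBindTexture(NULL, texRef, d_data, n * sizeof(double));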

Thank you for your suggestions. The source data comes from MATLAB, where it is more convenient to keep it as “double”. So I must convert it to “float” if I want to take advantage of texture interpolation.

You could allocate two small (~64 KB) pinned buffers and use streams and asynchronous copies to convert into one buffer while copying the other to the device. Because the conversion overlaps the transfer, you can achieve the same throughput as a synchronous copy from unpinned memory.
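
For illustration, a rough sketch of that double-buffered staging scheme, assuming a linear float destination d_dst on the device; the names (stagedConvertAndCopy, CHUNK, etc.) are hypothetical and error checking is omitted. To land directly in the cuArray instead, the cudaMemcpyAsync() call would become cudaMemcpyToArrayAsync() with the byte offset passed as wOffset.

#include <cuda_runtime.h>

// convert h_src (n doubles) to float and copy to d_dst, overlapping the
// host-side conversion with the PCIe transfer via two pinned staging buffers
void stagedConvertAndCopy(const double *h_src, float *d_dst, size_t n)
{
    const size_t CHUNK = 65536 / sizeof(float);      // ~64 KB per buffer
    float *h_pinned[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost((void **)&h_pinned[b], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }
    int buf = 0;
    for (size_t off = 0; off < n; off += CHUNK, buf ^= 1) {
        size_t cnt = (n - off < CHUNK) ? (n - off) : CHUNK;
        cudaStreamSynchronize(stream[buf]);          // wait until this buffer's
                                                     // previous copy has finished
        for (size_t i = 0; i < cnt; ++i)             // convert into this buffer while
            h_pinned[buf][i] = (float)h_src[off + i];// the other may still be copying
        cudaMemcpyAsync(d_dst + off, h_pinned[buf],
                        cnt * sizeof(float),
                        cudaMemcpyHostToDevice, stream[buf]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaFreeHost(h_pinned[b]);
        cudaStreamDestroy(stream[b]);
    }
}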