I want to overlap kernel execution with a cudaMemcpy3D call that copies data from a linear array in device memory to a 3D CUDA array, also in device memory. I need to bind a texture to this CUDA array so I can use hardware linear filtering.
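For reference, the copy I'm trying to overlap looks roughly like this (a sketch, not my exact code; devLinear, volArray, copyStream, and the extents are placeholder names, and I'm assuming cudaMemcpy3DAsync in a non-default stream is the right way to even attempt the overlap):

```cuda
// Device-to-device copy: linear buffer -> 3D cudaArray (texture-bound).
cudaExtent extent = make_cudaExtent(nx, ny, nz);

cudaMemcpy3DParms p = {0};
p.srcPtr   = make_cudaPitchedPtr(devLinear,          // linear device buffer
                                 nx * sizeof(float), // row pitch in bytes
                                 nx, ny);
p.dstArray = volArray;                               // 3D CUDA array
p.extent   = extent;
p.kind     = cudaMemcpyDeviceToDevice;

// Issued in a non-default stream, hoping it overlaps with a kernel
// running in another stream -- which the guide below says won't
// happen for copies involving CUDA arrays.
cudaMemcpy3DAsync(&p, copyStream);
```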
While researching this, I found the following in the Programming Guide v3.2 (note the underlining and bold face):
[indent]Overlap of Data Transfer and Kernel Execution
Some devices of compute capability 1.1 and higher can perform copies between
page-locked host memory and device memory concurrently with kernel execution.
Applications may query this capability by calling cudaGetDeviceProperties()
and checking the deviceOverlap property. This capability is currently supported
only for memory copies that do not involve CUDA arrays or 2D arrays allocated
through cudaMallocPitch() (see Section 3.2.1).[/indent]
Can someone explain why this is so? Is there any way around this?
I suppose I could just bind a 1D texture to the linear array and do the interpolation myself, but then my kernel might slow down too much.
Related questions: when will we be able to write to a texture? Or at least have access to the hardware interpolation functions (however complicated that may be)?
Might there be a way to trick a texture into binding to a CUDA array created for a surface?
Thanks in advance.