Overlap of Data Transfer and Kernel Execution

I want to overlap kernel execution with a cudaMemcpy3D call which copies data from a linear array in device memory to a 3D CUDA array also in device memory (of course). I need to bind a texture to this CUDA array so I can use linear filtering.

In researching this functionality, I found the following in the Programming Guide v3.2 (note the underlined and bold text):

[indent]Overlap of Data Transfer and Kernel Execution
Some devices of compute capability 1.1 and higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Applications may query this capability by calling cudaGetDeviceProperties() and checking the deviceOverlap property. This capability is currently supported only for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch() (see Section 3.2.1).[/indent]

Can someone explain why this is so? Is there any way around this?

I suppose I could just bind a 1D texture to the linear array and do the interpolation myself… but then my kernel might slow down too much.
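For what it's worth, the fallback I'm describing would look roughly like this (a sketch only; linTex, lerpFetch, d_data and N are placeholder names, not from any SDK sample):

```cuda
// Bind a 1D texture to plain linear device memory and do the linear
// interpolation by hand with two tex1Dfetch() reads.
texture<float, 1, cudaReadModeElementType> linTex;

__device__ float lerpFetch(float x)
{
    int   i = __float2int_rd(x);          // integer part of the coordinate
    float f = x - (float)i;               // fractional part
    float a = tex1Dfetch(linTex, i);      // tex1Dfetch: no hardware filtering
    float b = tex1Dfetch(linTex, i + 1);
    return a + f * (b - a);               // manual linear interpolation
}

// Host side, with d_data a float* from cudaMalloc holding N elements:
// cudaBindTexture(0, linTex, d_data, N * sizeof(float));
```

The extra fetch and the arithmetic per sample are exactly the overhead I'm worried about compared to hardware filtering.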


Related questions: When will we be able to save to a texture? Or at least have access to hardware interpolation functions (however complicated it may be)?
Might there be a way to trick a texture into binding to a CUDA array created for a surface?

Thanks In Advance :thanks:

Maybe I am misunderstanding your questions.

(1) CUDA surfaces are basically writeable textures. See Section B.9 of the CUDA 3.2 Programming Guide.
(2) Texture interpolation can be enabled by setting the filterMode attribute of a texture reference to cudaFilterModeLinear; see the CUDA 3.2 Programming Guide.
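To illustrate point (2), enabling hardware filtering is just a matter of setting the texture reference attributes before binding (tex3DRef, setupTexture and cu_array are placeholder names here):

```cuda
// A 3D texture reference with hardware linear filtering enabled.
texture<float, 3, cudaReadModeElementType> tex3DRef;

void setupTexture(cudaArray *cu_array)
{
    tex3DRef.filterMode     = cudaFilterModeLinear;   // hardware interpolation
    tex3DRef.normalized     = false;                  // unnormalized coordinates
    tex3DRef.addressMode[0] = cudaAddressModeClamp;
    tex3DRef.addressMode[1] = cudaAddressModeClamp;
    tex3DRef.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(tex3DRef, cu_array);
}
```

With that in place, tex3D() fetches at fractional coordinates return trilinearly filtered values.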

There is no filter mode property for a surface listed anywhere.

Are you saying that I can write to a CUDA array through a surface reference, and then read the data from that same CUDA array using a texture reference with linear filtering enabled?

Have you done this??

Will it work with 3D textures?


Yes. A 1D/2D CUDA array can be written through a surface reference, and you can then fetch interpolated values from that same CUDA array through a texture reference.
Please check the simpleSurfaceWrite example under NVIDIA_GPU_Computing_SDK/C/src. Note, however, that surface writes to 3D CUDA arrays are not supported in version 3.2.
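The write-then-read pattern looks roughly like this (a sketch, not the SDK sample itself; outSurf, inTex, writeKernel and readKernel are placeholder names, and the cudaArray must be created with the cudaArraySurfaceLoadStore flag):

```cuda
// Write into a CUDA array through a surface reference, then read it back
// with filtered texture fetches from the same array.
surface<void, 2> outSurf;
texture<float, 2, cudaReadModeElementType> inTex;  // filterMode = cudaFilterModeLinear

__global__ void writeKernel(int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        // surf2Dwrite takes the x coordinate in BYTES
        surf2Dwrite((float)(x + y), outSurf, x * sizeof(float), y);
}

__global__ void readKernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        // +0.5f samples texel centers; fractional offsets give interpolation
        out[y * width + x] = tex2D(inTex, x + 0.5f, y + 0.5f);
}
```

On the host, bind both references to the same array (cudaBindSurfaceToArray and cudaBindTextureToArray) and launch writeKernel before readKernel.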

However, my preliminary tests show that writing to the CUDA array through a surface reference is SLOWER than cudaMemcpyToArray(). Basically I modified the simpleSurfaceWrite example to copy data from GPU global memory to a GPU texture array, and compared:

cutilSafeCall( cudaMemcpyToArray(cu_array, 0, 0, d_data, size, cudaMemcpyDeviceToDevice) );  // 0.4 ms on average

cutilSafeCall( cudaBindSurfaceToArray(output_surface, cu_array) );  // 0.8 ms on average
surfaceWriteKernel<<< dimGrid, dimBlock >>>(d_data, 512, 512);

I am running a Linux desktop and using a Tesla C2050.

Can anyone point out the potential benefits of surface read/write? In my experience, data transfer from GPU global memory to GPU texture memory is time-consuming compared to the GPU computation itself, and I am looking for alternative ways to avoid this transfer as much as possible. Any suggestions would be greatly appreciated.