performance of cudaBindTextureToArray

I was looking at the convolutionTexture example and noticed that a device-to-device memory copy is done after the convolutionRowGPU call. It copies the result of convolutionRowGPU back into the input array that the input texture was originally bound to.

Is there a reason why you wouldn’t just re-bind the texture to the temporary array instead? Is there a large performance hit when you re-bind a texture to different memory, or was this done for the sake of simplicity? At the very least, I would think you could use a separate texture object for the temporary array and have convolutionColGPU use that texture as input. Does the number of texture objects in use affect performance as well?

Thanks.

After more investigation, I see now that there is no way to access a cudaArray other than through cudaMemcpy[To/From]Array or through a texture. A kernel cannot write its output directly into the array, which is why the example does the memcpy.

I see that you can only make 2D textures from cudaArrays. Is there a plan to support 2D textures on memory allocated with cudaMallocPitch()? Or a plan to allow writing to a texture? Either would remove the memcpy from the convolutionTexture example. The main reason I want to use textures is to get the wrap addressing mode. I guess I can do the wrapping in code, but I thought it would be easier to just use the texture.
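
For reference, here is a rough sketch of what "doing the wrapping in code" might look like on a pitched buffer. The names wrapIndex, fetchWrapped and HOST_DEVICE are made up for illustration and are not part of the SDK example. It assumes the row stride in bytes is the pitch that cudaMallocPitch() returns, and it only does integer-index wrapping with no filtering, so it is not a full replacement for a texture fetch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper macro: lets the same functions build as plain C++
// or as CUDA host/device code when compiled with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Map any integer index, including negative ones, into [0, n).
HOST_DEVICE inline int wrapIndex(int i, int n)
{
    assert(n > 0);
    int r = i % n;              // the remainder takes the sign of i
    return r < 0 ? r + n : r;   // shift negative remainders back into range
}

// Read element (x, y) of a pitched 2D float buffer, wrapping on both axes.
// pitchBytes is the row stride in bytes (assumed to come from cudaMallocPitch).
HOST_DEVICE inline float fetchWrapped(const float* base, size_t pitchBytes,
                                      int width, int height, int x, int y)
{
    const char* row = reinterpret_cast<const char*>(base)
                    + static_cast<size_t>(wrapIndex(y, height)) * pitchBytes;
    return reinterpret_cast<const float*>(row)[wrapIndex(x, width)];
}
```

In a kernel, the row and column passes could call fetchWrapped on ordinary global memory, so the output of one pass could feed the next without copying back into a cudaArray. The cost is an extra modulo per axis on every read and the loss of the texture cache.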

It looks like I need to wrap my brain around the convolutionSeparable example.