Using 2D texture fetches without binding to an array: can it be done?

OK, I am removing distortion from images and then rectifying a stereo pair. I have two lookup maps stored in textures: one for the distortion map and one for the rectification map.

I first run an undistort kernel that uses the distortion map texture to create an undistorted image. I then want to bind that result to a new texture for a rectification kernel, which uses the rectification map texture to look up where to fetch from in the undistorted result texture. (I hope this makes sense.)
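Roughly, the rectification kernel looks like this (a simplified sketch; the actual names and types in my code differ):

texture<float2, 2, cudaReadModeElementType> tex_rectMapL;  // rectification lookup map (placeholder name)
texture<float4, 2, cudaReadModeElementType> tex_unImageL;  // undistorted image bound to a cudaArray

__global__ void rectifyKernel(float4* dOut, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Look up where this output pixel should sample from, then fetch
    // the undistorted image at that (possibly fractional) location.
    float2 src = tex2D(tex_rectMapL, x + 0.5f, y + 0.5f);
    dOut[y * width + x] = tex2D(tex_unImageL, src.x + 0.5f, src.y + 0.5f);
}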

I have it working, but after the undistort kernel I am doing the following on the host:

CUDA_SAFE_CALL( cudaMemcpyToArray( cu_arrayLu, 0, 0, dResultDatafL, RES_H*RES_V*sizeof(float4), cudaMemcpyDeviceToDevice));
CUDA_SAFE_CALL( cudaBindTextureToArray( tex_unImageL, cu_arrayLu, channelDesc32));

My question is whether I have to do the cudaMemcpyToArray(), or whether I can bind the undistort result memory directly to a texture. I tried doing a plain cudaBindTexture(), but I get black images back; I assume this is because I am doing 2D texture fetches on linear memory.

I want to avoid the device-to-device memcpy, as it is a bottleneck (it reduces the frame rate by ~10 Hz).

I don’t think there is a way to do 2D tex cache lookups using just device memory.
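What I do instead is use tex1Dfetch on plain device memory and compute the index myself; something like this (a sketch only, names made up):

texture<float4, 1, cudaReadModeElementType> tex_linImage;  // bound with cudaBindTexture to a cudaMalloc'd buffer

__global__ void fetchLinearKernel(float4* dOut, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Compute the 1D offset yourself; the fetch still goes through the texture cache.
    dOut[y * width + x] = tex1Dfetch(tex_linImage, y * width + x);
}

// Host side: cudaBindTexture(NULL, tex_linImage, dImage, width * height * sizeof(float4));

You lose the 2D addressing modes and hardware filtering with this approach, so any sub-pixel interpolation has to be done by hand, which may matter for rectification.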

As a side note, cudaMemcpyToArray seems to have a performance bug in CUDA 1.1. Search the forums for details: IIRC, it only manages to sustain ~8 GiB/s when it should be capable of around 70 GiB/s.

You're right; the programming guide states that you can only do 1D texture fetching from device memory.

I am seeing ~1.5 GB/s, but I am copying from cudaMalloc()-allocated memory to a 2D array. After some reading, I understand that I should be allocating with cudaMallocPitch(), since I will always copy to a 2D array. If I use cudaMallocPitch() to allocate a 1024x768 image, will padding be added to my allocation for coalescing even though 1024x768 is a multiple of 32? And if no padding is added, will this result in faster device-to-device transfers to a 2D CUDA array?
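In other words, what I have in mind is something like this (just a sketch of the idea, not my actual code):

float4* dResultDatafL;
size_t  pitch;  // pitch in bytes, as returned by cudaMallocPitch

// Pitched allocation for the RES_H x RES_V float4 result image
cudaMallocPitch((void**)&dResultDatafL, &pitch, RES_H * sizeof(float4), RES_V);

// ... undistort kernel writes into dResultDatafL using pitch ...

// 2D copy from the pitched buffer into the cudaArray the texture is bound to
cudaMemcpy2DToArray(cu_arrayLu, 0, 0, dResultDatafL, pitch,
                    RES_H * sizeof(float4), RES_V,
                    cudaMemcpyDeviceToDevice);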

Padding should only affect memory coalescing when reading directly from device memory. I've never noticed a difference when using padded memory with cudaMemcpyToArray.

This is the post I was referring to about the performance problems in cudaMemcpyToArray: http://forums.nvidia.com/index.php?showtopic=56025&hl=cudamemcpytoarray
NVIDIA has acknowledged the issue, and hopefully it will be fixed in the next CUDA release, though of course they won't comment on that.

I know that back in CUDA 0.8 cudaMemcpyToArray reached 70 GiB/s, so the latest version's problems must be a software issue of some kind. I'm not sure whether 1.0 has the problem or not.

D.5.3 of the programming manual states:

So are you saying that you don’t see a difference when doing 2D copies to arrays with memory allocated with cudaMalloc() vs. cudaMallocPitch()?

Thanks for the link on the bandwidth issue. I'll be sure to keep an eye out for any updates.

Hmm, you have a point. I didn't see any issues in the bandwidth test I did in that post, but I was already allocating a memory buffer with a pitch that was a multiple of 16, so using cudaMallocPitch() would not have made a difference.

I haven't used any arrays in my application since CUDA 0.8 (I switched to tex1Dfetch and device memory), so I don't have any recent experience with them other than that benchmark. You could modify the benchmark to use a width of 513 or something and see whether performance gets worse, as the manual section you quoted suggests.
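The timing part would look something like this (rough, untested sketch; the sizes and repetition count are just examples):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Odd width to force the copy to deal with an unaligned row size
    const int width = 513, height = 768, reps = 100;
    const size_t bytes = width * height * sizeof(float4);

    float4* dSrc;
    cudaMalloc((void**)&dSrc, bytes);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray* dstArray;
    cudaMallocArray(&dstArray, &desc, width, height);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyToArray(dstArray, 0, 0, dSrc, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.2f GiB/s\n", (double)reps * bytes / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0));

    cudaFreeArray(dstArray);
    cudaFree(dSrc);
    return 0;
}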