OK, I am removing distortion from images and then rectifying a stereo pair. I have two lookup maps stored in textures, one for the distortion map and one for the rectification map.
I first run an undistort kernel that uses the distortion map texture to create an undistorted image. I then want to bind that result to a new texture for use in a rectification kernel, which will use the rectification map texture to fetch data from the undistorted result texture. (I hope this makes sense.)
I have it working, but after the undistort kernel I am doing the following on the host:
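(Roughly this; the buffer and texture names below are placeholders for my actual code, and I'm using float pixels just for illustration.)

// 2D texture reference the rectification kernel reads through (file scope)
texture<float, 2, cudaReadModeElementType> undistortTex;

// undistortResult: linear device memory written by the undistort kernel
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *undistortArray;
cudaMallocArray(&undistortArray, &desc, width, height);

// device-to-device copy of the undistort output into the CUDA array...
cudaMemcpyToArray(undistortArray, 0, 0, undistortResult,
                  width * height * sizeof(float),
                  cudaMemcpyDeviceToDevice);

// ...then bind the array so the rectification kernel can tex2D() from it
cudaBindTextureToArray(undistortTex, undistortArray);
rectifyKernel<<<grid, block>>>(rectifiedResult, width, height);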
My question is whether I have to do the cudaMemcpyToArray(), or whether I can just bind the undistort result memory directly to a texture. I tried just doing a cudaBindTexture(), but I get black images back; I assume this is because I am doing 2D texture fetches on linear memory.
I want to avoid the device-to-device memcpy, as it is a bottleneck (it costs me roughly 10 Hz of frame rate).
I don’t think there is a way to do 2D tex cache lookups using just device memory.
As a side note, cudaMemcpyToArray seems to have a performance bug in CUDA 1.1. Search the forums for details: IIRC, it only manages to sustain ~8 GiB/s when it should be capable of 70.
You're right, the programming guide states that you can only do 1D tex fetching from device memory.
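So if I wanted to skip the array and stay in plain device memory, I would have to fall back to a 1D binding and do the 2D addressing myself, something along these lines (just a sketch, not my actual kernel):

// 1D texture reference bound straight to linear device memory
texture<float, 1, cudaReadModeElementType> undistortTex1D;

__global__ void rectifyKernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // manual 2D -> 1D addressing; in the real kernel the source
    // coordinates would come from the rectification map lookup
    out[y * width + x] = tex1Dfetch(undistortTex1D, y * width + x);
}

// host side: bind the cudaMalloc'd undistort output directly, no copy
size_t offset = 0;
cudaBindTexture(&offset, undistortTex1D, undistortResult,
                width * height * sizeof(float));

But then I lose the 2D locality of the texture cache, which is why I'd rather keep the 2D fetches.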
I am seeing ~1.5 GB/s, but I am copying from cudaMalloc() allocated memory to a 2D array. After some reading I understand that I should be allocating with cudaMallocPitch(), since I will always copy to a 2D array. If I use cudaMallocPitch() to allocate a 1024x768 image, will padding be added to my allocation for coalescing even though the 1024-pixel row width is already a multiple of 32? And if no padding is added, will this result in faster device-to-device transfers to a 2D CUDA array?
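i.e. something like this (a sketch of what I mean, checking whether the returned pitch actually includes any padding):

// pitched allocation: the runtime may pad each row, and the actual row
// stride in bytes comes back in 'pitch'
float *undistortResult;
size_t pitch;
cudaMallocPitch((void **)&undistortResult, &pitch,
                1024 * sizeof(float), 768);
// for a 1024-wide float image the requested row is 4096 bytes, so I'd
// compare pitch against 1024 * sizeof(float) to see if padding was added

// pitch-aware device-to-device copy into the 2D CUDA array
cudaMemcpy2DToArray(undistortArray, 0, 0, undistortResult, pitch,
                    1024 * sizeof(float), 768, cudaMemcpyDeviceToDevice);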
Padding should only affect memory coalescing when reading directly from device memory. I’ve never noticed a difference when using padded memory and cudaMemcpyToArray.
I know that back in CUDA 0.8 cudaMemcpyToArray reached 70 GiB/s, so the latest version's problem must be a software regression of some kind. I'm not sure whether 1.0 has the problem or not.
Hmm, you have a point. I didn't see any issues in the bandwidth test I did in that post. But I was already allocating a buffer whose pitch was a multiple of 16, so using cudaMallocPitch would not have made a difference.
I haven't used any arrays in my application since CUDA 0.8 (I switched to tex1Dfetch and device memory), so I don't have any recent experience with them other than that benchmark. You could modify the benchmark to use a width of 513 or something and see whether that changes the performance for the worse, as the manual section you quoted suggests; something like the sketch below would do.
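(A quick-and-dirty version, with my own names rather than the original benchmark's; run it once with W = 512 and once with W = 513 and compare.)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // crude bandwidth check for cudaMemcpyToArray: copy a W x H float
    // image into a 2D array many times and report GB/s
    const int W = 512, H = 768, ITER = 100;

    float *src;
    cudaMalloc((void **)&src, W * H * sizeof(float));

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *dst;
    cudaMallocArray(&dst, &desc, W, H);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < ITER; ++i)
        cudaMemcpyToArray(dst, 0, 0, src, W * H * sizeof(float),
                          cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = (double)ITER * W * H * sizeof(float);
    printf("width %d: %.2f GB/s\n", W, bytes / (ms * 1.0e6));

    cudaFreeArray(dst);
    cudaFree(src);
    return 0;
}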