Faster texture access: efficiently correcting transposed results

I noticed a small problem when reading from a texture for filtering and writing the filter result back to device memory.

Consider the following simplified CUDA program:

texture<float, 2> tex;

__global__ void fooFilter(float* odata, int width) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	float result = tex2D(tex, x, y); // do some filtering
	odata[y * width + x] = result;
}

int main() {
	dim3 threads(16, 16);
	dim3 blocks(WIDTH / threads.x, HEIGHT / threads.y);
	fooFilter<<<blocks, threads>>>(d_result, WIDTH);
}



Since the origin of the texture coordinate system is in the lower left corner rather than the top left corner, the kernel transposes the texture when writing it back to odata. So to get a correct result matrix, one has to write to the result array as odata[x * width + y] = result; instead.

However, accessing device memory this way almost doubles the kernel's runtime, presumably because the writes are no longer coalesced.

Is there a way to solve this problem with a smaller performance cost? I already tried making tex a 1D texture and using shared memory for the write-back. Neither made a difference in runtime.

I think this is easier than it looks :rolleyes:

Instead of flipping the buffer, you can flip the texture access. That is really simple in the case of normalized texture coordinates:

	float result = tex2D(tex, x, 1 - y); // flipped read; do some filtering
	odata[y * width + x] = result;       // the write stays coalesced

I think that this should work…

Oh well… of course I had already tried something like that, but I was doing it wrong because my real project is a bit more complicated. Now I've found that mistake and everything works fine. Thanks!