Faster texture access: efficiently correcting transposed results

I noticed a small problem when reading from a texture for filtering and writing the filter result back to device memory.

Consider the following simplified CUDA program:

texture<float, 2> tex;

__global__ void fooFilter(float* odata, int width) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	float result = tex2D(tex, x, y); // do some filtering
	odata[y * width + x] = result;
}

int main() {
	dim3 threads(16, 16);
	dim3 blocks(WIDTH / threads.x, HEIGHT / threads.y);
	fooFilter<<<blocks, threads>>>(d_result, WIDTH);
}



Since the origin of the texture coordinate system is in the lower left corner rather than the top left corner, the kernel transposes the texture when writing it back to odata. So to get a correct result matrix, one has to write to the result array as odata[x * width + y] = result; instead.

However, accessing device memory this way almost doubles the kernel's runtime, presumably because the writes are no longer coalesced.

Is there a way to solve this problem with a smaller performance cost? I already tried making tex a 1D texture and using shared memory for the write-back. Neither made a difference in runtime.

I think this is easier than it looks :rolleyes:

Instead of flipping the buffer, you can flip the texture access. That is really simple in the case of normalized texture coordinates:

	float result = tex2D(tex, x, 1 - y); // flipped read; do some filtering
	odata[y * width + x] = result;       // the write stays coalesced

I think that this should work…

Oh well… of course I had already tried something like that, but I was doing it wrong because my real project is a bit more complicated. Now I've found that mistake and everything works fine. Thanks!