texture access performance issues


Can someone explain to me why the following kernel gives such poor results?

__global__ void textureCopy(uchar4* texOut) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	texOut[y * gridDim.x * blockDim.x + x] = tex2D(tex, x, y);
}

I’m using a uchar4 texture (static texture<uchar4, 2, cudaReadModeElementType> tex;), so there are no bank conflicts. Also, the kernel configuration has these values:

threads (32, 12, 1)

blocks (99, 195, 1)

The results obtained were (after 5 kernel executions):

Execution time: 1.347171ms

Processing rate: 5502.73 Mpixels/sec

Bandwidth: 22.010920 GB/s

Only ~22 GB/s in a single texture copy…

You could try using __mul24 instead of *, or even shifts where possible. Apart from that, I see nothing wrong or suboptimal in your code.
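For what it’s worth, the index computation with __mul24 would look roughly like this (a sketch based on the kernel above; with your launch configuration, all of these products fit within 24 bits):

```cuda
__global__ void textureCopy24(uchar4* texOut) {
	// __mul24 does a 24-bit integer multiply, which was cheaper than a
	// full 32-bit * on pre-Fermi hardware; valid only if operands fit in 24 bits
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	int pitch = __mul24(gridDim.x, blockDim.x);  // row width in elements
	texOut[__mul24(y, pitch) + x] = tex2D(tex, x, y);
}
```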

I’m getting about the same results… Maybe these are the maximum values… =/

It could be… if you want maximum performance you should use coalesced memory reads, not textures

But my memory reads are coalesced!

Your writes are coalesced; your reads go through the texture units, which have no such concept.

It depends if one is reading data that can be heavily cached - in that case texture bandwidth can be ~150 Gbytes/sec.

In any case one should be able to get at least 60 GB/s on a one-to-one texture copy (see GPUbench - it uses float4s). In my own code I have noticed that switching from float2 to ushort2, which should halve the performance, has a much smaller effect. I had chalked this up to becoming latency bound, but perhaps the texture unit just isn’t very efficient with these data types?

It also seems possible that you could actually be write bound. Have you tried just writing out a constant and seeing how fast that goes?
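A minimal write-only kernel for that test could look like this (a sketch using the same launch configuration as the original post, with the texture fetch replaced by a constant):

```cuda
__global__ void constantWrite(uchar4* texOut) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	// Write a fixed value instead of a texture fetch; if this still runs
	// at ~22 GB/s, the bottleneck is the writes, not the texture reads
	texOut[y * gridDim.x * blockDim.x + x] = make_uchar4(0, 0, 0, 0);
}
```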

Yep, I tried that. My problem was the same as in this post: http://forums.nvidia.com/index.php?showtopic=54599

I was using the timing functions from CUT instead of cudaEventRecord etc.
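For reference, timing with CUDA events instead of the CUT timers looks roughly like this (host-side sketch; the kernel name, `blocks`, `threads`, and `d_out` are from the posts above):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
textureCopy<<<blocks, threads>>>(d_out);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // block until the kernel has actually finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Unlike a host-side CPU timer, the events are recorded on the GPU, so the measurement isn’t distorted by asynchronous kernel launch.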