Writing global memory 14 times slower than reading?

Hello,

I am observing that writing to global memory takes 14 times as much time as reading from it. I am using an integrated GPU. Am I doing something wrong, or are differences of this order to be expected?

The kernel looks essentially like this:

for (iter = 0; iter < ITERATIONS; iter++) {
    temp = some_function();
    matrix[tid] = temp;
    __syncthreads();
}

matrix is in global memory and temp is in a register. By changing

matrix[tid] = temp

to

temp = matrix[tid]

the code becomes 14 times faster (but obviously also useless).

Thanks,

Nikolaus

It’s faster because the compiler is smart. It’s properly optimizing the program to remove the now-dead code. Since you never use the variable temp, it’s thrown out, and the compiler probably reduces your whole test kernel to a no-op.
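To see why, here is a minimal sketch of the read-only variant (the kernel name, the float element type, and the ITERATIONS value are invented for illustration). Since temp's final value never leaves the thread, the load feeding it is dead, and the compiler can collapse the loop:

#define ITERATIONS 10000   // assumed; the real value isn't shown in the thread

__global__ void read_only(const float *matrix)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float temp;
    for (int iter = 0; iter < ITERATIONS; iter++) {
        temp = matrix[tid];   // dead: temp is never used afterwards
        __syncthreads();
    }
    // temp is never stored anywhere, so the compiler may reduce
    // this whole body to a no-op
}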

No, that’s not it, because the following code also shows the speedup:

temp = 0;
for (iter = 0; iter < ITERATIONS; iter++) {
    temp += some_function();
    temp += matrix[tid];
    __syncthreads();
}
matrix[tid] = temp;

This is a compute capability 1.1 device, so I believe matrix[tid] is not cached either.

It really is compiler optimization. In the second example, if temp isn’t stored to global memory, the whole loop can be discarded as redundant.

But “matrix” is in global memory, and temp is stored there. Let’s not get bogged down in code details, though: should I expect global reads and writes to take the same amount of time?

If anything, writes should have lower effective latency than reads, because they are “fire and forget”.

Under ideal circumstances (all reads and writes coalesced), writes and reads should perform fairly equally. I benchmarked this long ago; look for a bw_test posted to the forums by MisterAnderson42. The trick to beating the dead-code optimizer is to put the value you read into shared memory: then the optimizer assumes that some other thread may read it and doesn’t optimize the read away (unless CUDA 3.2 has gotten smarter in this regard).
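A sketch of that trick (invented kernel name, assumed 256-thread blocks and float elements; this is not MisterAnderson42's actual bw_test):

#define BLOCK_SIZE 256   // assumed launch configuration

__global__ void read_bw(const float *matrix, int n_iter)
{
    __shared__ float sink[BLOCK_SIZE];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float temp = 0.0f;
    for (int iter = 0; iter < n_iter; iter++)
        temp += matrix[tid];
    // Parking the result in shared memory defeats dead-code
    // elimination: the compiler must assume another thread in the
    // block might read this slot, so the loads above are kept.
    sink[threadIdx.x] = temp;
}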

In more common situations, where coalescing isn’t always 100% (and on compute 2.0 and newer), writes are often slower than reads. Why? Reads are cached in L1, while writes are cached only in L2, and the crossbar that moves a write from the SM to L2 deals only in 128-byte contiguous segments. If your write is uncoalesced, multiple 128-byte transactions are made to the L2.
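For illustration, here is roughly what the two patterns look like (invented kernel names, 4-byte float elements assumed). A warp of 32 threads writing consecutive floats touches a single 128-byte segment; with a stride, the same 32 writes scatter across up to 32 segments:

__global__ void write_coalesced(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = 1.0f;           // consecutive threads hit consecutive
                               // addresses: one 128-byte segment per warp
}

__global__ void write_strided(float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid * stride] = 1.0f;  // stride > 1 spreads a warp's writes over
                               // many segments (out must be sized for
                               // the largest index, tid_max * stride)
}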