Writing global memory 14 times slower than reading?

Hello,

I am observing that writing to global memory takes 14 times as much time as reading from it. I am using an integrated GPU. Am I doing something wrong, or are differences of this order to be expected?

The kernel looks essentially like this:

for (iter = 0; iter < ITERATIONS; iter++) {
    temp = some_function();
    matrix[tid] = temp;
    __syncthreads();
}

matrix is in global memory and temp is in a register. By changing

matrix[tid] = temp

to

temp = matrix[tid]

the code becomes 14 times faster (but obviously also useless).

Thanks,

Nikolaus

It’s faster because the compiler is smart. It’s properly optimizing the program to remove the now-dead code. Since you never use the variable temp, it’s thrown out, and the compiler probably reduces your whole test kernel to a no-op.
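To see why, here is a minimal sketch of the read-only variant (the kernel name, the float element type, and the ITERATIONS value are invented for illustration). Since temp's final value never leaves the thread, the load feeding it is dead, and the compiler can collapse the loop:

#define ITERATIONS 10000   // assumed; the real value isn't shown in the thread

__global__ void read_only(const float *matrix)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float temp;
    for (int iter = 0; iter < ITERATIONS; iter++) {
        temp = matrix[tid];   // dead: temp is never used afterwards
        __syncthreads();
    }
    // temp is never stored anywhere, so the compiler may reduce
    // this whole body to a no-op
}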

No, that’s not it, because the following code also shows the speedup:

temp = 0;
for (iter = 0; iter < ITERATIONS; iter++) {
    temp += some_function();
    temp += matrix[tid];
    __syncthreads();
}
matrix[tid] = temp;

This is a compute capability 1.1 device, so I believe matrix[tid] is not cached either.

It really is compiler optimization. In the second example, if temp isn’t stored to global memory, the whole loop can be discarded as redundant.

But “matrix” is in global memory, and temp is stored there. Let’s not get bogged down in code details, though: should I expect global reads and writes to take the same amount of time?

If anything, writes should have lower effective latency than reads, because they are “fire and forget”.

Under ideal circumstances (all reads and writes coalesced), writes and reads should perform fairly equally. I benchmarked this long ago; look for a bw_test posted to the forums by MisterAnderson42. The trick to beating the dead-code optimizer is to put the value you read into shared memory: then the optimizer assumes that some other thread may read it and doesn’t optimize the read away (unless CUDA 3.2 has gotten smarter in this regard).
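A sketch of that trick (invented kernel name, assumed 256-thread blocks and float elements; this is not MisterAnderson42's actual bw_test):

#define BLOCK_SIZE 256   // assumed launch configuration

__global__ void read_bw(const float *matrix, int n_iter)
{
    __shared__ float sink[BLOCK_SIZE];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float temp = 0.0f;
    for (int iter = 0; iter < n_iter; iter++)
        temp += matrix[tid];
    // Parking the result in shared memory defeats dead-code
    // elimination: the compiler must assume another thread in the
    // block might read this slot, so the loads above are kept.
    sink[threadIdx.x] = temp;
}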

In more common situations, where coalescing isn’t always 100% (and on compute 2.0 and newer), writes are often slower than reads. Why? Reads are cached in L1, while writes are cached only in L2, and the crossbar that moves a write from the SM to L2 deals only in 128-byte contiguous segments. If your write is uncoalesced, multiple 128-byte transactions are made to the L2.
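For illustration, here is roughly what the two patterns look like (invented kernel names, 4-byte float elements assumed). A warp of 32 threads writing consecutive floats touches a single 128-byte segment; with a stride, the same 32 writes scatter across up to 32 segments:

__global__ void write_coalesced(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = 1.0f;           // consecutive threads hit consecutive
                               // addresses: one 128-byte segment per warp
}

__global__ void write_strided(float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid * stride] = 1.0f;  // stride > 1 spreads a warp's writes over
                               // many segments (out must be sized for
                               // the largest index, tid_max * stride)
}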