I am observing that writing to global memory takes 14 times as much time as reading from it. I am using an integrated GPU. Am I doing something wrong, or are differences of this order to be expected?
Reading looks faster because the compiler is smart: it is properly optimizing away the now-dead code. Since you never use the variable `temp`, the read is thrown out, and the compiler probably reduces your whole read-test kernel to a no-op.
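A minimal sketch of the kind of kernel that falls into this trap (the names are hypothetical, not the asker's actual code):

```cuda
// Hypothetical read "benchmark": the loaded value never affects any
// observable result, so the compiler's dead-code elimination can drop
// the load entirely and the kernel times as (nearly) a no-op.
__global__ void read_only(const float *matrix)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float temp = matrix[idx];   // load with no side effect...
    (void)temp;                 // ...likely optimized away
}
```

A write kernel has no such escape hatch: a store to global memory is an observable side effect, so it cannot be removed, which is one way a naive test ends up comparing a real write against an eliminated read.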
But “matrix” is in global memory, and that’s where temp gets its value from. Anyway, let’s not get bogged down in code details. So should I expect global reads and writes to take the same amount of time?
Under ideal circumstances (all reads/writes coalesced), writes and reads should perform roughly equally. I benchmarked this long ago; look for a bw_test posted to the forums by MisterAnderson42. The trick to beating the dead code optimizer is to put the value you read into shared memory: the optimizer then assumes some other thread may read it and doesn’t optimize the load away (unless CUDA 3.2 has gotten smarter in this regard).
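The shared-memory trick might look like this (a sketch only; names are made up and this is not MisterAnderson42's actual bw_test):

```cuda
// Keep the load alive by parking the value in shared memory, which the
// compiler must treat as potentially observable by other threads in the
// block. Assumes a block size of 256 threads.
__global__ void read_bw(const float *matrix, float *sink)
{
    __shared__ float scratch[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[threadIdx.x] = matrix[idx];   // load survives dead-code elimination
    __syncthreads();

    // One dependent global write per block, so the shared-memory store
    // itself cannot be elided either; its cost is negligible next to
    // the bandwidth being measured.
    if (threadIdx.x == 0)
        sink[blockIdx.x] = scratch[0];
}
```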
In more common situations, where coalescing isn’t always 100% (and on compute capability 2.0 and newer), writes are often slower than reads. Why? Reads are cached in L1, while writes are cached only in L2, and the crossbar that moves a write from the SM to L2 deals only in 128-byte contiguous segments. If your write is “uncoalesced”, then multiple 128-byte transactions are made to the L2.