Time to write to global memory

Hello,
I notice that writing to global memory takes a long time in my program.
I would like to know whether the time to write a 32-bit variable is the same as for an 8-bit variable?

Yes, it is as expensive to write a single byte as it is to write an entire (aligned) word of 32 bits. This implies that it is much more efficient to write a given amount of data in chunks of words than in single bytes.
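
For illustration, here is a minimal sketch (kernel and buffer names are made up, not from your code) of writing the same data one byte per thread versus one aligned 32-bit chunk (uchar4) per thread. For a fixed amount of data, the second version needs a quarter as many threads and issues a quarter as many store instructions:

```cuda
__global__ void write_bytes(unsigned char *dst, unsigned char val, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = val;                              // one 8-bit store per thread
    }
}

__global__ void write_words(uchar4 *dst, unsigned char val, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n4) {
        dst[i] = make_uchar4(val, val, val, val);  // one aligned 32-bit store per thread
    }
}
```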

How did you time the writes?
There is a common fallacy that the time needed to write out data can be determined from commenting out the write and comparing the runtime to the original runtime of the unmodified code. This methodology usually overestimates the time for writing out results to memory, as the highly optimizing CUDA compiler will also remove any code needed to obtain the results that are now unused.
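
A hypothetical example of that fallacy (the kernel is illustrative, not from the original code): with the store present, the loop below is compiled and executed; comment the store out and the compiler sees that the result is unused and deletes the loop as well, so the "saved" time includes the computation, not just the write.

```cuda
__global__ void expensive(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 1000; ++k) {
            acc += sinf(in[i] + k);    // deliberately expensive work
        }
        out[i] = acc;  // comment out this store and the compiler
                       // eliminates the loop above as dead code
    }
}
```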

In most situations, I would claim “vastly overestimates” is the appropriate characterization. It should also be noted that the timing of global memory writes should rarely matter in CUDA programming; to first order, writes are “fire and forget”.

Now, the latency of global memory loads is a different matter, and the CUDA compiler will work hard to schedule these early in support of the basic latency tolerance provided by CUDA’s multi-threading. By using restricted pointers, the CUDA programmer can offer additional information to the compiler to further ease such load re-ordering.
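
A minimal sketch of what is meant by restricted pointers (the kernel name is illustrative): declaring the pointers const and __restrict__ asserts that the buffers do not alias, which gives the compiler more freedom to hoist the global loads early.

```cuda
__global__ void axpy(float * __restrict__ y,
                     const float * __restrict__ x,
                     float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // with the no-alias guarantee, the loads of x[i] and y[i]
        // can be scheduled well before their results are consumed
        y[i] = a * x[i] + y[i];
    }
}
```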

I also do that, tera. In one of my programs the time drops from 20 ms to 3 ms just because I remove the write to global memory.
What should I expect for the time if I use 32-bit instead of 8-bit writes? Should the time to write the same amount of data be divided by 4?

Can you post your results, xav12358? I am interested.

Regarding the comment that global writes are ‘fire and forget’, what is the reason that global memory writes are not as critical (with regard to coalescing etc.) as global memory loads? And what about global memory loads that go through the texture path (__ldg or texture objects), am I right that the texture cache acts as a ‘coalescer’?

When executing a global read, there are always dependent instructions waiting for the load data. With writes, there are no dependent instructions that need to wait, other than in the case of a write-read dependency, in which case we are back to the issue of load latency. I don’t know whether GPUs have store-to-load forwarding like CPUs to speed up this case while stores are still in flight. So writes are rarely performance critical: as long as the data gets to DRAM eventually, things are A-OK.
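
To illustrate the difference, here is a made-up sketch (not anyone’s actual code): the load stalls the warp because a later instruction needs its result, while the store has no consumer within the kernel.

```cuda
__global__ void load_then_store(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];    // load: the multiply below depends on it, so the
                            // warp waits here unless other warps can be
                            // scheduled to hide the latency
        out[i] = v * 2.0f;  // store: nothing in this kernel depends on it,
                            // the warp issues it and moves on ("fire and forget")
    }
}

// A write-read dependency on global memory is typically expressed across
// kernel launches, at which point the cost shows up again as load latency
// in the consumer kernel:
//   producer<<<grid, block>>>(buf, n);
//   consumer<<<grid, block>>>(out, buf, n);   // reads what producer wrote
```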

Since the texture path is non-coherent, it is used only for data that is read-only for the duration of the kernel, so we never encounter a write-read dependency. Apart from the non-coherency, it differs from a regular L1 cache in that it is organized differently, optimized for access patterns with 2-dimensional locality.
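
A sketch of routing read-only loads through that path (kernel and buffer names are illustrative; assumes a device with compute capability 3.5 or later). Marking the input pointer const __restrict__ can let the compiler use the read-only data path on its own; __ldg() requests it explicitly.

```cuda
__global__ void blur_row(float *out, const float * __restrict__ in, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < w - 1 && y < h) {
        // __ldg() loads go through the read-only (texture) data path
        float left  = __ldg(&in[y * w + x - 1]);
        float mid   = __ldg(&in[y * w + x]);
        float right = __ldg(&in[y * w + x + 1]);
        out[y * w + x] = (left + mid + right) / 3.0f;
    }
}
```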