Time to write to global memory

Hello,
I notice that writing to global memory takes a long time in my program.
I would like to know whether the time to write a 32-bit variable is the same as for an 8-bit variable?

Yes, it is as expensive to write a single byte as it is to write an entire (aligned) word of 32 bits. This implies that it is much more efficient to write a given amount of data in chunks of words than in single bytes.
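
For illustration, here is a minimal sketch (kernel and buffer names are made up, not from your code) of writing the same data one byte per thread versus one aligned 32-bit chunk (uchar4) per thread. For a fixed amount of data, the second version needs a quarter as many threads and issues a quarter as many store instructions:

```cuda
__global__ void write_bytes(unsigned char *dst, unsigned char val, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = val;                              // one 8-bit store per thread
    }
}

__global__ void write_words(uchar4 *dst, unsigned char val, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n4) {
        dst[i] = make_uchar4(val, val, val, val);  // one aligned 32-bit store per thread
    }
}
```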

How did you time the writes?
There is a common fallacy that the time needed to write out data can be determined from commenting out the write and comparing the runtime to the original runtime of the unmodified code. This methodology usually overestimates the time for writing out results to memory, as the highly optimizing CUDA compiler will also remove any code needed to obtain the results that are now unused.
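
A hypothetical example of that fallacy (the kernel is illustrative, not from the original code): with the store present, the loop below is compiled and executed; comment the store out and the compiler sees that the result is unused and deletes the loop as well, so the "saved" time includes the computation, not just the write.

```cuda
__global__ void expensive(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 1000; ++k) {
            acc += sinf(in[i] + k);    // deliberately expensive work
        }
        out[i] = acc;  // comment out this store and the compiler
                       // eliminates the loop above as dead code
    }
}
```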

In most situations, I would claim “vastly overestimates” is the appropriate characterization. It should also be noted that the timing of global memory writes should rarely matter in CUDA programming; to first order, writes are “fire and forget”.

Now, the latency of global memory loads is a different matter, and the CUDA compiler will work hard to schedule these early in support of the basic latency tolerance provided by CUDA’s multi-threading. By using restricted pointers, the CUDA programmer can offer additional information to the compiler to further ease such load re-ordering.
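
A minimal sketch of what is meant by restricted pointers (the kernel name is illustrative): declaring the pointers const and __restrict__ asserts that the buffers do not alias, which gives the compiler more freedom to hoist the global loads early.

```cuda
__global__ void axpy(float * __restrict__ y,
                     const float * __restrict__ x,
                     float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // with the no-alias guarantee, the loads of x[i] and y[i]
        // can be scheduled well before their results are consumed
        y[i] = a * x[i] + y[i];
    }
}
```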

I also do that, tera. In one of my programs the time drops from 20 ms to 3 ms just because I remove the write to global memory.
What should I expect for the time if I use 32-bit instead of 8-bit writes? Should the time to write the same amount of data be divided by 4?

Can you post your results, xav12358? I am interested.

Regarding the comment that global writes are ‘fire and forget’, what is the reason that global memory writes are not as critical (with regard to coalescing etc.) as global memory loads? And what about global memory loads that go through the texture path (__ldg or texture objects), am I right that the texture cache acts as a ‘coalescer’?

When executing a global read, there are always dependent instructions waiting for the load data. With writes, there are no dependent instructions that need to wait, other than in the case of a write-read dependency, in which case we are back to the issue of load latency. I don’t know whether GPUs have store-to-load forwarding like CPUs to speed up this case while stores are still in flight. So writes are rarely performance critical: as long as the data gets to DRAM eventually, things are A-OK.
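
To illustrate the difference, here is a made-up sketch (not anyone’s actual code): the load stalls the warp because a later instruction needs its result, while the store has no consumer within the kernel.

```cuda
__global__ void load_then_store(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];    // load: the multiply below depends on it, so the
                            // warp waits here unless other warps can be
                            // scheduled to hide the latency
        out[i] = v * 2.0f;  // store: nothing in this kernel depends on it,
                            // the warp issues it and moves on ("fire and forget")
    }
}

// A write-read dependency on global memory is typically expressed across
// kernel launches, at which point the cost shows up again as load latency
// in the consumer kernel:
//   producer<<<grid, block>>>(buf, n);
//   consumer<<<grid, block>>>(out, buf, n);   // reads what producer wrote
```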

Since the texture path is non-coherent, it is used only for data that is read-only for the duration of the kernel, so we never encounter a write-read dependency. Apart from the non-coherency, it differs from a regular L1 cache in that it is organized differently, optimized for access patterns with 2-dimensional locality.
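
A sketch of routing read-only loads through that path (kernel and buffer names are illustrative; assumes a device with compute capability 3.5 or later). Marking the input pointer const __restrict__ can let the compiler use the read-only data path on its own; __ldg() requests it explicitly.

```cuda
__global__ void blur_row(float *out, const float * __restrict__ in, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < w - 1 && y < h) {
        // __ldg() loads go through the read-only (texture) data path
        float left  = __ldg(&in[y * w + x - 1]);
        float mid   = __ldg(&in[y * w + x]);
        float right = __ldg(&in[y * w + x + 1]);
        out[y * w + x] = (left + mid + right) / 3.0f;
    }
}
```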