Why is Reg->shared->global faster than Reg->global?

Hi! I am trying two versions of storing a register’s value to global memory, and I am not sure why the first version is faster. Thank you!!!

“Objection, your honor, calls for speculation!” “Code or it did not happen.”

We do not know (1) what you are measuring, (2) how you are measuring it, and (3) how big the difference between the two variants is.

When faced with a question like this, you would want to get in the habit of firing up the CUDA profiler to look at various metrics related to the memory hierarchy. If there is a significant reproducible performance difference, there should also be a significant difference in at least one of these metrics.

Oh, thank you!! Maybe my questions should be:

How many cycles does a register write to shared memory take?
How many cycles does a shared memory write to global memory take?
How many cycles does a register write to global memory take?

I can find very little information on this… I have heard that registers take 0 cycles (!!!) to access, shared memory 20-30 cycles, and global memory 100 cycles or more? Is that for loads or stores? That information is quite vague…

Thank you again for your reply and attention!!!

This information isn’t documented (by NVIDIA) or specified anywhere. It will vary with the GPU and GPU architecture, and to some degree it may vary with the actual activity on the GPU. Remember that the GPU is, first and foremost, a latency-hiding throughput machine. These latencies don’t matter if the GPU is busy doing other work at the same time.

Register write to shared memory: this will typically be on the order of 20 cycles.

Shared memory write to global memory: there isn’t any such instruction on the GPU. The transfer would be shared memory to register, then register to global memory. Shared memory to register would also be on the order of 20 cycles.
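
To make that concrete, here is a minimal sketch (kernel and variable names are purely illustrative); the shared-to-global assignment compiles to roughly a shared-memory load into a register followed by a global store:

```cpp
// Illustrative only: there is no direct shared->global instruction, so the
// final assignment compiles to roughly
//   LDS Rx, [smem_addr]    // shared memory -> register
//   STG [gmem_addr], Rx    // register      -> global memory
// Launch with 256 threads per block so the shared array is fully covered.
__global__ void copy_via_shared(const float *in, float *out)
{
    __shared__ float smem[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    smem[threadIdx.x] = in[i];   // global -> register -> shared
    __syncthreads();

    out[i] = smem[threadIdx.x];  // shared -> register -> global
}
```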

Register write to global memory: this would typically be on the order of 100-400 cycles.

You can get some information by studying benchmark papers. Here is an example. Referring to table 3.1 in that paper, under the “Shared” section, the no-conflict latency for TU104 is measured at 19 cycles. That should correspond to either register->shared or shared->register.
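
In case it is useful, latency numbers like that are usually obtained by timing a chain of dependent loads with the clock. Here is a minimal sketch of the idea for shared memory (array size, iteration count, and names are illustrative; launch it with a single thread of a single block so nothing else hides the latency):

```cpp
// Illustrative sketch: estimate ~cycles per dependent shared-memory load.
__global__ void shared_latency(unsigned int *chase_result,
                               unsigned int *cycles_per_load, int iters)
{
    __shared__ unsigned int s[1024];

    // Build a pointer chase: each element holds the index of the next one.
    for (int i = 0; i < 1024; ++i)
        s[i] = (i + 1) % 1024;
    __syncthreads();

    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = s[j];                          // each load depends on the previous one
    long long stop = clock64();

    *chase_result = j;                     // keeps the chain from being optimized away
    *cycles_per_load = (unsigned int)((stop - start) / iters);
}
```

Launched as `shared_latency<<<1, 1>>>(d_res, d_cyc, 100000);`, the reported value should land in the same ballpark as the paper’s figure, give or take the loop overhead.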


This would typically be on the order of 100-400 cycles.

It will depend on whether that memory address is in the L2 cache or not. One should also mention that the address could actually be mapped to another device and/or the host, in which case the write might take longer than that, IIANM.
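
As a hedged sketch of the mapped-to-host case (the runtime API calls are the standard ones, but the setup is only illustrative): with mapped pinned host memory, the kernel’s global store actually travels over PCIe to host RAM, so it takes considerably longer than a store that lands in the L2 or device DRAM:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void write_one(int *p) { *p = 42; }   // this "global" store targets host RAM

int main()
{
    // Allow mapping of pinned host memory into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_mapped = nullptr, *d_alias = nullptr;
    cudaHostAlloc(&h_mapped, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);

    // The store issued by the kernel is routed over PCIe to host memory,
    // so its latency is much higher than a store to device DRAM / L2.
    write_one<<<1, 1>>>(d_alias);
    cudaDeviceSynchronize();

    printf("%d\n", *h_mapped);   // prints 42
    cudaFreeHost(h_mapped);
    return 0;
}
```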


The latency of writes to global memory is going to vary by DRAM type: GDDR6, GDDR6X, or HBM2. There will be some variation depending on whether ECC is used or not. It will further vary with the operating frequency of the GPU memory interface, which is why it is best practice to state latency in nanoseconds (ns) rather than in clock cycles.
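
As a small worked example of the cycles-versus-nanoseconds point (numbers are illustrative; the conversion just needs the clock in which the cycles were counted, e.g. the SM clock when timing with clock64()):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // prop.clockRate is reported in kHz; e.g. 400 cycles at 1.4 GHz is about 286 ns.
    double ghz    = prop.clockRate / 1e6;   // kHz -> GHz (= cycles per ns)
    double cycles = 400.0;                  // illustrative latency in SM cycles
    printf("%.0f cycles at %.2f GHz = %.0f ns\n", cycles, ghz, cycles / ghz);
    return 0;
}
```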

I have not looked at published data for this very recently, but as best I recollect, the latency of global memory access on GPUs is roughly on the order of 250 ns to 300 ns, versus about 80 ns for an x86 CPU accessing DDR4 system memory. Generally speaking, the memory latency of recent GPU architectures (say, Pascal and up) has decreased compared to early GPU architectures. A similar trend occurred in x86 CPUs earlier, but it stopped about the time DDR3 became dominant (around 2010), and latency has since stagnated or even increased slightly. Given the current state of semiconductor technology, I would be surprised if significant latency improvements were achieved for last-level memory coupled to either CPUs or GPUs in the near future.

As @Robert_Crovella points out, write latency to a GPU’s global memory should not matter (to first order) for application performance. In a throughput architecture like GPUs, writes are basically “fire & forget”.
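
To illustrate the fire & forget behavior (a sketch with illustrative names): a global store does not stall the issuing thread, so independent instructions after it keep executing and the store’s latency is hidden behind that work:

```cpp
// Illustrative sketch: the store to out[i] is issued and the thread moves on;
// the arithmetic that follows does not depend on the store, so it overlaps
// with the write draining to L2/DRAM instead of waiting for it.
__global__ void fire_and_forget(const float *in, float *out, float *other, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i] * 2.0f;
    out[i] = x;                    // store issued; the thread does not stall here

    float y = 0.0f;
    for (int k = 0; k < 32; ++k)   // independent work that overlaps the write
        y += x * k;
    other[i] = y;
}
```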


Oh, you mean I can let the GPU do computation while it is writing to global memory! Thank you!!! Great idea! Also, many thanks for the detailed info!!!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.