Off-chip memory access

I have a few questions about memory access to the GPU’s off-chip memory.

  1. When programming with CUDA, I can use the cudaMemcpy function to copy data between the CPU and GPU.
    To do this, the PCIe controller and the DMA engine are involved (I don’t know where the DMA engine lives :p).
    Can you elaborate on how cudaMemcpy works and what the data flow looks like?

  2. If host memory is allocated as unified memory, data allocated there can be used directly by the GPU.
    I understand that in this case a migration of the data occurs.
    2.1 Does this migration only work for a certain data size, or does it happen on a per-page basis?
    2.2 If the migration works on a per-page basis, does the GPU have its own MMU for managing pages, or does it use the host-side MMU?
    2.3 Is there a way to migrate data without storing it in off-chip memory, i.e. pass it over PCIe and hand it immediately to the pipeline that needs it? If this doesn’t exist, I’m wondering why it hasn’t been created.

A cudaMemcpy____ operation issued in host code, to transfer data between host and device, will generally use a DMA controller. I don’t think this is fully documented anywhere. The basic principle is that when pinned memory is the source or destination, the runtime programs a DMA controller with the start and end addresses of the transfer, and then turns control of the transfer over to the DMA controller. At some point, the DMA controller issues PCIe bus cycles to move the data.
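To make the pinned-memory case concrete, here is a minimal sketch. The API calls (`cudaMallocHost`, `cudaMemcpy`) are real CUDA runtime functions; the comments describing what the runtime does with the DMA engine restate the description above and are not an official specification:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_pinned, *d_buf;

    // Pinned (page-locked) host allocation: the OS guarantees these pages
    // won't be paged out, so a DMA engine can address them directly.
    cudaMallocHost(&h_pinned, N * sizeof(float));
    cudaMalloc(&d_buf, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) h_pinned[i] = 1.0f;

    // Per the description above: the runtime programs a copy (DMA) engine
    // with the source/destination addresses and length, then the engine
    // issues the PCIe bus cycles to move the data, offloading the CPU.
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```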

When the transfer is from pageable host memory, the mechanics are a bit different, but I’m fairly confident the DMA controller is still involved. I don’t wish to describe the basic concept of a DMA controller here. If you need that level of description, google the term, or read up on how an Intel 8257 device used to work. The basic concept has not changed: it offloads the CPU by issuing memory traffic from a dedicated piece of programmable hardware.
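One place the pinned/pageable difference shows up in practice is asynchronous copies. A sketch (the staging behavior in the comments reflects the commonly described mechanics, not a documented guarantee):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 24;
    char *h_pinned, *h_pageable, *d_buf;

    cudaMallocHost(&h_pinned, bytes);        // page-locked host memory
    h_pageable = (char *)malloc(bytes);      // ordinary pageable host memory
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned source: the call can return immediately and the DMA engine
    // transfers in the background, overlapping with host work.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    // Pageable source: the runtime must first stage the data through an
    // internal pinned buffer (a CPU-side copy), so the call is not fully
    // asynchronous even though the API accepts it.
    cudaMemcpyAsync(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```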

Some description of differences between pageable and pinned transfers is here

There are no particular limits on the size of a migratable object, other than those you would infer from host/system memory size. Device memory in a demand-paged environment can be oversubscribed by the UM system. The demand-paged transfers themselves happen on a per-page basis, just as they do in a modern host operating system such as Windows or Linux.
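A minimal sketch of the managed-memory allocation being discussed; the comments about when pages fault and migrate summarize the demand-paging behavior described above for Pascal-class and newer GPUs:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const size_t N = 1 << 20;
    float *data;

    // One allocation, visible to both host and device. Pages migrate on
    // demand, per page: touching them on the host faults them into host
    // memory; touching them in a kernel faults them into device memory.
    cudaMallocManaged(&data, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) data[i] = 1.0f;  // pages resident on host

    scale<<<(N + 255) / 256, 256>>>(data, N);       // pages migrate to device
    cudaDeviceSynchronize();

    float first = data[0];  // first host touch migrates pages back
    (void)first;

    cudaFree(data);
    return 0;
}
```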

Yes, a Pascal or newer GPU has its own MMU for managing pages. Demand-paged migration of data will typically involve two MMUs: one at the target and one at the source.

The UM system only migrates data within the global space. That means data from host memory will end up in device memory or vice versa (or in another GPU device memory).

To consume data directly from host memory “to the pipeline that needs it” without a trip through device memory, a programming model for that exists already - it is through the use of pinned memory directly accessed by GPU device code, a technique which has been referred to in CUDA lore as “zero copy”.
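A sketch of the zero-copy pattern just described, using mapped pinned memory (`cudaHostAllocMapped` and `cudaHostGetDevicePointer` are real runtime APIs; the single-thread kernel is only for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void sum(const float *in, float *out, size_t n) {
    float s = 0.0f;
    // Each load dereferences host memory directly over PCIe; the data
    // never lands in device (off-chip) memory.
    for (size_t i = 0; i < n; ++i) s += in[i];
    *out = s;
}

int main() {
    const size_t N = 1024;
    float *h_in, *d_in, *d_out;

    // Mapped pinned allocation: device code can read/write this host
    // memory through a device pointer.
    cudaHostAlloc(&h_in, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_in, h_in, 0);
    cudaMalloc(&d_out, sizeof(float));

    for (size_t i = 0; i < N; ++i) h_in[i] = 1.0f;

    sum<<<1, 1>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return 0;
}
```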

Additional information that may be of interest on these concepts is available in unit 6 of this online training series.

Somewhere in the literature you will find a statement like:
“All accesses to global memory go through L2, including copies to/from CPU host.”
Also see this post.

I am not an expert, and my understanding may be dated, but I think of it this way:

When the on-chip memory (L2) is needed for something else, the chip evicts (writes back) L2 cache lines to off-chip memory to make room for that something else.

So the use of off-chip memory is determined by L2 utilization, not by the type of hardware that is the source or destination of the data.

Also bear in mind that it takes a kernel hundreds of clocks to access L2.

I got the numbers below from various sources, and they are not gospel, but they are the ones I refer to when thinking about various latencies in the device:

Ballpark GPU Latencies                      min   max 
  typical instruction (register read)         6     6  clocks
  shared memory load                         20    30  clocks
  read L1 cache                              30    80  clocks
  read L2 cache via L1                      190   250  clocks
  read from off-chip memory via L1 and L2   480   680  clocks

I can update this table if someone has numbers more representative of modern devices.

It may be convenient to ignore off-chip memory and think instead of the global memory space, which is a logical concept. Then consider how your code affects L2. E.g., many scattered small accesses to the global memory space will make inefficient use of L2. Or, a queue in global memory that is much larger than it needs to be will be less efficient than a small queue whose data can remain resident in L2. Etc.
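The “scattered small accesses” point can be illustrated with a pair of kernels. This is only a sketch of the access-pattern contrast, not a benchmark:

```cuda
// Adjacent threads read adjacent addresses: a warp's 32 loads coalesce
// into a few full cache lines, so every byte fetched into L2 is used.
__global__ void coalesced(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads read addresses far apart: each load touches a
// different cache line, so most of every line fetched into L2 is
// wasted, and the working set evicts to off-chip memory sooner.
__global__ void strided(const float *in, float *out,
                        size_t n, size_t stride) {
    size_t i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```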

Hi. Most of these numbers, up through Turing, may be found in Table 3.1 here.


I updated some numbers in my table based on Table 3.1 in the paper.

My numbers for reads from off-chip memory are consistent with Figure 3.5 in the paper.

Thanks for your reply!