Hi,
I’m experimenting with using persistent memory (Intel Optane DCPMM) via /dev/dax0.0 as a large host-memory region for GPU DMA.
Environment

- GPU: NVIDIA H100 PCIe
- CUDA: 12.8
- Host DRAM: 128 GiB
- Persistent memory: 2 TiB (/dev/dax0.0)
- The application calls cudaHostRegister() on a DAX-mapped region
Goal
Check whether a large DAX-mapped region (hundreds of GiB – 1 TiB) can be registered as pinned memory for GPU DMA.
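In case it helps, here is a minimal sketch of the code path in question. It assumes /dev/dax0.0 with a 2 MiB devdax alignment and a working CUDA install; error handling is mostly trimmed, and the 512 GiB length is just an example size:

```c
// Sketch: mmap a devdax region and pin it for GPU DMA. Compile with nvcc.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 512ULL << 30;               /* e.g. 512 GiB of the 2 TiB device */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin the DAX-backed range so the GPU can DMA to/from it. */
    cudaError_t err = cudaHostRegister(p, len, cudaHostRegisterDefault);
    printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
    /* In my environment this fails with cudaErrorMemoryAllocation
     * once the total pinned size exceeds ~123.9 GiB. */

    if (err == cudaSuccess) cudaHostUnregister(p);
    munmap(p, len);
    close(fd);
    return err == cudaSuccess ? 0 : 2;
}
```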
Observation
Even though /dev/dax0.0 is 2 TiB, cudaHostRegister() always fails with cudaErrorMemoryAllocation when the pinned size exceeds about 123.9 GiB.
Binary-search measurement result:
Try cudaHostRegister(0x12d25e000000, 133047517184 bytes) [123.91 GiB]
-> OK
Try cudaHostRegister(0x12d25e000000, 133049614336 bytes) [123.91 GiB]
-> FAIL: out of memory (2)
This almost matches “Host DRAM size – a small system margin”
(128 GiB DRAM – ~4 GiB).
It suggests that CUDA’s internal pinned-memory accounting effectively caps the total pinned memory at approximately the physical DRAM size, regardless of:
- The virtual-address backing store (DRAM vs. DAX persistent memory)
- The mapping type (/dev/dax0.0, which has no page cache)
- The availability of a large (2 TiB) physically contiguous pmem region
Questions
- Is it expected that the CUDA driver restricts the maximum pinned-memory size to (approximately) the host DRAM size?
  - i.e., even if the memory is backed by DAX (pmem) rather than DRAM?
- Is there any supported way to increase this limit?
  - For example, a driver option, kernel parameter, or environment variable.
- Is registering more than the physical DRAM size (here, >128 GiB) fundamentally unsupported in CUDA's current design?
- If there are plans to support DRAM-bypassing DMA to persistent memory (or CXL.mem) in the future, is any information available?
Any insight would be greatly appreciated.
Thanks.