cudaMallocHost with large memory failed with invalid argument

Hi,

When I try to allocate a large chunk of host memory with cudaMallocHost, it fails with "invalid argument". Smaller allocations succeed, so there seems to be an upper limit on the allocatable size. For example, the following minimal test code,

#include <cuda_runtime.h>
#include <iostream>

int main() {
  void* pinnedMemory = nullptr;
  cudaError_t cErr;

  for (std::size_t sizeGB = 1; sizeGB <= 10; ++sizeGB) {
    std::size_t nBytes = sizeGB * (1UL << 30); // Convert GB to bytes
    cErr = cudaMallocHost(&pinnedMemory, nBytes);
    if (cErr == cudaSuccess) {
      std::cout << "Successfully allocated " << sizeGB << " GB of pinned memory.\n";
      cudaFreeHost(pinnedMemory);
      pinnedMemory = nullptr;
    } else {
      std::cout << "Failed to allocate " << sizeGB << " GB: "
        << cudaGetErrorString(cErr) << "\n";
      break;
    }
  }
  return 0;
}

produces the following output:

❯ nvcc test_size.cu -g -O0 -lineinfo && ./a.out
Successfully allocated 1 GB of pinned memory.
Successfully allocated 2 GB of pinned memory.
Failed to allocate 3 GB: invalid argument

And when the above program exits with the failure, dmesg reports:

❯ sudo dmesg | tail -n 20
...
[10732.400239] Cannot map memory with base addr 0x70fba6000000 and size of 0xc0000 pages

I have a sufficient amount of memory on both the GPU (10 GB) and the CPU (64 GB).

And the weirdest thing is that, before this, I was able to allocate (at least) more than 6 GB of pinned memory with cudaMallocHost without any problem. I'm not sure whether an OS upgrade affected something related to this issue, or whether I need to consider a hardware failure.

I am using Arch Linux with kernel version 6.11.5; the NVIDIA driver version is 560.35.03, and CUDA is 12.6.

❯ uname -r
6.11.5-arch1-1

❯ nvidia-smi
Thu Oct 24 13:45:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0E:00.0  On |                  N/A |
|  0%   49C    P5             37W /  370W |    1102MiB /  10240MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Any help, suggestions, or advice for narrowing down the issue is really welcome.
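In case it helps anyone hitting the same wall: since allocations of 1-2 GB still succeed here, one untested workaround idea is to pin several smaller chunks instead of one large block. This is only a sketch under the assumption that the failure is per-call rather than a cumulative pinned-memory limit, which I have not verified:

```cpp
// Workaround sketch (assumption: the "invalid argument" failure is
// per-call, not a cumulative pinned-memory cap on this kernel).
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t chunkBytes = 1UL << 30; // 1 GB per chunk
  const int nChunks = 6;                    // target: 6 GB total
  std::vector<void*> chunks;

  for (int i = 0; i < nChunks; ++i) {
    void* p = nullptr;
    cudaError_t cErr = cudaMallocHost(&p, chunkBytes);
    if (cErr != cudaSuccess) {
      std::printf("chunk %d failed: %s\n", i, cudaGetErrorString(cErr));
      break;
    }
    chunks.push_back(p);
  }
  std::printf("pinned %zu chunk(s) of 1 GB\n", chunks.size());

  // Release all successfully pinned chunks before exit.
  for (void* p : chunks) cudaFreeHost(p);
  return 0;
}
```

Of course, this only papers over the problem if the code can tolerate the data being split across non-contiguous buffers.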

I’ve also experienced this error with nearly the exact same system configuration. Did you ever find a resolution to this issue?

I hope this helps someone: I was able to resolve this issue by downgrading the Linux kernel to version 6.8. In my experience, this issue also caused a system-wide memory leak that could not be reclaimed. Avoid 6.11.

It just went away after a system upgrade. I'm not sure what was causing the issue, but based on your experiments, it looks like the bug was in the Linux kernel. It currently works fine with kernel 6.14.1:

❯ uname -r
6.14.1-arch1-1

Just FYI, the driver and CUDA versions are:

❯ nvidia-smi
Tue May  6 16:16:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0E:00.0  On |                  N/A |
|  0%   51C    P8             35W /  370W |    1310MiB /  10240MiB |     39%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Found this with google-fu today.

Ran into the same issue a few days ago with a rather large code base, on nearly identical systems, except one was Ubuntu 22.04 with kernel 6.5.0-45-generic (no issue) and the other was Ubuntu 24.04 with kernel 6.11.0-26-generic (issue seen).

The repro case posted above fails on the 24.04 machine with kernel 6.11.0-26-generic and also leaks memory. Thanks for making me aware!
