cudaHostAlloc can only allocate about 3.5GB of memory out of 128GB

I wrote a loop that repeatedly allocates 16 MB of pinned memory with cudaHostAlloc:
cudaHostAlloc(&h_ptr, 16 * 1024 * 1024, cudaHostAllocDefault)

Only about 3.5GB can be allocated in total.

OS: Ubuntu 18.04.5 LTS
Memory information:

MemTotal:       131923304 kB
MemFree:        126884536 kB
MemAvailable:   128642668 kB
Buffers:          123244 kB
Cached:          2285032 kB
SwapCached:            0 kB
Active:          1090336 kB
Inactive:        1392524 kB
Active(anon):      75800 kB
Inactive(anon):     1804 kB
Active(file):    1014536 kB
Inactive(file):  1390720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:               208 kB
Writeback:             0 kB
AnonPages:         74756 kB
Mapped:            81004 kB
Shmem:              2976 kB
Slab:            1348176 kB
SReclaimable:     379508 kB
SUnreclaim:       968668 kB
KernelStack:       21200 kB
PageTables:       413804 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    68058800 kB
Committed_AS:   210225332 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     2329600 kB
DirectMap2M:    12204032 kB
DirectMap1G:    120586240 kB

CUDA toolkit version:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

The driver was installed together with the CUDA toolkit.
Is there some limitation in Linux that caps how much pinned memory can be allocated this way?

Maybe there is a bug in the portion of the code you haven't shown.

How did you determine that?

I didn’t seem to have any trouble allocating 128GB on a server with 512GB:

$ cat t21.cu
#include <iostream>

const int ls = 8192;
const int ms = 16*1024*1024;
float *d[ls];

int main(){

  for (int i = 0; i < ls; i++) {
    cudaError_t err = cudaHostAlloc(d+i, ms, cudaHostAllocDefault);
    if (err != cudaSuccess) {
      std::cout << "error at index: " << i << " " << cudaGetErrorString(err) << std::endl;
      break;
    }
  }
}
$ nvcc -o t21 t21.cu
$ time ./t21

real    0m53.250s
user    0m0.404s
sys     0m52.824s
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy
$ cat /proc/meminfo
MemTotal:       527945824 kB
MemFree:        520808416 kB
MemAvailable:   520965648 kB
Buffers:          100488 kB
Cached:          2972088 kB
SwapCached:        24836 kB
Active:          1521388 kB
Inactive:        1624144 kB
Active(anon):       5460 kB
Inactive(anon):    74900 kB
Active(file):    1515928 kB
Inactive(file):  1549244 kB
Unevictable:       31152 kB
Mlocked:           28080 kB
SwapTotal:       8388604 kB
SwapFree:        8275924 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         97780 kB
Mapped:            81384 kB
Shmem:              6400 kB
KReclaimable:     272180 kB
Slab:            1181028 kB
SReclaimable:     272180 kB
SUnreclaim:       908848 kB
KernelStack:       24176 kB
PageTables:         4276 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    272361516 kB
Committed_AS:    2118752 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      361808 kB
VmallocChunk:          0 kB
Percpu:           242176 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     7330852 kB
DirectMap2M:    417990656 kB
DirectMap1G:    111149056 kB
$

CUDA 12.0

My test code:

int main(int argc, char **argv) {
    long avail_pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    std::cout << "Page size:" << page_size << std::endl;
    std::cout << "Available pages:" << avail_pages << std::endl;
    std::cout << "Available physical memory:" << avail_pages * page_size << std::endl;

    const size_t BYTES = 1024ul * 1024 * 16;
    std::vector<void*> h_ptrs;
    size_t total_bytes = 0;
    while(true) {
        void *h_ptr;
        cudaError e = cudaHostAlloc(&h_ptr, BYTES, cudaHostAllocDefault);
        if (e != cudaSuccess) {
            std::cout << "Bytes allocated by cudaHostAlloc():" << total_bytes << std::endl;
            std::cout << cudaGetErrorString(e) << std::endl;
            break;
        }
        h_ptrs.push_back(h_ptr);
        total_bytes += BYTES;
    }
    for (auto ptr : h_ptrs) {
        cudaFreeHost(ptr);
    }
    return 0;
}

Output:

Page size:4096
Available pages:31820464
Available physical memory:130336620544
Bytes allocated by cudaHostAlloc():3808428032
OS call failed or operation not supported on this OS

It seems something went wrong in a call to an OS API.
I used strace to inspect the system calls; the last few lines of its output are as follows:

mmap(0x7ff47d000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff47d000000
munmap(0x7ff47c000000, 33554432)        = 0
ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffe70021bf0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe70021c80) = 0
mmap(0x7ff47a000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff47a000000
ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffe70021bf0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe70021c80) = 0
mmap(0x7ff47b000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff47b000000
munmap(0x7ff47a000000, 33554432)        = 0
ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffe70021bf0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe70021c80) = 0
mmap(0x7ff478000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff478000000
ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffe70021bf0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe70021c80) = 0
mmap(0x7ff479000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff479000000
munmap(0x7ff478000000, 33554432)        = 0
ioctl(4, _IOC(0, 0, 0x22, 0), 0x7ffe70021bf0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe70021c80) = 0
mmap(0x7ff476000000, 16777216, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff476000000
munmap(0x7ff476000000, 33554432)        = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Judging by the return values, every system call completed successfully.
In addition, I tested on a server with eight RTX 3080 cards.

I don’t find it helpful when people chop off the include files so that I can spend some of my time trying to guess at what those were.

I already know from experience that if I use an unlimited scheme like you have here to allocate pinned memory on this system, the system becomes unstable as it tries to pin nearly all of its memory.

However, if I limit the extent of the pinning, I can pin 128 GB with the code you have shown.

$ cat t23.cu
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
#if 0
    long avail_pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    std::cout << "Page size:" << page_size << std::endl;
    std::cout << "Available pages:" << avail_pages << std::endl;
    std::cout << "Available physical memory:" << avail_pages * page_size << std::endl;
#endif

    const size_t BYTES = 1024ul * 1024 * 16;
    std::vector<void*> h_ptrs;
    size_t total_bytes = 0;
    while(true) {
        void *h_ptr;
        cudaError e = cudaHostAlloc(&h_ptr, BYTES, cudaHostAllocDefault);
        if (e != cudaSuccess) {
            std::cout << "Bytes allocated by cudaHostAlloc():" << total_bytes << std::endl;
            std::cout << cudaGetErrorString(e) << std::endl;
            break;
        }
        h_ptrs.push_back(h_ptr);
        if (h_ptrs.size() > 8192) {std::cout << "128 GB" << std::endl; break;}
        total_bytes += BYTES;
    }
    for (auto ptr : h_ptrs) {
        cudaFreeHost(ptr);
    }
    return 0;
}

$ time ./t23
128 GB

real    0m51.639s
user    0m0.535s
sys     0m51.001s
$

So there appears to be something different about your Linux setup or system compared to mine; I don't have any further comments or suggestions. It seems that others have run into something similar.

It's strange: after rebooting the system, everything returned to normal. I can now allocate almost all of the remaining free memory in the system through cudaHostAlloc.

Keep in mind that cudaHostAlloc provides pinned pages, i.e. it tries to allocate physically contiguous memory suitable for DMA transfers.

A host system that has been in operation for a significant period of time is likely to experience memory fragmentation with regard to the physical address space, making it hard to impossible to allocate large physically contiguous chunks of memory. A freshly booted system on the other hand has minimal memory fragmentation.

That would be my working hypothesis as to what is going on here. cudaHostAlloc() is a thin wrapper around OS API calls, and as a consequence is completely at the mercy of the operating system's memory allocators. I do not think it is possible to predict how fast system memory becomes fragmented. There may be OS configuration settings that have an impact on this issue, but if so, I would not know what they are. You might want to investigate.

In fact, I had just reinstalled the latest version of the CUDA toolkit before the test, and the system was rebooted because the driver was reinstalled. After that, I didn't run any programs that allocate and release memory frequently. Moreover, the test only allocated 16 MB of pinned memory at a time; with 128 GB of total memory, it should not be the case that after barely 3 GB the kernel could no longer find a contiguous 16 MB region. Checking the strace output, I also found no failing system calls.

I am therefore more inclined to believe that some state maintained by the driver became inconsistent, so its internal logic concluded that no more memory could be allocated and returned an error. After a reboot the driver picks up correct system information again, and cudaHostAlloc works normally. Of course, I'm not an OS or driver expert; the above is just my speculation.

If you think you have evidence of a bug in the CUDA software stack, and you have a reproducer, you may want to file a bug with NVIDIA.