Pinned memory limit

Hi,

I find that when I allocate pinned memory using cudaMallocHost(), I can get only 4 GB, and I get an “unknown error” when I try to allocate more. My machine has 128 GB of physical memory (yes, 128 GB, and I can allocate that much using malloc()).

My GPU is a Tesla K20c, and I have verified via cudaGetDeviceProperties() that its compute capability is 3.5 (sm_35).

I think I should be able to allocate more pinned memory, and I don’t understand why it fails.

Here’s the cudaMallocHost code that I’m running:

#include <iostream>
#include <assert.h>

#include <glog/logging.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <sys/mman.h>

#define CUDA_CHECK(condition) \
  /* Code block avoids redefinition of cudaError_t error */ \
  do { \
    cudaError_t error = condition; \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)

using namespace std;

int main() {
  cudaDeviceProp deviceProp;
  int devID = 0;
  CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, devID));
  cout << "deviceProp.major = " << deviceProp.major << endl;
  cout << "deviceProp.minor = " << deviceProp.minor << endl;

  size_t count = 1000;
  size_t size = 100 * 1000 * 1000 * sizeof(float);
  for (size_t i = 0; i < count; i++) {
    cout << i << endl;
    void *host_array;
    CUDA_CHECK(cudaMallocHost(&host_array, size));
    // CHECK(host_array = malloc(size));
    // CHECK_EQ(mlock(host_array, size), 0);
  }
}

As its name hints, cudaMallocHost() is essentially a thin wrapper around your operating system’s API calls for allocating and pinning memory. The GPU in the system does not matter; what matters is the OS and any limits it may impose on allocating pinned memory. What operating system are you running? You may want to consult its documentation.

If you are on Linux, try running ulimit -l which should report the per-process limit on locked memory.

I have a cc 3.5 device on a RHEL 6.2 system with 64 GB of system memory and CUDA 7. Your code seems to run fine on that system up to about 160 iterations, at which point I think all of the system memory has been consumed by pinning. Each iteration should pin 400 MB, so that seems about right, if my arithmetic is correct.

I suspect the issue may be specific to your setup.

Also note that it’s recommended to use cudaHostAlloc instead of cudaMallocHost, but I don’t think that’s related to this issue (your code runs fine for me with cudaMallocHost).

Hi, I’m using Ubuntu 14.04. I have tried setting ulimit -l. Previously, the limit was 64 (memlock: max locked-in-memory address space, in KB), so it was only 64 KB, but I was still able to allocate 4 GB of pinned memory using cudaMallocHost(). That completely confuses me. Then I set the limit to unlimited, and I can still get only 4 GB. I’m wondering whether this memlock limit is really enforced by my OS.

Actually, I have also tried malloc() and mlock() to test the maximum amount of memory I can lock, which are the two commented-out lines in my code above. I find I can mlock() much more than 4 GB of memory. Is cudaMallocHost() supposed to do the same thing mlock() does at the OS level?

By the way, my CUDA toolkit version is 6.5.

Thanks,
Cui

Are there any ways I can tell whether the cudaMallocHost() call failed because of my OS or because of the CUDA library? I was testing my OS using malloc() and mlock(), which are the two commented-out lines in my code, and I can get much more than 4 GB of locked memory. Do you think cudaMallocHost() does the same thing as malloc() + mlock()?

Thanks,
Cui

I did another test: ulimit -l does affect the amount of memory I can mlock(), but it does not affect cudaMallocHost(). When I set ulimit -l back to 64, my mlock() fails on the first iteration, but cudaMallocHost() can still run 10 iterations, which is 4 GB of memory.

Thanks,
Cui

No, I don’t think the CUDA pinning operation actually uses mlock(). I don’t know for sure, but I suspect it is a different OS mechanism, as suggested here:

http://stackoverflow.com/questions/26888890/cuda-and-pinned-page-locked-memory-not-page-locked-at-all

What is the actual error string returned from the cudaMallocHost call when it fails in your case?

The error string is “unknown error”, and the error code is 30.

I tried running my program with strace, and here’s what I got:

mmap(0x204600000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x204600000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "1\n", 21
)                      = 2
mmap(0x21c380000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x21c380000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "2\n", 22
)                      = 2
mmap(0x234100000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x234100000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "3\n", 23
)                      = 2
mmap(0x24be80000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x24be80000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "4\n", 24
)                      = 2
mmap(0x263c00000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x263c00000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "5\n", 25
)                      = 2
mmap(0x27b980000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x27b980000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "6\n", 26
)                      = 2
mmap(0x293700000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x293700000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "7\n", 27
)                      = 2
mmap(0x2ab480000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2ab480000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "8\n", 28
)                      = 2
mmap(0x2c3200000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2c3200000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "9\n", 29
)                      = 2
mmap(0x2daf80000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2daf80000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "10\n", 310
)                     = 3
mmap(0x2f2d00000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2f2d00000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
mmap(0x2f2d00000, 400003072, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2f2d00000
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 17
fstat(17, {st_mode=S_IFREG|0444, st_size=2294, ...}) = 0
fstat(17, {st_mode=S_IFREG|0444, st_size=2294, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b215bdf5000
read(17, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0\0\0\0"..., 4096) = 2294
lseek(17, -1457, SEEK_CUR)              = 837
read(17, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0\0\0\0"..., 4096) = 1457
close(17)                               = 0
munmap(0x2b215bdf5000, 4096)            = 0
gettid()                                = 30703
write(2, "WARNING: Logging before InitGoog"..., 65WARNING: Logging before InitGoogleLogging() is written to STDERR
) = 65
write(2, "F1009 10:45:56.689123 30703 test"..., 101F1009 10:45:56.689123 30703 test.cu:49] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
) = 101

It seems that no mmap() or ioctl() calls return errors. But after the mmap() call of the 10th iteration, only one ioctl() is issued (previous iterations issue two).

Do you have any ideas?

Thanks,
Cui

I am reasonably sure that CUDA uses mmap() for cudaMallocHost(), as txbob points out. I was unable to remember this yesterday, but now that txbob has jogged my memory … https://devtalk.nvidia.com/default/topic/519635/cuda-programming-and-performance/gpu-utilization-drops-after-consecutive-executions/post/3692299/#3692299.

I think your best course of action is to check the Ubuntu documentation, or inquire on a Linux-specific forum about reasons mmap() could fail for sizes above 4 GB. I imagine it could be a kernel configuration issue, but I am not an OS expert. Checking old forum messages about cudaMallocHost(), I find that other CUDA users have been able to allocate up to 15 GB on a machine with 16 GB of system memory, so there is clearly no general 4 GB limit.

The 4 GB limit is curious because it represents the maximum size representable in a 32-bit data type. But your code above clearly shows you are specifying the size via a ‘size_t’ variable. Searching around the internet, I found a few isolated reports (unrelated to CUDA) of mmap() failing for sizes above 4 GB; in at least one such case the problem went away after a reboot and was not reproducible thereafter.

In practical terms, you may want to re-examine the need to lock vast stretches of system memory. My understanding is that this is usually frowned upon, as it excludes memory from the normal virtual memory mechanism of the operating system. Would it be possible to re-use pinned buffers in a double-buffering scheme, for example?

[Later:] I don’t see anything particularly suspicious in the output from strace, but I agree that the different ioctl() sequence at the point of failure could be a hint. While I think it is unlikely, I do not think a bug in the way CUDA uses the OS API calls can be excluded with certainty, as I have hazy memories of driver bugs that only manifested with very large system memory sizes. I assume you are already using the latest released driver package for your platform?

In my scenario, I have more data than can fit in GPU memory, so I have to keep some of it in CPU memory. I plan to put it all in pinned CPU memory, because transfers from pinned memory to GPU memory are faster. However, I find I cannot get more than 4 GB of pinned memory, which is even less than my GPU memory (5 GB).

I’m using CUDA toolkit 6.5. I guess it’s new enough?

Thanks,
Cui

Since no root cause has been established, it is impossible to say whether switching to a newer CUDA version and the latest driver would result in positive change. But given that you seem to be stuck, I would give that a try if I were you.

I still think this problem is more likely to be caused by something on the operating system side, and it may be Ubuntu-specific. My observation over many years is simply that most issues with CUDA on Linux seem to involve that particular Linux variant. Based on my limited understanding, at least some of the issues are due to the fact that the Ubuntu folks frequently like to do things a bit differently. For the record, I will state that I have been an RHEL user for many years, and thus may be biased in my perception. But I can’t even recall the last time I had an issue with RHEL, whether with CUDA or otherwise.

I just tried installing the latest driver and toolkit (version 7.5), and it still cannot allocate more than 4 GB. So it’s probably not a problem with the driver or the toolkit, and I guess it’s probably not a hardware problem either. What else is left? Is the OS the only possible factor now?

Thanks,
Cui

Are you using a vanilla Ubuntu kernel?

What is the result of:

uname -a

?

As I mentioned, perusing reports of failed mmap() calls with sizes above 4 GB, I found at least one stating that the issue was resolved by rebooting the machine and did not reproduce after that. It seems like a slim chance, but you may want to give rebooting a try before embarking on switching Linux versions. Make sure that whatever Linux variant you install is on the list of supported Linux distros for a particular CUDA release, or else all bets are off.

You may also want to seek assistance on an Ubuntu-specific forum: if the problem is not somehow isolated to your specific setup, others should have run into it. I do not have personal experience with systems with more than 32 GB of system memory.

Hi, sorry for coming back to this topic so late. You are right. My problem was solved when I switched to CentOS, though I still don’t know why it didn’t work on Ubuntu (whether I wasn’t configuring it correctly or it simply doesn’t work on Ubuntu).

Thanks,
Cui

I still hope to find out what causes this problem. As some people suggested, I split the cudaMallocHost() call into two calls: posix_memalign() and cudaHostRegister().

Here’s my new testing program:

#include <iostream>
#include <cstdlib>  // posix_memalign

#include <cuda.h>
#include <cuda_runtime.h>

using namespace std;

int main() {
  size_t count = 0;
  size_t size = 64 * 1024 * 1024 * sizeof(float);  // 256 MB per iteration
  while (true) {
    void *ptr;
    if (posix_memalign(&ptr, 4096, size) != 0) {
      cout << "posix_memalign failed at " << count * 256 << " MB" << endl;
      return 1;
    }
    cout << "ptr = " << ptr << endl;
    if (cudaHostRegister(ptr, size, 0) != cudaSuccess) {
      cout << "cudaHostRegister failed at " << count * 256 << " MB" << endl;
      return 1;
    }
    count++;
    cout << "Allocated " << count * 256 << " MB" << endl;
  }
}

The program always fails at the cudaHostRegister() call when the allocated memory approaches 4 GB. I’m using Ubuntu 14.04 with CUDA 7.5.

Do you have any ideas how cudaHostRegister() is implemented? I really hope I can write a simple program that emulates what cudaHostRegister() does without using CUDA APIs, so that I can find out whether this is an Ubuntu issue or a CUDA issue.

Thank you!
Cui