Pinned memory limit

Hi,

I find that when I allocate pinned memory using cudaMallocHost(), I can get only 4 GB, and I get an “unknown error” when I try to allocate more. My machine has 128 GB of physical memory (yes, 128 GB, and I can allocate that much using malloc()).

My GPU is a Tesla K20c, and I have verified via cudaGetDeviceProperties() that its compute capability is 3.5 (sm_35).

I think I should be able to allocate more pinned memory, and I don’t understand why it fails.

Here’s the cudaMallocHost code that I’m running:

#include <iostream>
#include <assert.h>

#include <glog/logging.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <sys/mman.h>

#define CUDA_CHECK(condition) \
  /* Code block avoids redefinition of cudaError_t error */ \
  do { \
    cudaError_t error = condition; \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)

using namespace std;

int main() {
  cudaDeviceProp deviceProp;
  int devID = 0;
  CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, devID));
  cout << "deviceProp.major = " << deviceProp.major << endl;
  cout << "deviceProp.minor = " << deviceProp.minor << endl;

  size_t count = 1000;
  size_t size = 100 * 1000 * 1000 * sizeof(float);
  for (size_t i = 0; i < count; i++) {
    cout << i << endl;
    void *host_array;
    CUDA_CHECK(cudaMallocHost(&host_array, size));
    // CHECK(host_array = malloc(size));
    // CHECK_EQ(mlock(host_array, size), 0);
  }
}

As its name hints, cudaMallocHost() is essentially a thin wrapper around your operating system’s API calls for allocating and pinning memory. The GPU in the system does not matter; what matters is the OS and any limits it may impose on allocating pinned memory. What operating system are you running? You may want to consult its documentation.

If you are on Linux, try running ulimit -l which should report the per-process limit on locked memory.

I have a cc 3.5 device on a RHEL 6.2 system with 64 GB of system memory and CUDA 7. Your code seems to run fine on that system up to about 160 iterations, at which point I think all of the system memory has been consumed by pinning. Each iteration should pin 400 MB, so that seems about right, if my arithmetic is correct.

I suspect the issue may be specific to your setup.

Also note that it’s recommended to use cudaHostAlloc instead of cudaMallocHost, but I don’t think that’s related to this issue (your code runs fine for me with cudaMallocHost).

Hi, I’m using Ubuntu 14.04. I have tried setting ulimit -l. Previously, the limit was 64 (memlock: max locked-in-memory address space, in KB), so it was only 64 KB, but I was still able to allocate 4 GB of pinned memory using cudaMallocHost(). That completely confuses me. Then I set the limit to unlimited, and I can still get only 4 GB. I’m wondering whether this memlock limit is really enforced by my OS.

Actually, I have also tried malloc() and mlock() to test the maximum amount of memory I can lock, which are the two commented-out lines in my code above. I find I can mlock() much more than 4 GB of memory. Is cudaMallocHost() supposed to do the same thing mlock() does at the OS level?

By the way, my CUDA toolkit version is 6.5.

Thanks,
Cui

Are there any ways I can tell whether the cudaMallocHost() call failed because of my OS or because of the CUDA library? I was testing my OS using malloc() and mlock(), which are the two commented-out lines in my code, and I can get much more than 4 GB of locked memory. Do you think cudaMallocHost() does the same thing as malloc() + mlock()?

Thanks,
Cui

I did another test: ulimit -l does affect the amount of memory I can mlock(), but it does not affect cudaMallocHost(). When I set ulimit -l back to 64, my mlock() fails on the first iteration, but cudaMallocHost() can still run 10 iterations, which is 4 GB of memory.

Thanks,
Cui

No, I don’t think the CUDA pinning operation actually uses mlock(). I don’t know for sure, but I suspect it is a different OS mechanism, as suggested here:

http://stackoverflow.com/questions/26888890/cuda-and-pinned-page-locked-memory-not-page-locked-at-all

What is the actual error string returned from the cudaMallocHost call when it fails in your case?

The error string is “unknown error”, and the error code is 30.

I tried running my program with strace, and here’s what I got:

mmap(0x204600000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x204600000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "1\n", 21
)                      = 2
mmap(0x21c380000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x21c380000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "2\n", 22
)                      = 2
mmap(0x234100000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x234100000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "3\n", 23
)                      = 2
mmap(0x24be80000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x24be80000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "4\n", 24
)                      = 2
mmap(0x263c00000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x263c00000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "5\n", 25
)                      = 2
mmap(0x27b980000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x27b980000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "6\n", 26
)                      = 2
mmap(0x293700000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x293700000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "7\n", 27
)                      = 2
mmap(0x2ab480000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2ab480000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "8\n", 28
)                      = 2
mmap(0x2c3200000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2c3200000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "9\n", 29
)                      = 2
mmap(0x2daf80000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2daf80000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
ioctl(3, 0xc0384657, 0x7fff17156d00)    = 0
write(1, "10\n", 310
)                     = 3
mmap(0x2f2d00000, 400003072, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2f2d00000
ioctl(5, 0xc0304627, 0x7fff17156c70)    = 0
mmap(0x2f2d00000, 400003072, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x2f2d00000
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 17
fstat(17, {st_mode=S_IFREG|0444, st_size=2294, ...}) = 0
fstat(17, {st_mode=S_IFREG|0444, st_size=2294, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b215bdf5000
read(17, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0\0\0\0"..., 4096) = 2294
lseek(17, -1457, SEEK_CUR)              = 837
read(17, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0\0\0\0"..., 4096) = 1457
close(17)                               = 0
munmap(0x2b215bdf5000, 4096)            = 0
gettid()                                = 30703
write(2, "WARNING: Logging before InitGoog"..., 65WARNING: Logging before InitGoogleLogging() is written to STDERR
) = 65
write(2, "F1009 10:45:56.689123 30703 test"..., 101F1009 10:45:56.689123 30703 test.cu:49] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
) = 101

It seems that no mmap() or ioctl() calls return errors. But after the mmap() call of the 10th iteration, only one ioctl() is issued (previous iterations issue two).

Do you have any ideas?

Thanks,
Cui

I am reasonably sure that CUDA uses mmap() for cudaMallocHost(), as txbob points out. I was unable to remember this yesterday, but now that txbob has jogged my memory … https://devtalk.nvidia.com/default/topic/519635/cuda-programming-and-performance/gpu-utilization-drops-after-consecutive-executions/post/3692299/#3692299.

I think your best course of action is to check the Ubuntu documentation, or inquire on a Linux-specific forum about reasons mmap() could fail for sizes above 4 GB. I imagine it could be a kernel configuration issue, but I am not an OS expert. Checking old forum messages about cudaMallocHost(), I find that other CUDA users have been able to allocate up to 15 GB on a machine with 16 GB of system memory, so there is clearly no general 4 GB limit.

The 4 GB limit is curious because it represents the maximum size representable in a 32-bit data type. But your code above clearly shows you are specifying the size via a ‘size_t’ variable. Searching around the internet, I found a few isolated reports (unrelated to CUDA) of mmap() failing for sizes above 4 GB; in at least one such case the problem went away after a reboot and was not reproducible thereafter.

In practical terms, you may want to re-examine the need to lock vast stretches of system memory. My understanding is that this is usually frowned upon, as it excludes memory from the normal virtual memory mechanism of the operating system. Would it be possible to re-use pinned buffers in a double-buffering scheme, for example?

[Later:] I don’t see anything particularly suspicious in the output from strace, but I agree that the different ioctl() sequence at the point of failure could be a hint. While I think it is unlikely, I do not think a bug in the way CUDA uses the OS API calls can be excluded with certainty, as I have hazy memories of driver bugs that only manifested with very large system memory sizes. I assume you are already using the latest released driver package for your platform?

In my scenario, I have more data than can fit in GPU memory, so I have to keep some of it in CPU memory. I plan to put it all in pinned CPU memory, because transfers from pinned memory to GPU memory are faster. However, I find I cannot get more than 4 GB of pinned memory, which is even less than my GPU memory (5 GB).

I’m using CUDA toolkit 6.5. I guess it’s new enough?

Thanks,
Cui

Since no root cause has been established, it is impossible to say whether switching to a newer CUDA version and the latest driver would result in positive change. But given that you seem to be stuck, I would give that a try if I were you.

I still think this problem is more likely to be caused by something on the operating system side, and it may be Ubuntu-specific. My observation over many years is simply that most issues with CUDA on Linux seem to involve that particular Linux variant. Based on my limited understanding, at least some of the issues are due to the fact that the Ubuntu folks frequently like to do things a bit differently. For the record, I will state that I have been an RHEL user for many years, and thus may be biased in my perception. But I can’t even recall the last time I had an issue with RHEL, whether with CUDA or otherwise.

I just tried installing the latest driver and toolkit (version 7.5), and it still cannot allocate more than 4 GB. So it’s probably not a problem with the driver or the toolkit, and I guess it’s probably not a hardware problem either. What else is left? Is the OS the only possible factor now?

Thanks,
Cui

Are you using a vanilla Ubuntu kernel?

What is the result of:

uname -a

?

As I mentioned, perusing reports of failed mmap() calls with sizes above 4 GB, I found at least one stating that the issue was resolved by rebooting the machine and did not reproduce after that. It seems like a slim chance, but you may want to give rebooting a try before embarking on switching Linux versions. Make sure that whatever Linux variant you install is on the list of supported Linux distros for a particular CUDA release, or else all bets are off.

You may also want to seek assistance on an Ubuntu-specific forum: if the problem is not somehow isolated to your specific setup, others should have run into it. I do not have personal experience with systems with more than 32 GB of system memory.

Hi, sorry for coming back to this topic so late. You are right. My problem was solved when I switched to CentOS, though I still don’t know why it didn’t work on Ubuntu (whether I wasn’t configuring it correctly or it simply doesn’t work on Ubuntu).

Thanks,
Cui

I still hope to find out what causes this problem. As some people suggested, I split the cudaMallocHost() call into two calls: posix_memalign() and cudaHostRegister().

Here’s my new testing program:

#include <iostream>
#include <cstdlib>  // posix_memalign

#include <cuda.h>
#include <cuda_runtime.h>

using namespace std;

int main() {
  size_t count = 0;
  size_t size = 64 * 1024 * 1024 * sizeof(float);  // 256 MB per iteration
  while (true) {
    void *ptr;
    if (posix_memalign(&ptr, 4096, size) != 0) {
      cout << "posix_memalign failed at " << count * 256 << " MB" << endl;
      return 1;
    }
    cout << "ptr = " << ptr << endl;
    if (cudaHostRegister(ptr, size, 0) != cudaSuccess) {
      cout << "cudaHostRegister failed at " << count * 256 << " MB" << endl;
      return 1;
    }
    count++;
    cout << "Allocated " << count * 256 << " MB" << endl;
  }
}

The program always fails at the cudaHostRegister() call when the allocated memory approaches 4 GB. I’m using Ubuntu 14.04 with CUDA 7.5.

Do you have any ideas how cudaHostRegister() is implemented? I really hope I can write a simple program that emulates what cudaHostRegister() does without using CUDA APIs, so that I can find out whether this is an Ubuntu issue or a CUDA issue.

Thank you!
Cui