Invalid argument error when using cudaHostRegister: How to properly pin host memory for CUDA
Below is a minimal example that loads a large .npy data file into host RAM and attempts to pin it with cudaHostRegister.
This approach works correctly on my RTX 4060 system (CUDA 12.6 + Driver 560.35.03), but fails with an invalid argument error on the newer RTX 5090 system (CUDA 13.0 + Driver 580.65.06).
Why?
#include <complex>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cufft.h>
#include "cnpy.h"

#define CHECK_CUDA(call) do { cudaError_t e_ = (call); \
    if (e_ != cudaSuccess) { fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); exit(1); } } while (0)

int main() {
    // 1. load npy data to the host
    cnpy::NpyArray arr = cnpy::npy_load("data.npy");
    size_t total_size = 1;
    for (size_t dim : arr.shape) {
        total_size *= dim;
    }
    const std::complex<double>* complex_data = arr.data<std::complex<double>>();
    std::vector<cufftDoubleComplex> vec(total_size); // ~7 GB of data
    for (size_t i = 0; i < total_size; i++) {
        vec[i] = make_cuDoubleComplex(complex_data[i].real(), complex_data[i].imag());
    }
    // 2. pin the host buffer
    CHECK_CUDA(cudaHostRegister(vec.data(), total_size * sizeof(cufftDoubleComplex),
                                cudaHostRegisterPortable));
    return 0;
}
Two different systems may end up with different limits on how much memory can be pinned. The amount of installed host RAM, the operating system, the Linux kernel version, the GPU driver version, and possibly other factors can all influence this.
Through trial and error, I found that even though the host RAM is much larger than the file size, pinning 6.7 GB of data at once on the 5090 causes problems (while it works fine on the 4060). It might be necessary to use segmented cudaHostRegister calls to work around this limitation. Is my assumption reasonable?
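What I have in mind for segmented registration is roughly the sketch below. Only the chunk arithmetic is executable here; the actual CUDA calls appear as comments, and the helper name `chunk_spans` is my own. Chunk boundaries are aligned down to the page size, since pinning operates on whole pages:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, total_bytes) into (offset, length) spans of at most chunk_bytes,
// with every boundary except the last a multiple of page_bytes.
// Each span would then be registered individually, e.g.:
//   char* base = reinterpret_cast<char*>(vec.data());
//   for (auto [off, len] : spans)
//       CHECK_CUDA(cudaHostRegister(base + off, len, cudaHostRegisterPortable));
// and later released with cudaHostUnregister on the same base+off pointers.
std::vector<std::pair<size_t, size_t>>
chunk_spans(size_t total_bytes, size_t chunk_bytes, size_t page_bytes) {
    assert(chunk_bytes >= page_bytes);
    size_t step = chunk_bytes - chunk_bytes % page_bytes; // page-aligned stride
    std::vector<std::pair<size_t, size_t>> spans;
    for (size_t off = 0; off < total_bytes; off += step)
        spans.emplace_back(off, std::min(step, total_bytes - off));
    return spans;
}
```

A chunk size of, say, 1 GB would be an arbitrary starting point; whether any chunk size actually avoids the error on the 5090 is exactly what I am unsure about.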
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
printf("Device: %s, Compute Capability: %d.%d\n", prop.name, prop.major, prop.minor);
printf("Max pinned memory: %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
Device: NVIDIA GeForce RTX 5090, Compute Capability: 12.0
Max pinned memory: 32111 MB
In addition, this reports a maximum pinnable memory of about 32 GB, which is much larger than my file size (~7 GB), so I still have no idea what's going on.
It will probably require more trial and error to find the limiting factor. I would start by comparing the OS between the two machines. Are they identical (including e.g. uname -a output)? If not, that could be a factor. Next, I would move the 4060 machine to a CUDA and driver version identical to the 5090 machine's. Does that reduce the pinnable memory?
Even if you get through all that, it may or may not be useful. In general, there are no obvious user-facing controls to modify this behavior.
That doesn’t make any sense to me. Calling that device property “Max pinned memory” isn’t supported by any documentation I am aware of.