cudaDeviceSetLimit on V100 w/32GB limited at ~17GB?

CUDA version 10.1
Driver version 418.87.01
Cuda compilation tools, release 10.1, V10.1.105

I’m trying to malloc a large table on the device.

If I call cudaDeviceSetLimit with more than 17GB on a V100 w/ 32GB, it seems to clamp the value to 17,681,179,680.

Example:

size_t setsz = 20LL*1024LL*1024LL*1024LL;  /* 20 GiB */
size_t getsz;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, setsz);
cudaDeviceGetLimit(&getsz, cudaLimitMallocHeapSize);
printf("requested %zu got %zu\n", setsz, getsz);  /* %zu is the portable size_t format */

requested 21474836480 got 17681179680

I’m guessing there’s something simple that I’m missing, but it’s totally escaping me…

Is the OS platform Windows by any chance? If so, are you running with the TCC driver? As far as I know there are 16 GB and 32 GB versions of the V100; I assume you have verified by other means that you have the 32 GB variant?

CentOS 7.7.1908
4.4.186-1.el7.elrepo.x86_64 kernel

nvidia-smi shows 32GB, and so does cudaMemGetInfo.

Is this directly after booting the system, with only a single user on the machine? Does nvidia-smi -q show other processes using the GPU besides your apps?

The machine has been up for days, with multiple users on and off. There are other cards in the host that are in use, but I’m using CUDA_VISIBLE_DEVICES, and according to nvidia-smi I’m the only one on this particular card.

I think there is probably an undocumented limit here that you are hitting.

My suggestion would be to file a bug using the information linked in a sticky post at the top of this sub-forum.

Done.

The explanation I received is as follows:

the documentation for cudaDeviceSetLimit:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g05956f16eaa47ef3a4efee84563ccb7d

says:

"The driver is free to modify the requested value to meet h/w requirements (this could be clamping to minimum or maximum values, rounding up to nearest element size, etc). The application can use cudaDeviceGetLimit() to find out exactly what the limit has been set to."

Therefore there is no guarantee that the request will be honored. No error code will be returned if it is not honored.

If you want confirmation, call cudaDeviceGetLimit immediately after the cudaDeviceSetLimit call and compare the result with what you requested.
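A minimal sketch of that check, assuming a single visible device. The check() helper is a hypothetical name introduced here for illustration; it is not part of the CUDA runtime. The point is that cudaDeviceSetLimit can return cudaSuccess even when the value was clamped, so the only reliable test is to read the limit back:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): abort with a message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const size_t requested = 20ULL * 1024 * 1024 * 1024;  // 20 GiB
    size_t granted = 0;

    // Set the limit early, before launching any kernel that calls device-side malloc.
    check(cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested), "cudaDeviceSetLimit");

    // Read the value back: the driver may have clamped or rounded it without an error.
    check(cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize), "cudaDeviceGetLimit");

    if (granted < requested) {
        fprintf(stderr, "heap limit clamped: requested %zu bytes, got %zu bytes\n",
                requested, granted);
        return EXIT_FAILURE;
    }
    printf("heap limit set to %zu bytes\n", granted);
    return EXIT_SUCCESS;
}
```

Treating a clamped value as a failure is a design choice for this sketch; a program that can live with a smaller heap could simply size its work from the granted value instead.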

And, FWIW, 17GB (actually just a shade over 16GB) is about the maximum for a 32GB Tesla V100, although that is an unpublished spec, subject to change, and I can’t give any technical details as to why it is that number.

Thanks. It still seems to me that it’s a bug if the device heap can’t be sized to reach the remaining roughly 15GB of memory. ;^)

For those who might be having the same issue, the workaround seems to be: instead of malloc-ing on the device side, cudaMalloc on the host side and pass the pointer to the kernel. This means more code/work on the user’s part (i.e. doing your own memory management out of the pool you create), but cudaMalloc can provide access to more than 17GB.
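A rough sketch of that workaround, assuming the simplest possible scheme: a bump allocator that hands out slices of one cudaMalloc'd region and never frees them individually. Pool, pool_alloc, and use_pool are hypothetical names made up for this example, and the pool is only 1 MiB to keep the demo small; a real table would size it from cudaMemGetInfo.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical pool descriptor: one large cudaMalloc'd region plus a bump counter.
struct Pool {
    unsigned char *base;          // start of the region (device pointer)
    unsigned long long capacity;  // size of the region in bytes
    unsigned long long *offset;   // device counter: bytes handed out so far
};

// Device-side bump allocation. Returns nullptr when the pool is exhausted.
// There is no per-allocation free; the host releases the whole pool at once.
__device__ void *pool_alloc(Pool pool, unsigned long long bytes)
{
    bytes = (bytes + 15ULL) & ~15ULL;  // round up so every slice stays 16-byte aligned
    unsigned long long old = atomicAdd(pool.offset, bytes);
    if (old + bytes > pool.capacity) return nullptr;
    return pool.base + old;
}

__global__ void use_pool(Pool pool, int *failures)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *p = static_cast<int *>(pool_alloc(pool, 4 * sizeof(int)));
    if (p == nullptr) { atomicAdd(failures, 1); return; }
    for (int i = 0; i < 4; ++i) p[i] = tid;
}

int main()
{
    Pool pool;
    pool.capacity = 1ULL << 20;  // 1 MiB for the demo only

    void *base = nullptr, *offset = nullptr, *failures = nullptr;
    CHECK(cudaMalloc(&base, pool.capacity));
    CHECK(cudaMalloc(&offset, sizeof(unsigned long long)));
    CHECK(cudaMalloc(&failures, sizeof(int)));
    CHECK(cudaMemset(offset, 0, sizeof(unsigned long long)));
    CHECK(cudaMemset(failures, 0, sizeof(int)));
    pool.base = static_cast<unsigned char *>(base);
    pool.offset = static_cast<unsigned long long *>(offset);

    use_pool<<<64, 128>>>(pool, static_cast<int *>(failures));
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    unsigned long long used = 0;
    int failed = 0;
    CHECK(cudaMemcpy(&used, offset, sizeof(used), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&failed, failures, sizeof(failed), cudaMemcpyDeviceToHost));
    printf("pool bytes handed out: %llu, failed allocations: %d\n", used, failed);

    CHECK(cudaFree(failures));
    CHECK(cudaFree(offset));
    CHECK(cudaFree(base));
    return 0;
}
```

A bump allocator is only one option. If threads need to free and reuse memory, the pool needs a real free list or a fixed-size block scheme, which is the extra work mentioned above.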

Furthermore, a host-side cudaMalloc creates an allocation that is accessible from the host-side API (e.g. cudaMemcpy). Device-side in-kernel new/malloc allocations, which come from the device heap (the limit being discussed here), are not accessible to the host-side API.

The developer looked at it and said it’s expected behavior. I don’t have any details beyond that.
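One way to live with that restriction, sketched here under the assumption that the data really does have to be built on the device heap: have the kernel copy the result into a cudaMalloc'd staging buffer, and let the host cudaMemcpy from the staging buffer only. build_and_export and staging are hypothetical names for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// A single thread builds a small table on the device heap, then copies it into
// a staging buffer that was allocated by the host with cudaMalloc.
__global__ void build_and_export(int *staging, int n)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;

    int *scratch = static_cast<int *>(malloc(n * sizeof(int)));  // device heap
    if (scratch == nullptr) return;  // staging keeps its sentinel values

    for (int i = 0; i < n; ++i) scratch[i] = i * i;

    memcpy(staging, scratch, n * sizeof(int));  // device-side copy, heap -> staging
    free(scratch);                              // heap memory is freed on the device
}

int main()
{
    const int n = 8;
    int host[n];
    int *staging = nullptr;

    if (cudaMalloc(&staging, n * sizeof(int)) != cudaSuccess) return EXIT_FAILURE;
    // Fill with 0xFF bytes so every int reads as -1 if the kernel bailed out.
    if (cudaMemset(staging, 0xFF, n * sizeof(int)) != cudaSuccess) return EXIT_FAILURE;

    build_and_export<<<1, 1>>>(staging, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return EXIT_FAILURE;

    // The host only ever touches the cudaMalloc'd buffer, never the heap pointer.
    if (cudaMemcpy(host, staging, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        return EXIT_FAILURE;

    for (int i = 0; i < n; ++i) printf("%d ", host[i]);
    printf("\n");

    cudaFree(staging);
    return EXIT_SUCCESS;
}
```

The extra copy costs device bandwidth, so for large tables it is usually simpler to allocate from a cudaMalloc'd pool in the first place.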