Allocating memory from device and cudaLimitMallocHeapSize

From our device-side code we need to allocate a large amount of memory for device-side computations that are not visible to the host. We allocate with cudaMalloc in device code, and on the host we call cudaDeviceSetLimit(cudaLimitMallocHeapSize, TotalDeviceMemorySize), where TotalDeviceMemorySize is obtained by calling cuMemGetInfo(&FreeDeviceMemorySize, &TotalDeviceMemorySize).
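For reference, a minimal sketch of this host-side setup might look like the following (same variable names as above; a real application would check every return code, and requesting the full device memory as the heap size is shown only to mirror the setup described here):

```cuda
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    size_t FreeDeviceMemorySize = 0, TotalDeviceMemorySize = 0;

    // Establish the primary context so the driver API call below is valid.
    cudaFree(0);

    // Query free/total device memory via the driver API.
    cuMemGetInfo(&FreeDeviceMemorySize, &TotalDeviceMemorySize);

    // Ask the runtime to size the device heap used by in-kernel malloc/new.
    // (Requesting the entire device memory, as in the question, is likely
    // to fail or starve other allocations.)
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         TotalDeviceMemorySize);
    printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
    return 0;
}
```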

cudaDeviceSetLimit does not seem to allocate any device memory: cuMemGetInfo still reports the same FreeDeviceMemorySize afterwards.

After calling cudaDeviceSetLimit, we can continue allocating from both the host side and the device side for a while, and kernel launches still succeed. But soon the kernel launches start failing with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, even though cuMemGetInfo(&FreeDeviceMemorySize, &TotalDeviceMemorySize) still reports more than 1 GB free (we have not really allocated anything yet).

How, then, should one allocate large amounts of memory from kernels?


cudaDeviceSetLimit does not allocate any device memory, and cudaMemGetInfo does not report how much of the device heap is still available.

A solid indication that you have run out of heap space is a device-side allocation returning a null pointer. I suggest, at least for debugging purposes, that you check for this and trap it, perhaps with a device-side assert, until you get things sorted out.

If you get a null pointer back (from either device malloc or device new), you have run out of heap space.
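As a sketch of that check, a hypothetical kernel could trap a failed device-side allocation like this (the heap size and allocation size are arbitrary example values; note that assert is compiled out if NDEBUG is defined):

```cuda
#include <assert.h>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void allocate_kernel(size_t bytes) {
    // Device-side malloc draws from the heap sized on the host by
    // cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).
    char *p = (char *)malloc(bytes);

    // A null return means the device heap is exhausted; trap it while
    // debugging so the failure point is obvious.
    assert(p != NULL);

    if (p) {
        p[0] = 42;   // touch the allocation
        free(p);     // device-allocated memory must be freed in device code
    }
}

int main() {
    // Enlarge the device heap before launching (128 MB example value).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

    allocate_kernel<<<1, 1>>>(64 * 1024 * 1024);
    printf("kernel status: %s\n",
           cudaGetErrorString(cudaDeviceSynchronize()));
    return 0;
}
```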