In our device-side code we need to allocate a large amount of memory for computations that are not visible to the host. We allocate with device-side cudaMalloc, and on the host we call cudaDeviceSetLimit(cudaLimitMallocHeapSize, TotalDeviceMemorySize), where TotalDeviceMemorySize is obtained by calling cuMemGetInfo(&FreeDeviceMemorySize, &TotalDeviceMemorySize);
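Roughly, this is what we are doing (a minimal sketch; the kernel name, `bytesPerThread`, and sizing to the full card are just our illustration, not exact production code):

```cpp
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

// Sketch of a kernel that draws from the device heap.
// In-kernel malloc (and device-side cudaMalloc) are served from the
// heap whose size is set via cudaLimitMallocHeapSize.
__global__ void deviceAllocKernel(void **out, size_t bytesPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = malloc(bytesPerThread);  // may return NULL if the heap is exhausted
}

int main()
{
    size_t freeBytes = 0, totalBytes = 0;

    cuInit(0);                 // required before driver-API calls
    cudaFree(0);               // force runtime context creation
    cuMemGetInfo(&freeBytes, &totalBytes);

    // Try to reserve (nearly) the whole card for the device-side heap.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, totalBytes);
    printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));

    // Free memory reported here is unchanged by the call above.
    cuMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu / total: %zu\n", freeBytes, totalBytes);
    return 0;
}
```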
The cudaDeviceSetLimit call does not seem to allocate any device memory, as cuMemGetInfo still reports the same FreeDeviceMemorySize afterwards.
After calling cudaDeviceSetLimit, we can continue allocating from both the host side and the device side for a while, and kernel launches still succeed. But soon kernel launches start failing with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, even though cuMemGetInfo(&FreeDeviceMemorySize, &TotalDeviceMemorySize) still reports more than 1 GB free (we have not actually allocated much of anything yet).
How, then, should one allocate large amounts of memory from within kernels?