CUDA_OUT_OF_MEMORY despite large amounts of memory available

My application currently exhibits a problem where, on seemingly random occasions, CUDA_OUT_OF_MEMORY is returned by functions that, according to the documentation, do not return that error code.

The error occurs inconsistently, in roughly 10% of the runs of a larger application, and at various points in the computation. This makes it difficult to narrow down its source or provide a minimal example. However, I have logged the free-memory estimate reported by cuMemGetInfo before and after every allocation and free (a sketch of this kind of logging is shown below) and have gathered the following observations:

The first function to return the error code differs between runs (cuStreamCreate or cuEventRecord, for example), but the error is never returned by allocations or by operations that would be expected to require considerable amounts of memory. Note that cuEventRecord’s documentation, for instance, does not even list CUDA_OUT_OF_MEMORY as a potential return code.
When the error occurs, the most recent call to cuMemGetInfo, issued after the most recent allocation, estimates the free memory of the device to be on the order of a gigabyte, the lowest value observed so far being 742024807 bytes. A safety margin of this size is intended. Note that this should be the only application operating on that GPU at the time of the error, and that the GPU is in TCC mode. The cuMemGetInfo logs appear consistent with this: there is no indication of the free-memory estimate changing outside of allocations or frees. On the CPU side, there should also be an abundant amount of RAM available.
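For context, the logging amounts to a thin wrapper around each allocation and free. A minimal sketch of that idea (not my actual code; the function names are placeholders) looks roughly like this:

```cpp
// Minimal sketch of free-memory logging around driver API allocations.
#include <cuda.h>
#include <cstdio>

static void logFreeMem(const char *tag) {
    size_t freeB = 0, totalB = 0;
    CUresult r = cuMemGetInfo(&freeB, &totalB);
    if (r == CUDA_SUCCESS)
        std::printf("[%s] free = %zu bytes, total = %zu bytes\n", tag, freeB, totalB);
    else
        std::printf("[%s] cuMemGetInfo failed with code %d\n", tag, (int)r);
}

static CUresult loggedAlloc(CUdeviceptr *ptr, size_t bytes) {
    logFreeMem("before alloc");
    CUresult r = cuMemAlloc(ptr, bytes);  // the allocation being instrumented
    logFreeMem("after alloc");
    return r;
}
```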

This leaves me with the following three questions:
1. What are possible reasons for CUDA_OUT_OF_MEMORY being returned outside of allocations when large amounts of memory seem to be available?
2. How does it happen that cuEventRecord is the first function to return CUDA_OUT_OF_MEMORY, if that function is not intended to return that particular error code?
3. What options do I have to narrow down the source of the error in my particular case?

Are there any CUDA API calls with unchecked return values, especially ones that can return CUDA_ERROR_OUT_OF_MEMORY according to the documentation?

I have checked, and the only API calls whose return codes are not checked are cuDriverGetVersion during initialization, as well as cuGetErrorName and cuGetErrorString during error handling.
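For completeness, the checking is done with a wrapper of the usual kind. A simplified sketch (the macro name and the exit-on-error policy are placeholders, not my actual implementation):

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

/* Hypothetical checking macro; name and exit-on-error policy are placeholders. */
#define CU_CHECK(call)                                                         \
    do {                                                                       \
        CUresult result_ = (call);                                             \
        if (result_ != CUDA_SUCCESS) {                                         \
            const char *name_ = nullptr;                                       \
            const char *desc_ = nullptr;                                       \
            cuGetErrorName(result_, &name_);   /* return values not checked */ \
            cuGetErrorString(result_, &desc_); /* as described above        */ \
            std::fprintf(stderr, "%s failed: %s (%s)\n", #call,                \
                         name_ ? name_ : "?", desc_ ? desc_ : "?");            \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    } while (0)

/* usage: CU_CHECK(cuStreamCreate(&stream, CU_STREAM_DEFAULT)); */
```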

A CUDA out-of-memory error can be returned when a call requires establishing a device context on a device that you are “not using”, or have not yet established a context on, while some other user or process is using that device and has presumably consumed a large amount of its memory.

This particular issue can only occur in a multi-GPU scenario, and only when more than one user or process is using the GPUs.

An example/write-up is here.
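If you want to check whether this could apply to your system, one option is to probe the free memory on every device the process can see. A rough driver API sketch, written for illustration only:

```cpp
// Untested sketch: create a context on each visible device in turn and
// report how much memory is free there. Context creation itself may fail
// with CUDA_ERROR_OUT_OF_MEMORY if a device is already exhausted, which is
// informative in its own right.
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        CUcontext ctx;
        CUresult r = cuCtxCreate(&ctx, 0, dev);
        if (r != CUDA_SUCCESS) {
            const char *name = nullptr;
            cuGetErrorName(r, &name);
            std::printf("device %d: cuCtxCreate failed (%s)\n", i, name ? name : "?");
            continue;
        }
        size_t freeB = 0, totalB = 0;
        cuMemGetInfo(&freeB, &totalB);
        std::printf("device %d: %zu of %zu bytes free\n", i, freeB, totalB);
        cuCtxDestroy(ctx);
    }
    return 0;
}
```

If another device turns out to be nearly full, restricting the application to its own GPU, for example via the CUDA_VISIBLE_DEVICES environment variable, would be one way to test whether this theory applies.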

If I understand you correctly, the out-of-memory error stems not from the device I am observing and have allocated resources on, but from another GPU that actually is out of memory, and on which my process would implicitly try, and fail, to create a context.

In the scenario you linked, this occurred because the call to cudaMemcpy synchronized with both devices and therefore needed a context on each of them; the implicit context creation then ran into the out-of-memory error.

Is that a plausible scenario during a call to cuMemGetInfo, which does not synchronize? I am also using the driver API, which to my understanding does not create contexts implicitly.
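For reference, what I mean by explicit context handling is the usual driver API pattern; a generic sketch (not my application’s code) would be:

```cpp
// Generic driver API setup for a single device; no other device is touched.
#include <cuda.h>

int main() {
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);       // only the device the application intends to use

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);  // explicit creation; the new context becomes
                                // current on the calling thread

    size_t freeB = 0, totalB = 0;
    cuMemGetInfo(&freeB, &totalB);  // reports memory of the current context's device

    cuCtxDestroy(ctx);
    return 0;
}
```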

I don’t know, and wouldn’t know without writing a test case along the lines of the one I linked. Your previous description suggested the error happened at calls such as cuStreamCreate or cuEventRecord.

I don’t know exactly what is happening in your case.
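If you want to attempt such a test case, a rough, untested sketch for a multi-GPU machine could be along these lines; the program name, modes, and memory margin below are placeholders:

```cpp
// Untested sketch of a two-process experiment on a multi-GPU machine.
// Shell 1: ./oom_test fill 1   -- grab most of device 1's memory and hold it
// Shell 2: ./oom_test probe 0  -- exercise the failing calls on device 0
#include <cuda.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

static void report(CUresult r, const char *what) {
    const char *name = nullptr;
    cuGetErrorName(r, &name);
    std::printf("%s -> %s\n", what, name ? name : "unknown");
}

int main(int argc, char **argv) {
    if (argc < 3) {
        std::fprintf(stderr, "usage: %s fill|probe <device>\n", argv[0]);
        return 1;
    }
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, std::atoi(argv[2]));
    CUcontext ctx;
    report(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    if (std::strcmp(argv[1], "fill") == 0) {
        // Grab most of the device's free memory and hold it for a while.
        size_t freeB = 0, totalB = 0;
        cuMemGetInfo(&freeB, &totalB);
        size_t margin = 256u << 20;                              // leave ~256 MB free
        size_t bytes = freeB > margin ? freeB - margin : freeB / 2;
        CUdeviceptr p;
        report(cuMemAlloc(&p, bytes), "cuMemAlloc (fill)");
        std::this_thread::sleep_for(std::chrono::minutes(10));   // hold the allocation
    } else {
        // Exercise the calls that returned the error in the application.
        CUstream s;
        report(cuStreamCreate(&s, CU_STREAM_DEFAULT), "cuStreamCreate");
        CUevent e;
        report(cuEventCreate(&e, CU_EVENT_DEFAULT), "cuEventCreate");
        report(cuEventRecord(e, s), "cuEventRecord");
        size_t freeB = 0, totalB = 0;
        report(cuMemGetInfo(&freeB, &totalB), "cuMemGetInfo");
        std::printf("free = %zu, total = %zu\n", freeB, totalB);
    }
    return 0;
}
```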