My application is intermittently failing with CUDA_ERROR_OUT_OF_MEMORY, returned at seemingly random points by functions that, according to the documentation, do not return that error code.
The error occurs inconsistently, in roughly 10% of the runs of a larger application and at varying points of the computation, which makes it difficult to narrow down its source or to provide a minimal example. However, I have logged the free-memory estimate reported by cuMemGetInfo before and after every allocation and free, and have gathered the following observations (a simplified sketch of the logging itself appears after them):
The first function to return the error differs between runs; examples include cuStreamCreate and cuEventRecord. However, the error is never returned by allocations or by operations that would be expected to require considerable amounts of memory. Note that cuEventRecord’s documentation, for example, does not even list CUDA_ERROR_OUT_OF_MEMORY as a possible return code.
When the error occurs, the most recent call to cuMemGetInfo, invoked after the most recent allocation, estimates the device’s free memory to be on the order of a gigabyte; the lowest value observed so far is 742024807 bytes (about 0.7 GB). A safety margin of this size is intentional. Note that this should be the only application using the GPU at the time of the error and that the GPU is in TCC mode. The cuMemGetInfo logs are consistent with this, as the free-memory estimate never appears to change outside of my own allocations and frees. On the host side, there should also be ample RAM available.
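For reference, the sketch below shows roughly how the per-call checking and the cuMemGetInfo logging are set up. It is a minimal, self-contained approximation, not the actual application code: the CHECK_CU macro, the loggedAlloc helper, and the fixed 256 MiB allocation are placeholders.

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified error check: on failure, print the offending call and the
   CUresult name, then abort.  The real application logs this instead. */
#define CHECK_CU(call)                                              \
    do {                                                            \
        CUresult err_ = (call);                                     \
        if (err_ != CUDA_SUCCESS) {                                 \
            const char *name_ = NULL;                               \
            cuGetErrorName(err_, &name_);                           \
            fprintf(stderr, "%s failed: %s\n", #call,               \
                    name_ ? name_ : "unknown CUresult");            \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Allocation wrapper that records the free-memory estimate before and
   after the call, mirroring the cuMemGetInfo logging described above. */
static void loggedAlloc(CUdeviceptr *ptr, size_t bytes)
{
    size_t freeBefore, freeAfter, total;
    CHECK_CU(cuMemGetInfo(&freeBefore, &total));
    CHECK_CU(cuMemAlloc(ptr, bytes));
    CHECK_CU(cuMemGetInfo(&freeAfter, &total));
    printf("alloc %zu bytes: free %zu -> %zu of %zu\n",
           bytes, freeBefore, freeAfter, total);
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUstream stream;
    CUevent event;
    CUdeviceptr buf;

    CHECK_CU(cuInit(0));
    CHECK_CU(cuDeviceGet(&dev, 0));
    CHECK_CU(cuCtxCreate(&ctx, 0, dev));

    loggedAlloc(&buf, (size_t)256 << 20);   /* placeholder 256 MiB buffer */

    /* Calls of this kind are the ones that unexpectedly report the error. */
    CHECK_CU(cuStreamCreate(&stream, CU_STREAM_DEFAULT));
    CHECK_CU(cuEventCreate(&event, CU_EVENT_DEFAULT));
    CHECK_CU(cuEventRecord(event, stream));
    CHECK_CU(cuStreamSynchronize(stream));

    CHECK_CU(cuEventDestroy(event));
    CHECK_CU(cuStreamDestroy(stream));
    CHECK_CU(cuMemFree(buf));
    CHECK_CU(cuCtxDestroy(ctx));
    return 0;
}
```

(The sketch uses only the driver API and links against -lcuda; no runtime API is involved.)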
This leaves me with the following three questions:
1. What are possible reasons for CUDA_ERROR_OUT_OF_MEMORY being returned outside of allocations when large amounts of memory appear to be available?
2. How can cuEventRecord be the first function to return CUDA_ERROR_OUT_OF_MEMORY if it is not documented to return that error code?
3. What options do I have to narrow down the source of the error in my particular case?