I ran into a very strange issue. cudaMallocPitch(&ptr, &pitch, 40000, 5000) suddenly started returning result code 999. I checked nvidia-smi and the memory usage on the GPU was low, about 10MB/8GB. I then restarted my machine and cudaMallocPitch started returning 0 again, as expected. The GPU is a 3070 Ti. What might have happened?
Gremlins? Kobolds? Leprechauns?
All joking aside, there is too little information to speculate intelligently. A common scenario behind “allocation failed although requested size is less than total available memory” is fragmentation of the free memory, but given the extremely low memory usage in this instance, it seems highly unlikely.
Error code 999 is cudaErrorUnknown. My guess is therefore that this could be caused by some sort of hibernation (sleep mode) where the previous state of the CUDA driver was not successfully restored in full upon waking. CUDA may detect some inconsistencies in its internal state, causing it to throw this error.
Gotcha. How would I “reset” the GPU so to speak without restarting the machine, to eliminate the possibility of things like hibernation, fragmentation, etc. causing this? I’m writing this code to run on a server, where I want it to be left unattended and able to recover itself to some extent.
In my limited experience, servers do not enter into any kind of sleep mode. The goal of server operators is to keep such expensive hardware well utilized, e.g. never below 25% utilization.
Complications from sleep mode are primarily an issue on laptops, which enter sleep modes to conserve energy. Note that CUDA error 999 being caused by issues when waking from sleep is just a guess on my part.
I don’t have detailed insights into sleep modes. As far as I know, modern systems provide multiple sleep modes of different “depth”. From my experience with Windows 10 (completely independent of any use of CUDA), there can be issues with waking that usually require a system reboot to fix. I have solved this on my own systems by disallowing sleep of any kind (sorry, I do not recall the relevant configuration settings for this). Whether issues with sleep/wake cycling are the fault of the hardware, OS, drivers, apps, or all of the above, I do not know.
Memory fragmentation is typically something that happens when a memory allocator (of any kind, e.g. CUDA, C++ runtime, operating system) runs over an extended period of time and many allocations plus de-allocations are performed. It is the latter part that can convert an allocator's memory map into Swiss cheese in the long run. It usually does not become an issue unless memory use nears the limit of capacity.
The first line of defense is to avoid frequent allocation / de-allocation, minimize the number of different sizes used in allocation requests, and never plan on using more than 80% of the total available memory.
If that is not doable, you may want to restart the app when memory allocations start failing. My expectation here is that long-running applications (think BOINC, Folding@Home, etc.) have a checkpoint mechanism that allows for restarts.
If that is not acceptable, you can have the app grab all required memory at the start of the app, then parcel out the memory using your own app-specific allocator. That could be a slab allocator, a memory pool, a ring buffer, really anything you desire that makes sense for the use case. Custom allocators can be developed in as little as a couple of days; I have written three or four over the years.