Catching CUDA out-of-memory errors in an OpenACC program

Hi,

If the problem size gets too big and doesn’t fit on the GPU, I get an out-of-memory error from the CUDA runtime along with many lines of dumped text. In a commercial setting, I would like to catch this error and print a shorter, more useful error message for the customer.

What’s the best way of doing this in an OpenACC program? Does the CUDA runtime throw a standard C++ exception that I can catch with a try/catch block?
Obviously, it would be even better if the program could anticipate the memory requirements from the user’s setup and exit with a proper error message early in execution, i.e. well before it exhausts the GPU memory, but unfortunately we are not there yet.

Thanks,
LS

Hi LS,

What happens is that the PGI OpenACC runtime calls cudaMalloc. When the device is out of memory, cudaMalloc fails (it returns an error code instead of a valid allocation), and the PGI runtime then reports the error and exits. My guess is that it’s possible to have our runtime make a call-back which the parent program could then catch to handle the error. I added an RFE (TPR#25555) and will talk with our developers about whether we could add something like this. We might also want to standardize it as part of OpenACC so that it’s the same across all compiler vendors. Of course, something like this will take time.

For the immediate need, I’m wondering if you can compile your program using CUDA Unified Memory (-ta=tesla:managed)? With Pascal and Volta devices, CUDA Unified Memory can oversubscribe the GPU memory, so you won’t run out of memory. You’ll lose performance, but the program won’t crash. The caveat is that only dynamically allocated memory is managed.

There are also runtime calls to query the device and check how much total memory it has, as well as how much is currently free. You would still need to know your problem’s memory footprint, though, to tell whether it would fit.
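
As a rough illustration of such an up-front check, here is a minimal C sketch. It assumes a compiler that implements the OpenACC 2.6 routine acc_get_property with acc_property_free_memory; whether a given PGI release provides it is something to verify. The helper names (query_free_device_bytes, fits_on_device, require_device_memory) and the 90% headroom are made up for this example, and the required byte count has to come from your own estimate of the problem size.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#if defined(_OPENACC) && _OPENACC >= 201711
#include <openacc.h>
#endif

/* Hypothetical helper: free device memory in bytes, or 0 if it cannot be
 * queried (for example when not compiled with OpenACC 2.6 support). */
size_t query_free_device_bytes(void)
{
#if defined(_OPENACC) && _OPENACC >= 201711
    acc_device_t type = acc_get_device_type();
    int dev = acc_get_device_num(type);
    return acc_get_property(dev, type, acc_property_free_memory);
#else
    return 0;
#endif
}

/* Hypothetical helper: returns 1 if `required` bytes fit into the given
 * free memory when only `usable_fraction` (0 < f <= 1) of it may be used. */
int fits_on_device(size_t required, size_t free_bytes, double usable_fraction)
{
    if (usable_fraction <= 0.0 || usable_fraction > 1.0)
        return 0;
    return (double)required <= (double)free_bytes * usable_fraction;
}

/* Exit early with a short message if the estimate does not fit.
 * A result of 0 from the query is treated as "unknown" and skips the check. */
void require_device_memory(size_t required)
{
    size_t free_bytes = query_free_device_bytes();
    if (free_bytes != 0 && !fits_on_device(required, free_bytes, 0.9)) {
        fprintf(stderr,
                "Error: the problem needs about %zu bytes of GPU memory, "
                "but only %zu bytes are free.\n", required, free_bytes);
        exit(EXIT_FAILURE);
    }
}
```

A call such as require_device_memory(estimated_bytes) would go right after reading the user setup and before the first data region. The free-memory figure is only a snapshot, and the estimate itself is the hard part, so this complements rather than replaces handling the failure itself.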

-Mat

Hi Lutz,

After talking with one of our compiler engineers about this, it turns out that we already have a call-back mechanism that you can use to intercept OpenACC runtime errors. Here’s the prototype of the routine that registers the call-back:

 typedef void (*exitroutinetype)(char *);
 extern void acc_set_error_routine(exitroutinetype routine);

The application calls ‘acc_set_error_routine’ with the name of a callback routine. When the OpenACC runtime detects a runtime error, it invokes the registered routine. The string argument contains the error message.

Note: This is NOT error recovery. If the callback routine returns to the application, the behavior is decidedly undefined.
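
To make the usage concrete, here is a minimal C sketch built on the prototype above. The function names (summarize_acc_error, on_acc_error, install_acc_error_handler), the customer-facing messages, and the assumption that the runtime's text contains "out of memory" are all made up for illustration. The registration call is only compiled with a PGI or NVIDIA HPC compiler in OpenACC mode, since acc_set_error_routine is a PGI extension. Because returning from the callback is undefined, the handler ends by calling exit.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#if (defined(__PGI) || defined(__NVCOMPILER)) && defined(_OPENACC)
#define HAVE_ACC_ERROR_ROUTINE 1
/* Prototype as quoted in this thread. */
typedef void (*exitroutinetype)(char *);
extern void acc_set_error_routine(exitroutinetype routine);
#endif

/* Hypothetical helper: map the raw runtime message to a short customer
 * message. The "out of memory" wording is an assumption about the text. */
const char *summarize_acc_error(const char *runtime_msg)
{
    if (runtime_msg != NULL &&
        (strstr(runtime_msg, "Out of memory") != NULL ||
         strstr(runtime_msg, "out of memory") != NULL)) {
        return "The problem is too large for the available GPU memory.";
    }
    return "An unexpected GPU runtime error occurred.";
}

/* Callback with the signature required by acc_set_error_routine.
 * It must not return to the application, so it terminates the program. */
void on_acc_error(char *runtime_msg)
{
    fprintf(stderr, "Error: %s\n", summarize_acc_error(runtime_msg));
    exit(EXIT_FAILURE);
}

/* Call once at startup, before the first OpenACC region or data clause. */
void install_acc_error_handler(void)
{
#ifdef HAVE_ACC_ERROR_ROUTINE
    acc_set_error_routine(on_acc_error);
#endif
}
```

Keeping the message mapping in its own function means it can be tested without a GPU, and the full runtime string could still be written to a log file before exiting if support staff need it.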

-Mat

For reference, we’ve added documentation on acc_set_error_routine to our User’s Guide: https://www.pgroup.com/resources/docs/18.7/x86/pgi-user-guide/index.htm#openacc-error-handling