Terminating a multi-threaded CUDA program that uses cuSOLVER : exit() Vs _exit()

karthikeyan_d · January 21, 2022, 10:35am

I am writing a multi-threaded progrm using cuSOLVER APIs like cusolverDnSgetrf(), cusolverDnSgetrs() etc, on Linux. The main thread of the program creates one child POSIX thread per GPU, which in turn use the cuSOLVER APIs and execute independently. All threads are host threads.

There are three types of termination possible :

T1). Normal termination. The main thread waits for the per-GPU child threads to
finish and then calls exit().
T2). Timed termination. If the program is given an execution time, the main
thread sleeps for that time, wakes up and calls exit().
T3). Ctrl-c termination. The program has a handler for SIGINT which calls exit().

There are no problems with T1. However, there are failures with T2 and T3 if exit() is used. No problems seen, so far, if _exit() is used.

Different failures are seen with exit() in T2 and T3 in different executions :

cudaDeviceSynchronize() fails with error code 4 (cudaErrorCudartUnloading)
cudaMallocManaged() fails with error code 4
cuSOLVER initialization fails with error code 7

In both T2 and T3, the main thread does not inform the child threads about the termination.

I have the following questions :

Q1). Why are failures seen with exit() but not with _exit()?
I am guessing that CUDA is registering some exit handlers which are executed as part of exit() and not with _exit(). I am not sure how that leads to failures. Perhaps the exit handlers are deleting stuff that the GPU code is using. But then, _exit() too will stop the host code while the GPU code is still running.

Q2). Is it safe to use _exit() instead of exit()?
Are there any issues with using _exit()? For example, will a CUDA kernel launched by the process will be left running when the processor terminates using _exit() or does the CUDA runtime (or the driver) ensure that all kernels associated with a terminating process are also terminated?

Q3). Is it mandatory/better for the main thread to ask the child threads to stop before terminating the process?
I have tried a method in which the main thread sends a signal to the child threads, which catch it and execute pthread_exit(). No failure seen with this, so far. Still under testing.

Thanks
Karthik

Robert_Crovella · January 21, 2022, 3:11pm

The difference between exit and _exit is documented. The error 4 (cudartErrorUnloading) is a result of something calling a cuda runtime API call. It seems evident that CUDA registers exit handlers, which are called with exit() but not _exit() as documented. Therefore, since no cuda runtime API calls are occurring upon the call of _exit(), no error 4 occurs.

Termination of the owning process should be sufficient for all CUDA resources used by that process to be cleaned up . This is the behavior of the “CUDA runtime”, i.e. that entity that manages the GPUs, and that you interact with when you make CUDA runtime and driver API calls. Beyond that I cannot comment on “safety” or other question you may have.

karthikeyan_d · January 22, 2022, 1:58am

This certainly explains why the failures are not seen with _exit(). However, it does not explain why the failures happen with exit(), especially when the following is said :

Termination of the owning process should be sufficient for all CUDA resources used by that process to be cleaned up .

So, calling exit() should also be fine, right? Since the cuda exit handlers will be called in this, I assumed that they will take care of stopping kernels running on the CPU and freeing up other resources too. But I am unable to understand why failures happen with exit().

Thanks

Robert_Crovella · January 22, 2022, 3:28am

Things are happening at the same time. CUDA spins up its own threads of activity. If the CUDA context gets destroyed while there is still CUDA teardown going on, you’ll see that error.

You can see the error yourself by calling a destructor with CUDA calls in it from global scope. CUDA begins teardown at the time main gets exited. Then destructors are called after that, and the destructors attempt to use the CUDA runtime API while CUDA is being torn down.