[CUDA8.0 BUG?] Child process forked after cuInit() gets CUDA_ERROR_NOT_INITIALIZED on cuInit()

Hello, I observed a regression when moving from CUDA 7.5 to 8.0.

Once a process calls cuInit(), any child process forked after that cuInit() gets a CUDA_ERROR_NOT_INITIALIZED error from its own cuInit(). This never happened on CUDA 7.5, but CUDA 8.0 always produces the error.
Has anyone else seen a similar problem?

Below is code to reproduce the problem:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>   /* wait(2) */
#include <cuda.h>

#define elog(FORMAT,...)                                \
    do {                                                \
        fprintf(stderr, FORMAT "\n", ##__VA_ARGS__);    \
        exit(1);                                        \
    } while(0)

static int child_proc(void)
{
    CUdevice    device;
    CUresult    rc;

    rc = cuInit(0);
    if (rc != CUDA_SUCCESS)
        elog("pid=%u failed on cuInit: %ld", getpid(), (long)rc);

    rc = cuDeviceGet(&device, 0);
    if (rc != CUDA_SUCCESS)
        elog("cuDeviceGet failed: %ld", (long)rc);

    return 0;
}

int main(int argc, char *argv[])
{
    CUresult    rc;
    pid_t       child;
    int         status;

    /* general initialization process */
    rc = cuInit(0);
    if (rc != CUDA_SUCCESS)
        elog("parent: failed on cuInit: %ld", (long)rc);

    /* connection accept, then fork a backend process */
    child = fork();
    if (child == 0)
        return child_proc();
    else if (child > 0)
        wait(&status);
    else
        elog("failed on fork(2): %m");

    return 0;
}

Execution example:

[kaigai@ayu ~]$ ./a.out
pid=10550 failed on cuInit: 3

This shows that cuInit() succeeds in the parent process but fails in the child process.
It does not mean the child process can simply skip cuInit(): the subsequent cuDeviceGet() fails even if I comment out the cuInit() on the child side.

This kind of CUDA usage is a very common scenario in server-type software, and it worked with the CUDA driver APIs on CUDA 7.5. What is the reason for this mysterious behavior?

Software versions:
CUDA installation: 8.0.44 (Linux; runfile)
NVIDIA driver: 367.55

No, it’s not usual usage for CUDA.

If you’re going to fork a process, the long-standing CUDA advice has been not to establish a CUDA context before the fork.

There are many references to this in a variety of materials.

For example, consider this comment in the CUDA simple IPC sample code:

// We can't initialize CUDA before fork() so we need to spawn a new process

This has never been proper CUDA behavior, and I won’t try to explain your observations on CUDA 7.5.

My code never constructs a CUDA context prior to fork(); it only calls cuInit(). Are you mixing up the two issues?

If this usage is really illegal, then a server process would have to launch an external program just to log the number of GPU devices at startup, or for other trivial tasks.
I don’t think that is a reasonable restriction.

Did you read the comment I quoted from NVIDIA engineers?

It says

“We can’t initialize CUDA before fork()”

So you should not run cuInit before a fork if you want access to CUDA in a process spawned by the fork.

You don’t have to launch an external program.

You just have to spawn a process to do what you want.

Take a look at the sample code I indicated, it does exactly that.
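To make that concrete, here is a minimal sketch of the reworked reproducer (my own illustration, not code from the sample): the parent performs no CUDA calls at all, and the forked child does its own cuInit():

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cuda.h>

static int child_proc(void)
{
    int      count;
    CUresult rc;

    /* the first driver API call in this process tree happens here,
     * AFTER fork(), so the child initializes CUDA from scratch */
    rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "pid=%u failed on cuInit: %ld\n",
                (unsigned)getpid(), (long)rc);
        return 1;
    }
    rc = cuDeviceGetCount(&count);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuDeviceGetCount failed: %ld\n", (long)rc);
        return 1;
    }
    printf("pid=%u sees %d device(s)\n", (unsigned)getpid(), count);
    return 0;
}

int main(void)
{
    int   status;
    pid_t child = fork();   /* fork BEFORE any CUDA call */

    if (child == 0)
        return child_proc();
    else if (child > 0)
        waitpid(child, &status, 0);
    else {
        perror("fork");
        return 1;
    }
    return 0;
}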


@Robert_Crovella: I believe this is impacting Tensorflow. See `Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error` at fork · Issue #57877 · tensorflow/tensorflow · GitHub

Well, a Tensorflow user is not directly running cuInit(). Instead, Tensorflow users run import tensorflow, which may call cuInit() as a side effect. Hopefully you can understand why users might need to import tensorflow in a parent process, and then also import tensorflow in a child process. AlphaZero does just this, as it spawns processes to play games against each other and spawns a learner process to observe the games and train the model.

I’m not aware of any change in behavior in CUDA in this regard. I’m definitely not a TF expert, but yes, I could imagine this issue impacting anyone using TF if they don’t follow the “rule”.

No, sorry, I don’t. You’re suggesting this is necessary:

    Parent Process (initializes CUDA)
       |                         |      
  child process1             child process 2

I don’t know why it cannot be refactored to:

    Parent Process (does not initialize CUDA)
       |                         |                         |
  child process1             child process 2          child process3
                                                   (initializes CUDA,
                                                    does whatever the parent
                                                    process would have done)

and just as you would, use IPC for whatever process communication is needed. In fact you said as much yourself when you said:

That should be fine. Have a parent process that does not initialize CUDA. That parent process spawns any number of game processes, and also spawns a learner process to observe the games.
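As a sketch of that refactoring (my own example; the worker roles and the pipe-based IPC are illustrative assumptions, not code from the thread), the parent below stays CUDA-free, forks each worker, and every worker initializes CUDA itself and reports back over a pipe:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cuda.h>

#define NUM_WORKERS 2

/* worker: owns its own CUDA initialization and reports over a pipe */
static int worker(int wfd, int id)
{
    char     msg[64];
    CUdevice dev;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS)
        snprintf(msg, sizeof(msg), "worker %d: CUDA init failed", id);
    else
        snprintf(msg, sizeof(msg), "worker %d: CUDA ready", id);

    if (write(wfd, msg, strlen(msg) + 1) < 0)
        perror("write");
    close(wfd);
    return 0;
}

int main(void)
{
    /* the parent never calls into the CUDA driver API */
    for (int i = 0; i < NUM_WORKERS; i++) {
        int fds[2];
        if (pipe(fds) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            close(fds[0]);
            return worker(fds[1], i);
        }
        close(fds[1]);

        char buf[64];
        if (read(fds[0], buf, sizeof(buf)) > 0)
            printf("parent got: %s\n", buf);
        close(fds[0]);
        wait(NULL);
    }
    return 0;
}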

Anyway we don’t need to litigate this here. It’s entirely possible that there are things I don’t understand. Furthermore, I am not in control of CUDA behavior. Anyone desiring to see a change in CUDA behavior is welcome to file a bug.

Parent Process (initializes CUDA): this main process alone consumes 108 MB of GPU memory.

////////////////////////////////////////////////////

Parent Process
|
child process1 (initializes CUDA): this child process (created by fork + execl) consumes 186 MB of GPU memory.

Why?

Separate processes (even parent and child) will create separate CUDA contexts, even on the same GPU.

As to why the size of a CUDA context varies from one situation to another: there is no specification or precise description of what a CUDA context contains or what affects its size. Therefore, variation is possible.
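There is no API that itemizes what a context contains, but one rough way to observe per-process overhead (a sketch of mine; note that cuMemGetInfo() only reports free memory once a context is current) is to create a context in each process and compare the free-memory figures, e.g. while other processes also hold contexts on the same GPU:

#include <stdio.h>
#include <cuda.h>

/* Create a context and report how much GPU memory is free.
 * Run several instances concurrently to see how each process's
 * own context reduces the free memory on the same GPU. */
int main(void)
{
    CUdevice  dev;
    CUcontext ctx;
    size_t    free_b, total_b;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) {
        fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }
    if (cuMemGetInfo(&free_b, &total_b) == CUDA_SUCCESS)
        printf("with one context held: %zu MB free of %zu MB\n",
               free_b >> 20, total_b >> 20);

    cuCtxDestroy(ctx);
    return 0;
}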