[CUDA8.0 BUG?] Child process forked after cuInit() gets CUDA_ERROR_NOT_INITIALIZED on cuInit()

Hello, I observed a regression when moving from CUDA 7.5 to 8.0.

Once a process calls cuInit(), any child process forked after that cuInit() gets a CUDA_ERROR_NOT_INITIALIZED error from its own cuInit(). This never happened on CUDA 7.5, but CUDA 8.0 always produces the error.
Has anyone else seen a similar problem?

Below is code to reproduce the problem:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>   /* wait(2) */
#include <cuda.h>

#define elog(FORMAT,...)                                \
    do {                                                \
        fprintf(stderr, FORMAT "\n", ##__VA_ARGS__);    \
        exit(1);                                        \
    } while(0)

static int child_proc(void)
{
    CUdevice    device;
    CUresult    rc;

    rc = cuInit(0);
    if (rc != CUDA_SUCCESS)
        elog("pid=%u failed on cuInit: %ld", getpid(), (long)rc);

    rc = cuDeviceGet(&device, 0);
    if (rc != CUDA_SUCCESS)
        elog("cuDeviceGet failed: %ld", (long)rc);

    return 0;
}

int main(int argc, char *argv[])
{
    CUresult    rc;
    pid_t       child;
    int         status;

    /* general initialization process */
    rc = cuInit(0);
    if (rc != CUDA_SUCCESS)
        elog("parent: failed on cuInit: %ld", (long)rc);

    /* connection accept, then fork a backend process */
    child = fork();
    if (child == 0)
        return child_proc();
    else if (child > 0)
        wait(&status);
    else
        elog("failed on fork(2): %m");

    return 0;
}

Execution example:

[kaigai@ayu ~]$ ./a.out
pid=10550 failed on cuInit: 3

This shows that cuInit() succeeds in the parent process but fails in the child process.
It does not mean the child process can simply skip cuInit(): the subsequent cuDeviceGet() fails even if I comment out the cuInit() on the child side.

This kind of CUDA usage is a very common scenario in server-type software, and it worked with the CUDA driver APIs on CUDA 7.5. What is the reason for this mysterious behavior?

Software versions:
CUDA installation: 8.0.44 (Linux; runfile)
NVIDIA driver: 367.55

No, it’s not usual usage for CUDA.

If you’re going to fork a process, the long-standing CUDA advice has been not to establish a CUDA context before the fork.

There are many references to this in a variety of materials.

For example, consider this comment in the CUDA simple IPC sample code:

// We can't initialize CUDA before fork() so we need to spawn a new process

This has never been proper CUDA behavior, and I won’t try to explain your observations on CUDA 7.5.

My code never constructs a CUDA context prior to fork(); it only calls cuInit(). Are you mixing up the two issues?

If this usage is really illegal, then a server process would have to launch an external program just to log the number of GPU devices at startup, or for other trivial tasks.
I don’t think that is a reasonable restriction.

Did you read the comment I quoted from NVIDIA engineers?

It says

“We can’t initialize CUDA before fork()”

So you should not run cuInit before a fork if you want access to CUDA in a process spawned by the fork.

You don’t have to launch an external program.

You just have to spawn a process to do what you want.

Take a look at the sample code I indicated, it does exactly that.
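To make that concrete, here is a minimal sketch of the reworked reproducer (my own illustration, not code from the sample): the parent performs no CUDA calls at all, and the forked child does its own cuInit():

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cuda.h>

static int child_proc(void)
{
    int      count;
    CUresult rc;

    /* the first driver API call in this process tree happens here,
     * AFTER fork(), so the child initializes CUDA from scratch */
    rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "pid=%u failed on cuInit: %ld\n",
                (unsigned)getpid(), (long)rc);
        return 1;
    }
    rc = cuDeviceGetCount(&count);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuDeviceGetCount failed: %ld\n", (long)rc);
        return 1;
    }
    printf("pid=%u sees %d device(s)\n", (unsigned)getpid(), count);
    return 0;
}

int main(void)
{
    int   status;
    pid_t child = fork();   /* fork BEFORE any CUDA call */

    if (child == 0)
        return child_proc();
    else if (child > 0)
        waitpid(child, &status, 0);
    else {
        perror("fork");
        return 1;
    }
    return 0;
}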


@Robert_Crovella: I believe this is impacting Tensorflow. See `Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error` at fork · Issue #57877 · tensorflow/tensorflow · GitHub

Well, a Tensorflow user is not directly running cuInit(). Instead, Tensorflow users run import tensorflow, which may call cuInit() as a side effect. Hopefully you can understand why users might need to import tensorflow in a parent process, and then also import tensorflow in a child process. AlphaZero does just this, as it spawns processes to play games against each other and spawns a learner process to observe the games and train the model.

I’m not aware of any change in behavior in CUDA in this regard. I’m definitely not a TF expert, but yes, I could imagine this issue impacting anyone using TF if they don’t follow the “rule”.

No, sorry, I don’t. You’re suggesting this is necessary:

    Parent Process (initializes CUDA)
       |                         |      
  child process1             child process 2

I don’t know why it cannot be refactored to:

    Parent Process (does not initialize CUDA)
       |                         |                         |
  child process1             child process 2          child process3
                                                   (initializes CUDA,
                                                    does whatever the parent
                                                    process would have done)

and just as you would, use IPC for whatever process communication is needed. In fact you said as much yourself when you said:

That should be fine. Have a parent process that does not initialize CUDA. That parent process spawns any number of game processes, and also spawns a learner process to observe the games.
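As a sketch of that refactoring (my own example; the worker roles and the pipe-based IPC are illustrative assumptions, not code from the thread), the parent below stays CUDA-free, forks each worker, and every worker initializes CUDA itself and reports back over a pipe:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cuda.h>

#define NUM_WORKERS 2

/* worker: owns its own CUDA initialization and reports over a pipe */
static int worker(int wfd, int id)
{
    char     msg[64];
    CUdevice dev;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS)
        snprintf(msg, sizeof(msg), "worker %d: CUDA init failed", id);
    else
        snprintf(msg, sizeof(msg), "worker %d: CUDA ready", id);

    if (write(wfd, msg, strlen(msg) + 1) < 0)
        perror("write");
    close(wfd);
    return 0;
}

int main(void)
{
    /* the parent never calls into the CUDA driver API */
    for (int i = 0; i < NUM_WORKERS; i++) {
        int fds[2];
        if (pipe(fds) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            close(fds[0]);
            return worker(fds[1], i);
        }
        close(fds[1]);

        char buf[64];
        if (read(fds[0], buf, sizeof(buf)) > 0)
            printf("parent got: %s\n", buf);
        close(fds[0]);
        wait(NULL);
    }
    return 0;
}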

Anyway we don’t need to litigate this here. It’s entirely possible that there are things I don’t understand. Furthermore, I am not in control of CUDA behavior. Anyone desiring to see a change in CUDA behavior is welcome to file a bug.

Parent Process (initializes CUDA): this main process alone consumes 108 MB of GPU memory.

////////////////////////////////////////////////////

Parent Process
|
child process1 (initializes CUDA): this child process (created by fork + execl) consumes 186 MB of GPU memory.

Why?

Separate processes (even parent and child) will create separate CUDA contexts, even on the same GPU.

As to why the size of a CUDA context varies from one situation to another: there is no specification or precise description of what a CUDA context contains or what affects its size. Therefore, variation is possible.
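There is no API that itemizes what a context contains, but one rough way to observe per-process overhead (a sketch of mine; note that cuMemGetInfo() only reports free memory once a context is current) is to create a context in each process and compare the free-memory figures, e.g. while other processes also hold contexts on the same GPU:

#include <stdio.h>
#include <cuda.h>

/* Create a context and report how much GPU memory is free.
 * Run several instances concurrently to see how each process's
 * own context reduces the free memory on the same GPU. */
int main(void)
{
    CUdevice  dev;
    CUcontext ctx;
    size_t    free_b, total_b;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) {
        fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }
    if (cuMemGetInfo(&free_b, &total_b) == CUDA_SUCCESS)
        printf("with one context held: %zu MB free of %zu MB\n",
               free_b >> 20, total_b >> 20);

    cuCtxDestroy(ctx);
    return 0;
}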