Use of driver API in DLLs causes a hang on exit on some configurations


I’m facing a strange problem where a CUDA program, that uses driver APIs, when built as a DLL, causes an executable that links to the DLL to hang after it exits main(). The process then has to be killed from the task manager.

The same code when built as an executable works fine. Also, this problem occurs only on certain configurations (details later in the post).

DLL.cpp: This file is compiled and built into a DLL.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CUDA_SAFE_CALL(x)                                         \
  do {                                                            \
    CUresult result = x;                                          \
    if (result != CUDA_SUCCESS) {                                 \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      fprintf(stderr, "CUDA: %s\n", msg);                         \
    }                                                             \
   } while(0)

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>
#include <cuda.h>

__declspec(dllexport) int __cdecl
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));

    // Free device global memory
    //err = cudaFree(d_A);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));

    printf("Hello world\n");
    return 0;

main.cpp: This file is built as an executable that uses the DLL above.

int mainEntry(void);

int main()

The executable hangs at exit (i.e., after printing “Hello world”) on the following configuration:
Windows 7 Professional
NVIDIA driver 369.3
CUDA Toolkit 8.0
GeForce GTX 980ti / GTX 960 / GTX 560Ti

This looks to be a CUDA driver bug and any workaround suggestions would be very helpful.


  • Vaivaswatha.

Update to CUDA 8.0.61 and a driver that is appropriate for 8.0.61. 369.3 is no longer a correct driver to use with CUDA 8, depending on the version of CUDA 8.

Why are you mixing CUDA runtime API functions:

err = cudaMalloc((void **)&d_A, size);

with CUDA driver API functions:


You are not meeting the proper requirements for runtime API/driver API interoperability as defined here:

Thank you for the response. I’ll try the driver update.

What exactly in the interoperability requirement am I missing?

I have a call to cudaMalloc() before the driver API usage which will setup a Context

Since there now is a default context setup, using the driver API cuMemFree should work.

One aspect I just realised could be related to the requirement of pushing and popping contexts from library code.

Could that be it? Anyway I’ll give that a try.

A normal driver API code will establish a context and make it current, before invoking further driver API functions that are intended to refer to that context.

In the interoperability section, I read the following:

“If the runtime is initialized (implicitly as mentioned in CUDA C Runtime), cuCtxGetCurrent() can be used to retrieve the context created during initialization. This context can be used by subsequent driver API calls.”

So in this case, since you are implicitly establishing a context via the runtime API (cuInit does not establish a context, cudaMalloc does, implicitly), I would expect to see the retrieval of that context and making it current for the driver API, before invoking driver API calls that refer to that context. The example given in the quoted text is the use of cuCtxGetCurrent(). I don’t see that call, or any driver API context calls, in your posted code.

I do not call cuCtxGetCurrent() because I did not have the need to specify the context to the driver API calls I made. In other words, I am okay with the default context that is already set by the runtime API.

Do you mean to say that I must call cuCtxGetCurrent() and use the result to set cuCtxSetCurrent()? The driver API is expected to just use the current context automatically right, as long as it has been created.

Also, In my code above, even if I just use cudaFree instead of cuMemFree, just the presence of cuInit() is sufficient to cause the hang.

It looks like I have a solution to the problem.

Based on this I created a new context and pushed it on the stack before executing GPU code in the DLL (and later popped it back after the work was done). That fixed it for me.

Thank you @txbob.