OpenMP: driver and runtime APIs have very different memory consumption

I set up two trivial OpenMP programs, each creating a context on the first device: the first version uses the driver interface, the second uses the runtime interface.

The nvidia-smi command reports that the driver version consumes about 40 MB per thread, while the runtime version holds steady at about 49 MB no matter how many threads are in play.

The inner code for each version is below.

Any ideas on why there’s a discrepancy?

The reason I ask is that we have an application that works fine using the driver initialization, yet fails when we use the runtime initialization. We want to use the runtime initialization to keep the memory footprint down.

// Driver version -- CU() is an error-checking macro wrapping driver API calls
    #pragma omp parallel for
    for (int i = 0; i < omp_get_num_threads(); ++i) {
        CUdevice dev;
        CU(cuDeviceGet(&dev, 0));
        CUcontext ctx;
        CU(cuCtxCreate(&ctx, 0, dev));   // creates a new context per thread
        CUdeviceptr d;
        CU(cuMemAlloc(&d, sizeof(float)));
    }

// Runtime version -- CUDA() is an error-checking macro wrapping runtime API calls
    #pragma omp parallel for
    for (int i = 0; i < omp_get_num_threads(); ++i) {
        void *d;
        CUDA(cudaMalloc(&d, sizeof(float)));   // implicitly initializes the runtime
    }

Answering my own question: it turns out that, for CUDA 4.0 and up, the runtime API shares a single context across all host threads, whereas with the driver API I was creating a new context per thread, each requiring additional memory.
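For anyone wanting to stay on the driver API while matching the runtime's memory behavior, one option is to create a single context once and bind it to each OpenMP thread with cuCtxSetCurrent (available since CUDA 4.0). This is a minimal sketch, not my production code; the CU() error-check macro here is a stand-in for whatever checking you already use:

```c
#include <stdlib.h>
#include <cuda.h>
#include <omp.h>

// CU(): hypothetical error-check macro, standing in for the one used above.
#define CU(call) do { if ((call) != CUDA_SUCCESS) abort(); } while (0)

int main(void) {
    CU(cuInit(0));
    CUdevice dev;
    CU(cuDeviceGet(&dev, 0));

    // One context for the whole process, created once on the main thread.
    CUcontext ctx;
    CU(cuCtxCreate(&ctx, 0, dev));

    #pragma omp parallel
    {
        // Bind the shared context to this thread instead of creating a new one,
        // so per-thread memory overhead stays flat as the thread count grows.
        CU(cuCtxSetCurrent(ctx));
        CUdeviceptr d;
        CU(cuMemAlloc(&d, sizeof(float)));
        CU(cuMemFree(d));
    }

    CU(cuCtxDestroy(ctx));
    return 0;
}
```

With this pattern, nvidia-smi should report roughly the same footprint as the runtime version, since only one context ever exists.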