I set up two trivial OpenMP programs, each creating a context on the first device in every thread: the first version uses the driver interface, the second uses the runtime interface.
The nvidia-smi command reports that the driver version consumes about 40 MB per thread, while the runtime version consumes a steady 49 MB no matter how many threads are in play.
The inner code for each version is below.
Any ideas on why there’s a discrepancy?
The reason I ask is that we have an application that works fine using the driver initialization, yet fails when we use the runtime initialization. We want to use the runtime initialization to keep the memory footprint down.
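For reference, CU and CUDA are thin error-checking macros; roughly like the following sketch (our real ones log more detail):

// Assumed includes and error-checking helpers, shown for completeness
#include <cuda.h>
#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CU(call)   do { CUresult e_ = (call); if (e_ != CUDA_SUCCESS) { \
                        fprintf(stderr, "driver API error %d\n", (int)e_); exit(1); } } while (0)
#define CUDA(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
                        fprintf(stderr, "runtime API error: %s\n", cudaGetErrorString(e_)); exit(1); } } while (0)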
// Driver version: each thread creates its own context on device 0
CU(cuInit(0));
#pragma omp parallel for
for (int i = 0; i < omp_get_num_threads(); ++i) {  // one iteration per thread
    CUdevice dev;
    CU(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CU(cuCtxCreate(&ctx, 0, dev));       // new context, made current for this thread
    CUdeviceptr d;
    CU(cuMemAlloc(&d, sizeof(float)));   // small allocation to touch the context
}
// Runtime version: the context is created implicitly on first use
#pragma omp parallel for
for (int i = 0; i < omp_get_num_threads(); ++i) {  // one iteration per thread
    CUDA(cudaSetDevice(0));
    void *d;
    CUDA(cudaMalloc(&d, sizeof(float)));  // small allocation to force context creation
}