Multi-GPU parallel context creation using OpenMP

I’m working on a multi-GPU machine and thought I could create the CUDA contexts for all GPUs in parallel using OpenMP, with one host CPU thread per GPU, using the code below:

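#pragma omp parallel
{
    int i = omp_get_thread_num();
    cudaSetDevice(i); // Bind this host thread to GPU i (assumed; not shown in the original snippet)
    cudaFree(0);      // To trigger the context creation
}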

Threads start in parallel as expected. However, the time taken to create each context grows linearly with the number of GPUs being initialized, and it does so for every GPU. Initializing a single GPU takes roughly 200 ms. Initializing 2 GPUs (with 2 host CPU threads) takes 400 ms for each of them, with both starting and finishing at the same time. It takes 600 ms with 3 GPUs and 800 ms with all 4 GPUs, as profiled with nvvp.

Can anyone shed some light on what is happening here? I was expecting the 4 GPUs to initialize in parallel in about 200 ms. I’m using gcc-5.4.0 and cuda-8.0.61.

The CUDA runtime host code portions are not fully parallelizable in this way. There are various activities (such as reorganization of the process memory map) which cannot be done in parallel.
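To see how much of the initialization serializes on a given system without relying on the profiler, each thread's call can be timed directly. The following is only a minimal sketch: `time_parallel_init` is a hypothetical helper name, and the initialization step is passed in as a callback so that the harness itself has no CUDA dependency. If the per-device times grow with the number of devices, the calls are not overlapping.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not part of CUDA or OpenMP): calls init(dev) once per
// device, one OpenMP thread per device, and returns the wall-clock duration
// of each call in milliseconds.
std::vector<double> time_parallel_init(int num_devices,
                                       const std::function<void(int)>& init) {
    using clock = std::chrono::steady_clock;
    if (num_devices <= 0) {
        return {};
    }
    std::vector<double> ms(num_devices, 0.0);

    // schedule(static, 1) with num_threads(num_devices) gives each thread
    // exactly one device. Without OpenMP enabled the loop simply runs serially.
    #pragma omp parallel for num_threads(num_devices) schedule(static, 1)
    for (int dev = 0; dev < num_devices; ++dev) {
        const auto t0 = clock::now();
        init(dev);  // with CUDA: cudaSetDevice(dev); cudaFree(0);
        const auto t1 = clock::now();
        ms[dev] = std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return ms;
}
```

With CUDA, the callback would be a lambda that calls `cudaSetDevice(dev)` followed by `cudaFree(0)`, and the file would be built with something like `nvcc -Xcompiler -fopenmp`. Comparing the returned times for 1, 2, 3 and 4 devices should reproduce the 200/400/600/800 ms pattern described above if the serialization is present.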

Considering that my 4 GPUs will work fully independently, is there a way to parallelize the initialization?