I’m working on a multi-GPU system, and I thought I could create the CUDA contexts for each GPU in parallel using OpenMP, with as many host CPU threads as there are GPU devices, using the code below:
#pragma omp parallel
{
    int i = omp_get_thread_num(); // one host thread per device
    cudaSetDevice(i);
    cudaFree(0); // dummy call to trigger context creation
}
The threads do start in parallel as expected. However, profiling with nvvp shows that the context-creation time grows linearly with the number of GPUs, for all GPUs at once: initializing a single GPU takes roughly 200 ms; with 2 GPUs (and 2 host threads) it takes 400 ms for both; with 3 GPUs, 600 ms; and with all 4 GPUs, 800 ms. In each case the threads start at the same time and finish at the same time.
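For reference, here is a minimal standalone reproducer of what I’m doing, timing each thread with omp_get_wtime instead of relying on nvvp (the timing and printout are my additions; the core is the same parallel region as above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    // One host thread per GPU; each triggers context creation on its device.
    #pragma omp parallel num_threads(ngpus)
    {
        int i = omp_get_thread_num();
        double t0 = omp_get_wtime();
        cudaSetDevice(i);
        cudaFree(0); // dummy call to force context creation
        double t1 = omp_get_wtime();
        printf("GPU %d: context created in %.0f ms\n", i, (t1 - t0) * 1e3);
    }
    return 0;
}
```

Compiled with: nvcc -Xcompiler -fopenmp repro.cu -o repro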
Can anyone shed some light on what is happening here? I was expecting all 4 GPUs to initialize in parallel in about 200 ms. I’m using gcc 5.4.0 and CUDA 8.0.61.