Large cudaMalloc() times when multiple ranks access multiple GPUs.

I am in the process of porting an application to a GPU cluster. I have tried two approaches to using multiple GPUs, and I am getting some unexpected numbers. Can anyone please shed some light on what could be going wrong?

Approach 1:

  • The number of ranks launched is less than or equal to the number of GPUs. Each rank is assigned an equal number of GPUs and distributes its workload across them.
  • Each rank is assigned its GPU(s) exclusively, i.e. only one specific rank ever launches kernels on a given GPU (a simplified sketch of what I mean is shown after this list).
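
For clarity, here is a minimal sketch of what I mean by the exclusive assignment in Approach 1 (the kernel, the work split, and all names are illustrative placeholders, not my actual code):

```cpp
// Illustrative sketch only: ranks <= GPUs, each rank owns an exclusive,
// equal-sized slice of the devices on the node.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void work_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;                    // placeholder workload
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    int gpus_per_rank = ndev / nranks;            // assumes nranks <= ndev and an even split

    const int n = 1 << 20;
    for (int g = 0; g < gpus_per_rank; ++g) {
        int dev = rank * gpus_per_rank + g;       // no other rank ever touches this device
        cudaSetDevice(dev);
        float *d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));
        work_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFree(d_buf);
    }

    MPI_Finalize();
    return 0;
}
```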

Approach 2:

  • A large number of ranks is launched; the rank count can be much larger than the number of GPUs available on the node.
  • Multiple ranks can access the same GPU.
  • Whenever a rank has some work ready, it launches a kernel on the GPU assigned to it.
  • Kernel launches from multiple ranks on the same GPU are managed by NVIDIA's MPS with its default settings (a simplified sketch of this pattern is shown after this list).
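
And the corresponding sketch of Approach 2 (again illustrative; the allocate/launch/free loop is only a stand-in for the real per-work-item processing):

```cpp
// Illustrative sketch only: many ranks per node, several ranks share one GPU
// through the MPS daemon; each rank launches work as soon as it is ready.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void work_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;                    // placeholder workload
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);                   // several ranks map onto the same device

    const int n = 1 << 20;
    const int num_work_items = 100;               // stand-in for "whenever work is ready"
    for (int w = 0; w < num_work_items; ++w) {
        float *d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));    // fresh allocation per work item
        work_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFree(d_buf);                          // released again right away
    }

    MPI_Finalize();
    return 0;
}
```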

The GPU execution time of Approach 2 is about 5x that of Approach 1. nvprof tells me that most of the time is spent in cudaMalloc() calls (the way I collect per-rank profiles is sketched below). When I halve the number of ranks in Approach 2, the cudaMalloc() time is also halved. Can anyone please explain what is going on here? What is the best way to implement this kind of approach, where multiple CPU ranks and multiple GPUs are involved?
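
For reference, this is roughly how I collect the per-rank profiles (the binary name and rank count are placeholders):

```sh
# one profile file per MPI process; %p expands to the process id
mpirun -np 16 nvprof -o profile.%p.nvprof ./my_app

# or a plain-text summary per rank, which is where cudaMalloc() shows up at the top
mpirun -np 16 nvprof --log-file nvprof.%p.log ./my_app
```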

I am more interested in using Approach 2 because it drastically decreases the time it takes to perform the CPU work prior to launching the GPU kernels, and it gives better CPU utilization.