I have a Linux system with 8 Tesla K80 cards, for a total of 16 GPUs. I would like to send data to all of the GPUs at the same time. At the moment I use cudaMalloc and cudaMemcpy for each GPU, with a separate CPU thread handling each device. However, the per-GPU transfer time grows sharply with the number of GPUs: copying to one GPU takes 0.3 s, but when copying to 8 GPUs it takes 2.7 s per GPU (i.e. per CPU thread).
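For reference, here is a minimal sketch of my current approach (function and variable names like `copyToDevice` and the payload size are just illustrative):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// One CPU thread per GPU, each doing its own cudaMalloc + blocking cudaMemcpy.
void copyToDevice(int dev, const float* hostData, size_t bytes) {
    cudaSetDevice(dev);                  // bind this thread to one GPU
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, hostData, bytes,   // blocking host-to-device copy
               cudaMemcpyHostToDevice);
    cudaFree(d_buf);
}

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);       // 16 on this system
    const size_t bytes = 1 << 28;        // example payload size, not my real data
    std::vector<float> host(bytes / sizeof(float));

    std::vector<std::thread> threads;
    for (int d = 0; d < nDevices; ++d)
        threads.emplace_back(copyToDevice, d, host.data(), bytes);
    for (auto& t : threads) t.join();
    return 0;
}
```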
What is the best way to send data to all GPUs simultaneously?