Tesla M40 Multi-GPU Performance Issue


I have a program that runs identical CUDA code (on different data sets
of the same size) on 2 GPUs for multiple iterations.

In each iteration:

  • Copy data array 0 from CPU to GPU 0 and data array 1 to GPU 1
  • Use cuFFT to compute convolution of the input data and a filter (the filter is previously computed and stored on each GPU)
  • Copy convolution results back to CPU
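
For reference, one iteration of this kind with per-stage timing could be sketched roughly as below. This is not the original program: the 64^3 volume size, the single-precision complex-to-complex transforms, the frequency-domain filter, the pointwise_mul kernel name, and driving the GPUs one after another are all assumptions made for illustration. Timing each stage per device with CUDA events is one way to see which stage slows down and when.

```cuda
// Per-stage timing of the copy / convolve / copy-back loop on up to 2 GPUs.
// Assumed for illustration: 64^3 complex volumes, C2C transforms, and GPUs
// driven one after another (the original program may run them concurrently).
// Build (assumed file name): nvcc timing.cu -lcufft
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

#define CHECK(cond, msg)                                                  \
    do {                                                                  \
        if (!(cond)) {                                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, msg); \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)
#define CHECK_CUDA(call) CHECK((call) == cudaSuccess, #call)
#define CHECK_FFT(call)  CHECK((call) == CUFFT_SUCCESS, #call)

// Point-wise complex multiply: data[i] = data[i] * filter[i] * scale.
__global__ void pointwise_mul(cufftComplex* data, const cufftComplex* filter,
                              size_t n, float scale) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex a = data[i], b = filter[i];
        data[i].x = (a.x * b.x - a.y * b.y) * scale;
        data[i].y = (a.x * b.y + a.y * b.x) * scale;
    }
}

struct Gpu {
    cufftComplex* data;    // input, transformed in place
    cufftComplex* filter;  // filter, uploaded once before the loop
    cufftHandle plan;
    cudaEvent_t ev[4];     // stage boundaries: start, after H2D, after conv, after D2H
};

int main() {
    const int nx = 64, ny = 64, nz = 64;  // assumed problem size
    const size_t n = (size_t)nx * ny * nz;
    const size_t bytes = n * sizeof(cufftComplex);
    const int iterations = 50;            // assumed; runs well past ~20

    int ngpu = 0;
    CHECK_CUDA(cudaGetDeviceCount(&ngpu));
    if (ngpu > 2) ngpu = 2;

    std::vector<cufftComplex> input(n, make_cuFloatComplex(1.0f, 0.0f));
    std::vector<cufftComplex> result(n);
    std::vector<Gpu> gpus(ngpu);

    for (int d = 0; d < ngpu; ++d) {
        Gpu& g = gpus[d];
        CHECK_CUDA(cudaSetDevice(d));
        CHECK_CUDA(cudaMalloc(&g.data, bytes));
        CHECK_CUDA(cudaMalloc(&g.filter, bytes));
        CHECK_CUDA(cudaMemcpy(g.filter, input.data(), bytes, cudaMemcpyHostToDevice));
        CHECK_FFT(cufftPlan3d(&g.plan, nx, ny, nz, CUFFT_C2C));
        for (cudaEvent_t& e : g.ev) CHECK_CUDA(cudaEventCreate(&e));
    }

    for (int it = 0; it < iterations; ++it) {
        for (int d = 0; d < ngpu; ++d) {
            Gpu& g = gpus[d];
            CHECK_CUDA(cudaSetDevice(d));
            CHECK_CUDA(cudaEventRecord(g.ev[0]));
            CHECK_CUDA(cudaMemcpy(g.data, input.data(), bytes, cudaMemcpyHostToDevice));
            CHECK_CUDA(cudaEventRecord(g.ev[1]));
            CHECK_FFT(cufftExecC2C(g.plan, g.data, g.data, CUFFT_FORWARD));
            pointwise_mul<<<(unsigned)((n + 255) / 256), 256>>>(
                g.data, g.filter, n, 1.0f / (float)n);
            CHECK_CUDA(cudaGetLastError());
            CHECK_FFT(cufftExecC2C(g.plan, g.data, g.data, CUFFT_INVERSE));
            CHECK_CUDA(cudaEventRecord(g.ev[2]));
            CHECK_CUDA(cudaMemcpy(result.data(), g.data, bytes, cudaMemcpyDeviceToHost));
            CHECK_CUDA(cudaEventRecord(g.ev[3]));
            CHECK_CUDA(cudaEventSynchronize(g.ev[3]));

            float h2d = 0.0f, conv = 0.0f, d2h = 0.0f;
            CHECK_CUDA(cudaEventElapsedTime(&h2d, g.ev[0], g.ev[1]));
            CHECK_CUDA(cudaEventElapsedTime(&conv, g.ev[1], g.ev[2]));
            CHECK_CUDA(cudaEventElapsedTime(&d2h, g.ev[2], g.ev[3]));
            std::printf("iter %2d gpu %d  H2D %.3f ms  conv %.3f ms  D2H %.3f ms\n",
                        it, d, h2d, conv, d2h);
        }
    }

    for (int d = 0; d < ngpu; ++d) {
        CHECK_CUDA(cudaSetDevice(d));
        CHECK_FFT(cufftDestroy(gpus[d].plan));
        CHECK_CUDA(cudaFree(gpus[d].data));
        CHECK_CUDA(cudaFree(gpus[d].filter));
        for (cudaEvent_t& e : gpus[d].ev) CHECK_CUDA(cudaEventDestroy(e));
    }
    return 0;
}
```

If the per-stage times for GPU 1 all grow together after some iteration while GPU 0 stays flat, that points away from any single kernel and toward something device-wide, such as clocks or thermals.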

When running on 2 Tesla M40s, performance is good at the beginning, but
after ~20 iterations GPU 1 starts slowing down and gets up to 4.5x
slower, while GPU 0 continues to perform well throughout all the
iterations.

Performance problem details
After ~20 iterations all operations (involving GPU 1) are slow, including:

  • Data movement between host and device 1 (via cudaMemcpy())
  • 3D FFT and IFFT using cuFFT
  • Point-wise multiplication

I tried running the same program on a node with 2 Tesla K80s. There,
performance is consistent across both GPUs and throughout all the iterations.

Has anyone had the same problem? Any insights or suggestions on how to investigate this further would be
highly appreciated.

During the slowdown, take a look at the output of nvidia-smi -a. Check the GPU temperature and whether any clock slowdown reasons are listed, and compare the clock speeds with those in the “normal” case.
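
The same readings can also be polled from inside a program through NVML, the library that nvidia-smi is built on. The following is only a minimal sketch, assuming nvml.h from a reasonably recent CUDA toolkit and linking with -lnvidia-ml. describe_throttle_reasons is a hypothetical helper name, and only a subset of the throttle-reason bits is decoded.

```cuda
// Print temperature, current clocks, and active throttle reasons per GPU.
// Build (assumed file name): nvcc monitor.cu -lnvidia-ml
#include <cstdio>
#include <string>
#include <nvml.h>

// Hypothetical helper: turn the NVML throttle-reason bitmask into text.
// Only a few of the defined reason bits are decoded here.
std::string describe_throttle_reasons(unsigned long long mask) {
    std::string out;
    if (mask & nvmlClocksThrottleReasonGpuIdle)           out += "idle ";
    if (mask & nvmlClocksThrottleReasonSwPowerCap)        out += "sw_power_cap ";
    if (mask & nvmlClocksThrottleReasonHwSlowdown)        out += "hw_slowdown ";
    if (mask & nvmlClocksThrottleReasonSwThermalSlowdown) out += "sw_thermal ";
    if (mask & nvmlClocksThrottleReasonHwThermalSlowdown) out += "hw_thermal ";
    return out.empty() ? "none" : out;
}

int main() {
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    rc = nvmlDeviceGetCount(&count);
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlDeviceGetCount failed: %s\n", nvmlErrorString(rc));
        nvmlShutdown();
        return 1;
    }

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        // Values stay 0 if a query is unsupported on this device.
        unsigned int temp_c = 0, sm_mhz = 0, mem_mhz = 0;
        unsigned long long reasons = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_mhz);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

        std::printf("GPU %u: %u C, SM %u MHz, MEM %u MHz, throttle: %s\n",
                    i, temp_c, sm_mhz, mem_mhz,
                    describe_throttle_reasons(reasons).c_str());
    }

    nvmlShutdown();
    return 0;
}
```

Running this once per iteration, or in a separate process while the main program runs, gives a clock and temperature trace to compare before and after the slowdown. Note that NVML numbers devices in PCI bus order, which may not match the CUDA device numbering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.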