Hi,
I have a program that runs identical CUDA code (on different sets of data
of the same size) on 2 GPUs for multiple iterations.
In each iteration (a simplified code sketch follows this list):
- Copy data array 0 from CPU to GPU 0 and data array 1 to GPU 1
- Use cuFFT to compute convolution of the input data and a filter (the filter is previously computed and stored on each GPU)
- Copy convolution results back to CPU
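For reference, here is a minimal sketch of what one iteration does on each GPU. The names are illustrative, and plan creation, filter upload, and error checking are omitted, so this is a simplified outline rather than my exact code:

```c
#include <cuda_runtime.h>
#include <cufft.h>

// Frequency-domain point-wise multiply: data *= filter, with IFFT scaling folded in.
__global__ void pointwiseMul(cufftComplex *a, const cufftComplex *b, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex v;
        v.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
        v.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
        a[i] = v;
    }
}

// One iteration on one GPU: H2D copy, forward 3D FFT, multiply by filter,
// inverse FFT, D2H copy. The cuFFT plan and d_filter are created once per device.
void runIteration(int dev, cufftHandle plan, cufftComplex *d_data,
                  const cufftComplex *d_filter, cufftComplex *h_data, size_t n)
{
    cudaSetDevice(dev);
    cudaMemcpy(d_data, h_data, n * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);              // 3D FFT
    pointwiseMul<<<(unsigned)((n + 255) / 256), 256>>>(d_data, d_filter, (int)n, 1.0f / n);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);              // 3D IFFT
    cudaMemcpy(h_data, d_data, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
}
```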
Problem:
When running on 2 Tesla M40s, performance is good at the
beginning, but after ~20 iterations GPU 1 starts slowing down and gets up to
4.5x worse, while GPU 0 continues to perform well consistently throughout all
the iterations.
Performance problem details:
After ~20 iterations, every operation involving GPU 1 becomes slow (a per-stage timing sketch follows this list), including:
- Data movement between host and device 1 (via cudaMemcpy())
- 3D FFT and IFFT using cuFFT
- Point-wise multiplication
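In case it helps, this is roughly how an individual stage can be timed with CUDA events to see the per-operation slowdown. It is only a sketch: plan, d_data, iter, and dev are assumed to come from the surrounding iteration loop, not a complete program:

```c
// Time one stage (here the forward FFT) on the current device with CUDA events.
cudaEvent_t t0, t1;
float ms;
cudaEventCreate(&t0);
cudaEventCreate(&t1);

cudaSetDevice(dev);                                  // dev = 0 or 1
cudaEventRecord(t0);
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // or the memcpy / multiply / IFFT stage
cudaEventRecord(t1);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&ms, t0, t1);
printf("iter %d, GPU %d: forward FFT %.3f ms\n", iter, dev, ms);

cudaEventDestroy(t0);
cudaEventDestroy(t1);
```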
I tried running the same program on 2 Tesla K80s. On that node, performance is
consistent across both GPUs and throughout all the iterations.
Has anyone had the same problem? Any insights or suggestions on how to investigate this further would be
highly appreciated.