I am running a tiled Cholesky application on a Linux system with 4 x K40 GPUs, and I am experiencing very low throughput for host <-> device transfers, even though I use pinned memory with cudaMemcpyAsync and streams.
I use 3 streams for data transfers (one each for H2D, D2H, and D2D transfers) and several other streams for kernel launches.
For every kernel, the data chunks it needs are first transferred asynchronously to the GPU, and the kernel is launched once those transfers have finished. All operations are asynchronous and I check for their completion with events, so data transfers overlap with computation.
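In case it helps, here is a minimal sketch of the pattern I mean (function and variable names are placeholders, not my actual code, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

void enqueue_tile(float* h_tile, float* d_tile, size_t bytes,
                  cudaStream_t h2d_stream, cudaStream_t kernel_stream)
{
    cudaEvent_t transfer_done;
    cudaEventCreateWithFlags(&transfer_done, cudaEventDisableTiming);

    // h_tile is pinned (cudaMallocHost / cudaHostRegister), so this copy
    // runs truly asynchronously on the dedicated H2D stream.
    cudaMemcpyAsync(d_tile, h_tile, bytes, cudaMemcpyHostToDevice, h2d_stream);
    cudaEventRecord(transfer_done, h2d_stream);

    // The kernel stream waits on the copy without blocking the host,
    // so other transfers and kernels keep overlapping.
    cudaStreamWaitEvent(kernel_stream, transfer_done, 0);
    // cholesky_tile_kernel<<<grid, block, 0, kernel_stream>>>(d_tile, ...);

    // Safe even if the event is still pending: resources are released
    // asynchronously once the event completes.
    cudaEventDestroy(transfer_done);
}
```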
I profile the application with nvprof and later import the output file into nvvp.
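The profiling workflow is roughly the following (the output file and binary names are just placeholders for mine):

```
nvprof -o chol.nvprof ./tiled_cholesky
```

In the nvvp timeline, I see the following: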
- When running with only 1 GPU, everything works as expected: kernels overlap with data transfers, and the average host <-> device throughput is around 10 GB/s for pinned chunks of 33.5 MB.
- When splitting the computation across 2 GPUs (the number of kernels stays the same, but additional device-to-device transfers must be issued; these go into their own stream, asynchronously, roughly as sketched after this list): data transfers still overlap with kernels, but the average host <-> device throughput drops to only 300 MB/s for the same pinned chunks of 33.5 MB. Some transfers (whether H2D, D2H, or D2D) still reach around 8 GB/s, which is what I expect, but for some reason most of them fall below 400 MB/s. nvvp reports the memory as pinned and shows the transfers in streams other than 0, so I have no idea why this happens.
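For reference, this is roughly how the cross-device copies are issued on the dedicated D2D stream (again a sketch with placeholder names and no error checking; whether the copy actually goes peer-to-peer depends on what cudaDeviceCanAccessPeer reports for the two K40s):

```cpp
#include <cuda_runtime.h>

void copy_tile_between_gpus(float* d_dst, int dst_dev,
                            const float* d_src, int src_dev,
                            size_t bytes, cudaStream_t d2d_stream)
{
    int peer_ok = 0;
    cudaDeviceCanAccessPeer(&peer_ok, dst_dev, src_dev);
    if (peer_ok) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);  // once per device pair
    }
    // With peer access enabled, the copy goes GPU-to-GPU over PCIe;
    // without it, the runtime stages the data through host memory.
    cudaMemcpyPeerAsync(d_dst, dst_dev, d_src, src_dev, bytes, d2d_stream);
}
```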
Any ideas about this throughput slowdown? Does using 2 GPUs have some negative influence on memory bandwidth?
Please let me know if you need further information; it’s my first post… :-)
Thanks in advance!