Hi, I was trying to optimize an image processing algorithm on a multi-GPU system with 2 x Tesla K40c. The images are huge (almost 10 GB each) and are subdivided into bursts of roughly 400 MB each. The algorithm can process 2 bursts in parallel, so each GPU has its own dedicated host thread.
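For context, the driving structure looks roughly like this (a simplified sketch, not my actual code; the per-burst processing is omitted):

```cpp
#include <cuda_runtime.h>
#include <thread>

// One host thread per GPU; each thread binds to its device and then
// processes its share of the bursts (kernels and copies omitted here).
static void gpuWorker(int device)
{
    cudaSetDevice(device);           // bind this host thread to one GPU
    // ... allocate device buffers, loop over bursts, launch kernels ...
}

int main()
{
    std::thread t0(gpuWorker, 0);    // Tesla K40c #0
    std::thread t1(gpuWorker, 1);    // Tesla K40c #1
    t0.join();
    t1.join();
    return 0;
}
```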
In order to optimize data transfers between host and device memory I've introduced pinned memory: the input host buffer is allocated with cudaHostAlloc() and the flags cudaHostAllocPortable and cudaHostAllocWriteCombined, while the output host buffer is allocated with only the flag cudaHostAllocPortable.
The input and output buffers in host memory are each sized to hold 2 consecutive bursts, and memcopies always operate on distinct, non-overlapping address ranges.
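For reference, the allocation looks roughly like this (a simplified sketch; the buffer names and exact sizes are placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t burstBytes  = 400UL * 1024 * 1024;  // ~400 MB per burst
    const size_t bufferBytes = 2 * burstBytes;       // holds 2 consecutive bursts

    unsigned char *hIn  = nullptr;  // host -> device staging buffer
    unsigned char *hOut = nullptr;  // device -> host staging buffer

    // Input buffer: portable (usable from both GPUs' host threads) and
    // write-combined (the host only writes into it, never reads from it).
    cudaHostAlloc((void**)&hIn, bufferBytes,
                  cudaHostAllocPortable | cudaHostAllocWriteCombined);

    // Output buffer: portable only, because the host must read it back.
    cudaHostAlloc((void**)&hOut, bufferBytes, cudaHostAllocPortable);

    printf("allocated 2 pinned buffers of %zu bytes each\n", bufferBytes);

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}
```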
The problem is that the throughput of cudaMemcpyHostToDevice is almost 10 GB/s, as expected, while that of cudaMemcpyDeviceToHost is only 330 MB/s.
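Here is a minimal sketch of the kind of measurement I'm doing (error checking omitted; names are placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Time one cudaMemcpy with CUDA events and return the bandwidth in GB/s.
static float copyBandwidthGBs(void *dst, const void *src, size_t bytes,
                              cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)bytes / 1e9f / (ms / 1e3f);   // GB/s
}

int main()
{
    const size_t bytes = 400UL * 1024 * 1024;   // one ~400 MB burst
    void *hBuf = nullptr, *dBuf = nullptr;
    cudaHostAlloc(&hBuf, bytes, cudaHostAllocPortable);
    cudaMalloc(&dBuf, bytes);

    printf("H2D: %6.2f GB/s\n",
           copyBandwidthGBs(dBuf, hBuf, bytes, cudaMemcpyHostToDevice));
    printf("D2H: %6.2f GB/s\n",
           copyBandwidthGBs(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost));

    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}
```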
I've run some tests (in single-GPU mode) with the bandwidthTest application from the NVIDIA samples, and the results are very irregular (sometimes I get 10 GB/s and sometimes 2 GB/s):
************** TEST 1 **************
```
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K40c
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10264.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1944.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     183856.9

Result = PASS
```
************** TEST 2 **************
```
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K40c
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10263.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10252.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     183492.8

Result = PASS
```
The machine runs 64-bit Linux. Since it is a NUMA architecture, I've also tried forcing the kernel to always schedule the host threads on the same CPU cores, but I got the same results.
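The pinning I tried looks roughly like this (Linux-specific sketch; the core IDs are just examples, since the right mapping depends on which NUMA node each GPU is attached to):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling host thread to a single CPU core so it stays on the
// NUMA node closest to the GPU it drives.
static void pinCurrentThreadToCore(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

int main()
{
    pinCurrentThreadToCore(0);   // example: core 0 for the thread driving GPU 0
    printf("host thread pinned\n");
    return 0;
}
```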
I've also checked that the GPUs are mounted in PCIe 3.0 x16 slots.
Has anyone had similar issues, or does anyone have an idea of how to solve this?
Many thanks