I know that this question is frequently asked on the Developer Forums, but I can’t solve my problem without asking my own question with a detailed explanation.
I have written a program that tries to use the full memory of one card of my Titan-Z, and the overhead of data transfers accounts for more than 90% of the processing time. I’m trying to offload a buffer of ~32 million ints to GPU memory, but I can’t reach a transfer bandwidth higher than 3 GB/s.
It is important to explain that my code calls cudaMallocHost only once, to allocate enough for the full memory of the GPU. When I used pinned memory I achieved more than ~7 GB/s in data transfer, but my computation on the GPU then became more expensive, so the final time (computation time + data transfer) is longer than when I use pageable memory.
Is there another way to improve bandwidth without increasing the computation time on the GPU? I’ve tried using CUDA streams to parallelize the transfer of array chunks, but my bandwidth remained the same.
My hardware specifications are as follows:
Titan-Z:
6 GB (single card used)
GDDR5 384 bit (of single card)
Maximum BW: 336 GB/sec
PCI-Express 3.0 x16 (diagnosed in GPU-Z)
Motherboard bus:
PCI-Express 3.0 x16 (diagnosed in CPU-Z)
Are you absolutely sure that using pinned memory on the host side makes the computation on the device more expensive? That doesn’t sound right to me.
Also, if the data transfer accounts for 90% of your program’s time, then what you’re trying to do probably isn’t a good candidate for massive parallelization on GPU.
Pinned memory doubles the bandwidth (approximately) which halves the transfer time (approximately).
However, the process of pinning memory using cudaHostAlloc takes much longer than an allocation using malloc.
The net effect on an application which only does a single transfer from/to pinned memory is basically negligible, in terms of execution time. The reduced transfer time is offset by the increased pinning (allocation) time.
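One way to check this trade-off on a given machine is to time the allocation and the copy separately for both kinds of host memory. The following is only a minimal sketch: the buffer size mirrors the ~32 million ints from the question, and the helper names gbPerSec and nowMs are made up for illustration, not part of any API.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: throughput in GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

// Hypothetical helper: wall-clock time in milliseconds.
static double nowMs() {
    using namespace std::chrono;
    return duration<double, std::milli>(
               steady_clock::now().time_since_epoch()).count();
}

int main() {
    const size_t n = 32u * 1000u * 1000u;  // ~32 million ints (assumed size)
    const size_t bytes = n * sizeof(int);  // 128 MB

    int *dev = nullptr;
    CHECK(cudaMalloc((void **)&dev, bytes));  // also creates the context up front

    // Pageable path: malloc, touch the pages, then copy.
    double t0 = nowMs();
    int *pageable = static_cast<int *>(malloc(bytes));
    if (!pageable) { fprintf(stderr, "malloc failed\n"); return EXIT_FAILURE; }
    memset(pageable, 0, bytes);
    double t1 = nowMs();
    CHECK(cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());  // make sure the DMA has really finished
    double t2 = nowMs();

    // Pinned path: cudaHostAlloc, touch the pages, then copy.
    int *pinned = nullptr;
    double t3 = nowMs();
    CHECK(cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault));
    memset(pinned, 0, bytes);
    double t4 = nowMs();
    CHECK(cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());
    double t5 = nowMs();

    printf("pageable: alloc %.1f ms, copy %.1f ms (%.2f GB/s), total %.1f ms\n",
           t1 - t0, t2 - t1, gbPerSec(bytes, t2 - t1), t2 - t0);
    printf("pinned:   alloc %.1f ms, copy %.1f ms (%.2f GB/s), total %.1f ms\n",
           t4 - t3, t5 - t4, gbPerSec(bytes, t5 - t4), t5 - t3);

    free(pageable);
    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    return 0;
}
```

Comparing the two "total" figures, rather than the copy figures alone, is what shows whether pinning pays off for a single transfer.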
Pinned memory will provide a benefit if:
multiple transfers are made to/from the buffer
overlap of copy and compute operations is desired (in which case pinning is mandatory)
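As a rough illustration of the second case, here is a sketch of chunked copy/compute overlap using a pinned buffer and several streams. The kernel addOne is only a placeholder for the real computation, the chunk count of 4 is an arbitrary assumption, and chunkElems is a made-up helper.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real computation.
__global__ void addOne(int *data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: elements in chunk c when n elements are split
// into numChunks chunks of ceil(n / numChunks) elements each.
static size_t chunkElems(size_t n, size_t numChunks, size_t c) {
    size_t base = (n + numChunks - 1) / numChunks;
    size_t off = c * base;
    if (off >= n) return 0;
    return (n - off < base) ? n - off : base;
}

int main() {
    const size_t n = 32u * 1000u * 1000u;  // ~32 million ints (assumed size)
    const size_t numChunks = 4;            // arbitrary choice for the sketch
    const size_t base = (n + numChunks - 1) / numChunks;

    int *host = nullptr, *dev = nullptr;
    // Pinned host memory is what lets cudaMemcpyAsync overlap with kernels.
    CHECK(cudaHostAlloc((void **)&host, n * sizeof(int), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&dev, n * sizeof(int)));
    for (size_t i = 0; i < n; ++i) host[i] = static_cast<int>(i);

    cudaStream_t streams[numChunks];
    for (size_t c = 0; c < numChunks; ++c) CHECK(cudaStreamCreate(&streams[c]));

    // Each stream handles one chunk: copy in, compute, copy out.
    for (size_t c = 0; c < numChunks; ++c) {
        size_t len = chunkElems(n, numChunks, c);
        if (len == 0) continue;
        size_t off = c * base;
        CHECK(cudaMemcpyAsync(dev + off, host + off, len * sizeof(int),
                              cudaMemcpyHostToDevice, streams[c]));
        unsigned blocks = static_cast<unsigned>((len + 255) / 256);
        addOne<<<blocks, 256, 0, streams[c]>>>(dev + off, len);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host + off, dev + off, len * sizeof(int),
                              cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (size_t i = 0; i < n && ok; ++i) ok = (host[i] == static_cast<int>(i) + 1);
    printf("result check: %s\n", ok ? "ok" : "FAILED");

    for (size_t c = 0; c < numChunks; ++c) CHECK(cudaStreamDestroy(streams[c]));
    CHECK(cudaFreeHost(host));
    CHECK(cudaFree(dev));
    return ok ? 0 : 1;
}
```

Note that this hides transfer time behind compute time; it does not raise the bandwidth of any individual transfer.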
Remote diagnosis is hard. It is close to impossible to give specific advice in this situation without seeing your code and your benchmarking framework and knowing a bit more about your system.
A number of things look odd here, in particular your statement that the application runs more slowly with pinned memory, and the fact that you achieve at most 7 GB/sec throughput although PCIe is configured for gen3 x16. What are the block sizes of the transfers? You would need blocks of 16 MB or more to achieve full throughput.
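To see how throughput depends on block size on a particular system, a sweep along these lines could be used. This is a sketch only: the size range (64 KiB to 128 MiB), the repetition count, and the helper gbPerSec are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: throughput in GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

int main() {
    const size_t maxBytes = 128u << 20;  // 128 MiB upper bound (assumed)
    const int reps = 10;                 // copies averaged per size (assumed)

    void *host = nullptr, *dev = nullptr;
    CHECK(cudaHostAlloc(&host, maxBytes, cudaHostAllocDefault));  // pinned
    CHECK(cudaMalloc(&dev, maxBytes));
    memset(host, 0, maxBytes);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    for (size_t bytes = 64u << 10; bytes <= maxBytes; bytes *= 2) {
        // Warm-up copy so one-time costs are not included in the timing.
        CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));

        CHECK(cudaEventRecord(start));
        for (int r = 0; r < reps; ++r) {
            CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0));
        }
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%10zu bytes: %.2f GB/s\n", bytes,
               gbPerSec(bytes * static_cast<size_t>(reps), ms));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFreeHost(host));
    CHECK(cudaFree(dev));
    return 0;
}
```

If the rate is still climbing at the largest sizes, the transfers in the application are probably too small; if it flattens out well below the expected PCIe rate, the link configuration is the more likely suspect.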
Is this a dual socket system by any chance? If so, make sure to use appropriate CPU and memory affinity settings such that the GPU “talks” to the “near” CPU/memory. What CPU is in this system? Are you using simultaneous H2D and D2H transfers? If so, your system memory may not provide sufficient bandwidth: based on your description it has around 21 GB/sec theoretical peak throughput, which is not enough to drive full-duplex PCIe gen3 at 12 GB/sec per direction.
If there are multiple PCIe slots on the board, they may not all be created equal. Or, the CPU may not provide enough PCIe lanes to allow all PCIe devices in the machine to operate at the x16 settings, causing demotion to x8 operation. There could also be system BIOS configuration issues.
If you run the application with the profiler, does it clearly show that the GPU and/or transfers to the GPU are the bottleneck? Is it possible there are host-side issues that have not been explored in enough detail?
As njuffa mentioned, maybe the CPU does not have enough PCIe lanes to support 2 GPUs at PCIe 3.0 x16.
There are only a few CPUs which have 40 lanes, such as the i7 5930k 3.5 GHz or the older i7 4820k 3.7 GHz. Newer popular CPUs such as the 5820 and the 4790 do not have enough lanes to support 2 GPUs at PCIe 3.0 x16.
Which exact CPU and which exact motherboard are you using? How many GPUs total?
Hello. I’m reading a paper about dynamically managing GPU memory during the training process of a DNN (Capuchin, http://alchem.usc.edu/portal/static/download/capuchin.pdf). It says ‘Second, because pinned memory transfer occupies unidirectional PCIe lane exclusively, a swap cannot start until its preceding swap finishes.’
So I’m wondering: if I use pinned memory, can I transfer multiple arrays at the same time? If I can, how does CUDA distribute the bandwidth among the transfer tasks?
PCIe is a full-duplex interconnect. This means that if you have a GPU with at least two copy engines, a host->device transfer can occur concurrently with a device->host transfer.
However, at any given time there can only be one active transfer in a particular direction, so two transfers in the same direction will happen consecutively in issue order.
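A small experiment can make this visible: query the number of copy engines, then time a host->device copy, a device->host copy, and both issued together in separate streams. The sketch below assumes device 0 and a 128 MiB buffer; nowMs and gbPerSec are made-up helper names.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: throughput in GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

// Hypothetical helper: wall-clock time in milliseconds.
static double nowMs() {
    using namespace std::chrono;
    return duration<double, std::milli>(
               steady_clock::now().time_since_epoch()).count();
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));  // device 0 is an assumption
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);

    const size_t bytes = 128u << 20;  // 128 MiB per direction (assumed)
    void *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaHostAlloc(&hIn, bytes, cudaHostAllocDefault));
    CHECK(cudaHostAlloc(&hOut, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc(&dIn, bytes));
    CHECK(cudaMalloc(&dOut, bytes));
    memset(hIn, 0, bytes);
    CHECK(cudaMemset(dOut, 0, bytes));

    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // Warm-up so one-time costs do not distort the first measurement.
    CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));

    double t0 = nowMs();
    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaStreamSynchronize(up));
    double t1 = nowMs();
    CHECK(cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaStreamSynchronize(down));
    double t2 = nowMs();
    // Opposite directions in different streams may run concurrently.
    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaDeviceSynchronize());
    double t3 = nowMs();

    printf("H2D alone: %.2f GB/s\n", gbPerSec(bytes, t1 - t0));
    printf("D2H alone: %.2f GB/s\n", gbPerSec(bytes, t2 - t1));
    printf("both:      %.1f ms (sum of separate runs: %.1f ms)\n",
           t3 - t2, t2 - t0);

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    return 0;
}
```

On a GPU with two copy engines the combined time would be expected to come out closer to the longer of the two individual copies than to their sum, provided system memory can sustain both directions.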
This is really orthogonal to transfers from/to pinned memory. Transfers from/to pageable memory basically also use a pinned memory buffer which however is internal to the driver, because transfers by DMA require a physically contiguous buffer. The overhead of transfers to pageable memory comes from needing an additional copy for moving data between user’s data space and this internal pinned buffer. The faster the host’s system memory, the less overhead there is.
A given transfer is referenced to a single pointer. If your multiple arrays are all referenced via a single pointer (e.g. they are contiguous in memory) then yes you can transfer them at the same time, and the behavior is no different than a single transfer of a large array. If your multiple arrays are referenced via multiple pointers (they are not contiguous, and also not arranged via strides) then it is not possible to transfer them “at the same time”. They require a separate transfer request for each array, and the behavior will be as described above by njuffa.
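If a single transfer request is wanted for arrays that live at unrelated addresses, one option is to pack them into one contiguous pinned staging buffer first. This is only a sketch under the assumption that the extra host-side copy is acceptable; the array lengths are arbitrary and packOffsets is a made-up helper.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: offsets[i] becomes the start (in elements) of
// array i inside the packed buffer; returns the total element count.
static size_t packOffsets(const size_t *lens, size_t count, size_t *offsets) {
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = total;
        total += lens[i];
    }
    return total;
}

int main() {
    const size_t count = 3;
    const size_t lens[count] = {1000, 2500, 4000};  // arbitrary lengths
    size_t offsets[count];
    const size_t total = packOffsets(lens, count, offsets);

    // Source arrays referenced via separate pointers; array i holds i + 1.
    std::vector<std::vector<int>> src(count);
    for (size_t i = 0; i < count; ++i) src[i].assign(lens[i], static_cast<int>(i) + 1);

    int *staging = nullptr, *dev = nullptr;
    CHECK(cudaHostAlloc((void **)&staging, total * sizeof(int), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&dev, total * sizeof(int)));

    // Host-side packing: this is the extra copy the approach costs.
    for (size_t i = 0; i < count; ++i) {
        memcpy(staging + offsets[i], src[i].data(), lens[i] * sizeof(int));
    }

    // One transfer request covers all three arrays.
    CHECK(cudaMemcpy(dev, staging, total * sizeof(int), cudaMemcpyHostToDevice));

    // On the device, array i starts at dev + offsets[i]. Check array 1.
    std::vector<int> back(lens[1]);
    CHECK(cudaMemcpy(back.data(), dev + offsets[1], lens[1] * sizeof(int),
                     cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int v : back) ok = ok && (v == 2);
    printf("array 1 round trip: %s\n", ok ? "ok" : "FAILED");

    CHECK(cudaFreeHost(staging));
    CHECK(cudaFree(dev));
    return ok ? 0 : 1;
}
```

Whether packing is a net win depends on how fast the host can do the extra memcpy compared with the per-request overhead of many small transfers.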