Why can't I use my full PCI-Express bandwidth?

Hello, Forum!

I know this question is frequently asked on the Developer Forums, but I couldn't solve my problem without asking it myself with a detailed explanation.

I have written a program that tries to use the full memory of one card of my Titan-Z, and the overhead of the data transfers accounts for more than 90% of the total processing time. I'm trying to offload a buffer of ~32 million ints to GPU memory, but I can't reach a transfer bandwidth higher than 3 GB/s.

It's important to explain that my code calls cudaMallocHost a single time to allocate a buffer covering the full memory of the GPU. When I used pinned memory I achieved more than ~7 GB/s in the data transfer, but my computation on the GPU became more expensive afterwards, so the final time (computation time + data transfer) is larger than when I use pageable memory.

Is there another way to improve the bandwidth without increasing the computation time on the GPU? I've tried using CUDA streams to overlap the transfer of array chunks, but my bandwidth stayed the same.

My hardware specifications are as follows:

Titan-Z:
6 GB (single card used)
GDDR5, 384-bit (of a single card)
Maximum BW: 336 GB/sec
PCI-Express 3.0 x16 (diagnosed in GPU-Z)

Motherboard bus:
PCI-Express 3.0 x16 (diagnosed in CPU-Z)

My Host Memory:
16 GB DDR3 (2x Dual Channel)
1333 MHz

Thanks for everything!!!

Are you absolutely sure that using pinned memory on the host side makes the computation happening on the device more expensive? That doesn't sound right to me.

Also, if the data transfer accounts for 90% of your program's time, then what you're trying to do probably isn't a good candidate for massive parallelization on the GPU.

Pinned memory doubles the bandwidth (approximately) which halves the transfer time (approximately).

However, the process of pinning memory using cudaHostAlloc takes much longer than an allocation using malloc.

The net effect on an application which only does a single transfer from/to pinned memory is basically negligible, in terms of execution time. The reduced transfer time is offset by the increased pinning (allocation) time.

Pinned memory will provide a benefit if:

  1. multiple transfers are made to/from the buffer
  2. overlap of copy and compute operations is desired (in which case pinning is mandatory); see the sketch below
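As a rough illustration of point 2, here is a minimal sketch of the copy/compute overlap pattern with a pinned buffer, cudaMemcpyAsync, and streams. The kernel `process`, the chunk count, and the sizes are placeholders, and error checking is omitted:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation.
__global__ void process(int *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main()
{
    const size_t N       = 32u * 1024u * 1024u;   // ~32 million ints, illustrative
    const int    nChunks = 4;
    const size_t chunk   = N / nChunks;

    int *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(int));     // pinned host buffer, allocated once and reused
    cudaMalloc(&d_data, N * sizeof(int));

    cudaStream_t streams[nChunks];
    for (int i = 0; i < nChunks; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nChunks; ++i) {
        size_t off = (size_t)i * chunk;
        unsigned int blocks = (unsigned int)((chunk + 255) / 256);
        // Async copies require pinned host memory; the copy of one chunk
        // can overlap with the kernel working on another chunk.
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<blocks, 256, 0, streams[i]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nChunks; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```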

Remote diagnosis is hard. It is close to impossible to give specific advice in this situation without seeing your code and your benchmarking framework, and without knowing a bit more about your system.

A number of things look odd here, in particular your statement that the application runs more slowly with pinned memory, and the fact that you achieve at most 7 GB/sec throughput although PCIe is configured for gen3 x16. What are the block sizes of the transfers? You would need blocks of 16 MB or more to achieve full throughput.
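One way to check is to time the same host-to-device copy at a range of block sizes with CUDA events. A minimal sketch (sizes are illustrative, error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t maxBytes = 64u << 20;            // sweep 1 MB up to 64 MB
    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, maxBytes);             // pinned host buffer
    cudaMalloc(&d_buf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 1u << 20; bytes <= maxBytes; bytes <<= 1) {
        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%8zu KB : %6.2f GB/s\n", bytes >> 10, (bytes / 1.0e9) / (ms / 1.0e3));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```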

Is this a dual-socket system by any chance? If so, make sure to use appropriate CPU and memory affinity settings so that the GPU “talks” to the “near” CPU/memory. What CPU is in this system? Are you using simultaneous H2D and D2H transfers? If so, your system memory may not provide sufficient bandwidth: based on your description it offers around 21 GB/sec theoretical peak throughput, which is not enough to drive full-duplex PCIe gen3 at 12 GB/sec per direction.

If there are multiple PCIe slots on the board, they may not all be created equal. Or, the CPU may not provide enough PCIe lanes to allow all PCIe devices in the machine to operate at the x16 settings, causing demotion to x8 operation. There could also be system BIOS configuration issues.

If you run the application with the profiler, does it clearly show that the GPU and/or transfers to the GPU are the bottleneck? Is it possible there are host-side issues that have not been explored in enough detail?

As njuffa mentioned, maybe the CPU does not have enough PCIe lanes to support 2 GPUs at PCIe 3.0 x16.

There are only a few CPUs with 40 lanes, such as the i7-5930K (3.5 GHz) or the older i7-4820K (3.7 GHz). Newer popular CPUs such as the 5820 and the 4790 do not have enough lanes to support 2 GPUs at PCIe 3.0 x16.

Which exact CPU and which exact motherboard are you using? How many GPUs total?

Hello. I'm reading a paper about dynamically managing GPU memory during the training process of DNNs (Capuchin, http://alchem.usc.edu/portal/static/download/capuchin.pdf). It says: ‘Second, because pinned memory transfer occupies unidirectional PCIe lane exclusively, a swap cannot start until its preceding swap finishes.’

So I'm wondering: if I use pinned memory, can I transfer multiple arrays at the same time? If I can, how does CUDA distribute the bandwidth among the transfer tasks?

PCIe is a full-duplex interconnect. This means that if you have a GPU with at least two copy engines, a host->device transfer can occur concurrently with a device->host transfer.

However, at any given time there can only be one active transfer in a particular direction, so two transfers in the same direction will happen consecutively in issue order.
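For illustration, a minimal sketch of the full-duplex case, issuing one H2D and one D2H copy in separate streams from/to pinned buffers, assuming a GPU with at least two copy engines (buffer names and sizes are illustrative, error checking omitted):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64u << 20;               // 64 MB each way, illustrative
    char *h_in, *h_out, *d_in, *d_out;

    cudaMallocHost(&h_in,  bytes);                // pinned host buffers are required
    cudaMallocHost(&h_out, bytes);                // for cudaMemcpyAsync to overlap
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // With two copy engines, these two copies can proceed concurrently,
    // one per PCIe direction.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```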

This is really orthogonal to transfers from/to pinned memory. Transfers from/to pageable memory also use a pinned memory buffer, which is however internal to the driver, because DMA transfers require a physically contiguous buffer. The overhead of transfers to/from pageable memory comes from the additional copy needed to move data between the user's data space and this internal pinned buffer. The faster the host's system memory, the lower this overhead.

A given transfer references a single pointer. If your multiple arrays are all referenced via a single pointer (e.g. they are contiguous in memory), then yes, you can transfer them at the same time, and the behavior is no different from a single transfer of one large array. If your multiple arrays are referenced via multiple pointers (they are not contiguous, and also not arranged via strides), then it is not possible to transfer them “at the same time”. They require a separate transfer request for each array, and the behavior will be as described above by njuffa.
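In code, the two cases look roughly like this (a sketch; buffer names and sizes are placeholders, error checking omitted):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1u << 20;                    // elements per array, illustrative
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Case 1: two arrays packed into one contiguous pinned buffer -> one transfer.
    float *h_pack, *d_pack;
    cudaMallocHost(&h_pack, 2 * n * sizeof(float));
    cudaMalloc(&d_pack, 2 * n * sizeof(float));
    cudaMemcpyAsync(d_pack, h_pack, 2 * n * sizeof(float),
                    cudaMemcpyHostToDevice, s);

    // Case 2: two separately allocated arrays -> two transfer requests,
    // which execute consecutively in issue order (same PCIe direction).
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, n * sizeof(float));
    cudaMallocHost(&h_b, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFreeHost(h_pack);  cudaFree(d_pack);
    cudaFreeHost(h_a);     cudaFreeHost(h_b);
    cudaFree(d_a);         cudaFree(d_b);
    return 0;
}
```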