PCI Bandwidth

fayard · January 12, 2018, 8:43pm

Hi,

I am new to CUDA programming, and my first tests are a few simple benchmarks. I am puzzled by the ones that copy data from host to device and from device to host: I seem to get only about 3 GB/s of bandwidth.

My hardware is as follows:

Dell Precision 7820 tower
Dual Xeon Skylake 6150, 18 cores per socket, 6 DIMMs of 2666 MHz DDR4 memory per socket, for a total of 12 x 8 = 96 GB
Nvidia P1000
Nvidia Titan V

The P1000 is used as the graphics card, with 2 screens connected to it. The Titan V is used for number crunching and is the card on which I only get 3 GB/s. As I am new to PCI devices, and I am the one who plugged in the Titan V, my guess is that I did something suboptimal here. My machine has 5 PCIe slots, labeled as follows:

Slot 1: PCIe3x16 (8, 4, 1)
Slot 2: PCIe3x16 75W <---- P1000
Slot 3: PCIe3x16 (1)
Slot 4: PCIe3x16 75W
Slot 5: PCIe3x16 (4, 1) <---- Titan V

The arrows show where my GPUs are currently plugged in. What should I do to get better PCIe bandwidth on the Titan V?

njuffa · January 12, 2018, 9:40pm

It seems to me your Titan V should go into slot 4 (since slot 2 and slot 4 appear to be the only slots with full x16 interface based on the information provided). You are currently getting 3 GB/sec because the Titan V is in a x4 slot.

fayard · January 12, 2018, 10:49pm

Thanks for your help.

I have moved the Titan V to slot 4. Now I get about 5 GB/s both to the device and from the device. But from time to time, I get about 10 GB/s in one direction or the other.

I guess I am seeing NUMA effects because I have a dual-socket workstation. Can anyone confirm this? If so, how do I allocate memory in the RAM that is closer to the GPU?

What is the best bandwidth I should get on such a configuration when everything is OK?

njuffa · January 12, 2018, 10:56pm

The Skylake 6150 CPU is listed at Intel ARK as providing 48 PCIe lanes, which should be sufficient to drive two GPUs at full PCIe gen3 x16 speed of about 12 GB/sec, provided they are in the correct slot.
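For reference, here is where the numbers come from, as a small sketch. PCIe gen3 signals at 8 GT/s per lane with 128b/130b encoding, so the raw link rate follows from the lane count. The helper name `pcie_gen3_peak_gbps` is made up for this post, and the result is the theoretical ceiling before packet and protocol overhead, which is why measured rates land lower (around 12 GB/sec for x16).

```cpp
#include <cassert>

// Theoretical one-direction PCIe gen3 link rate in GB/s (1 GB = 1e9 bytes),
// before packet/protocol overhead. Hypothetical helper, for illustration only.
double pcie_gen3_peak_gbps(int lanes) {
    // Standard link widths that matter for the slots discussed here.
    assert(lanes == 1 || lanes == 2 || lanes == 4 || lanes == 8 || lanes == 16);
    const double transfers_per_sec = 8.0e9;  // 8 GT/s per lane
    const double encoding = 128.0 / 130.0;   // 128b/130b line code
    const double bits_per_byte = 8.0;
    return lanes * transfers_per_sec * encoding / bits_per_byte / 1.0e9;
}
```

Calling `pcie_gen3_peak_gbps(16)` gives roughly 15.75 GB/s and `pcie_gen3_peak_gbps(4)` roughly 3.94 GB/s, so a measured 3 GB/sec is about what an x4 link can deliver once overhead is taken off.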

It is not clear how well controlled your experiments are. Is the performance data you quoted from the bandwidthTest app that ships with CUDA? Are you using transfers from and to pinned host memory? If not (i.e. you use pageable host memory), each transfer across PCIe also triggers an additional system-memory to system-memory transfer, which will skew results towards lower numbers.
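To make the pageable-versus-pinned distinction concrete, here is a minimal sketch that times the same host-to-device copy from both kinds of host buffer. It is not the bandwidthTest source; the 256 MiB size, the `CHECK` macro, and the `h2d_gbps` helper are choices made for this illustration, and it assumes a single CUDA device and a CUDA toolkit (build with nvcc).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Time one host-to-device copy of n bytes with CUDA events; returns GB/s.
static double h2d_gbps(void* dst, const void* src, std::size_t n) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (n / 1.0e9) / (ms / 1.0e3);
}

int main() {
    const std::size_t n = std::size_t(256) << 20;  // 256 MiB, arbitrary choice

    void* dev = nullptr;
    CHECK(cudaMalloc(&dev, n));

    // Pageable host buffer: the driver has to stage it through its own
    // pinned buffer before the data can go across PCIe.
    unsigned char* pageable = static_cast<unsigned char*>(std::malloc(n));
    if (pageable == nullptr) return EXIT_FAILURE;
    std::memset(pageable, 0, n);

    // Pinned (page-locked) host buffer: the GPU can DMA from it directly.
    unsigned char* pinned = nullptr;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&pinned), n));
    std::memset(pinned, 0, n);

    h2d_gbps(dev, pinned, n);  // warm-up copy, result discarded

    std::printf("pageable H2D: %.2f GB/s\n", h2d_gbps(dev, pageable, n));
    std::printf("pinned   H2D: %.2f GB/s\n", h2d_gbps(dev, pinned, n));

    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

The actual figures depend on the machine, so none are quoted here. Pinned memory is a limited resource, so it makes sense to pin only the buffers that are actually used for transfers.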

Also, PCIe data transport is packetized, meaning you will get better throughput for larger transfers; often it takes a transfer size of >= 8 MB to reach peak throughput.
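One simple way to picture the effect of transfer size is a fixed cost per transfer plus time on the wire. This is a deliberately crude latency-plus-bandwidth sketch, not a model of PCIe packetization itself; the helper name `effective_gbps` and any overhead and peak values fed into it are assumptions, not measurements.

```cpp
#include <cassert>
#include <cstddef>

// Effective throughput in GB/s (1 GB = 1e9 bytes) for one transfer of `bytes`,
// given an ASSUMED fixed per-transfer overhead and an ASSUMED peak rate.
// Hypothetical helper, for illustration only.
double effective_gbps(std::size_t bytes, double overhead_us, double peak_gbps) {
    assert(bytes > 0 && overhead_us >= 0.0 && peak_gbps > 0.0);
    // Time spent actually moving data at the peak rate.
    const double wire_s = static_cast<double>(bytes) / (peak_gbps * 1.0e9);
    // Total time adds the fixed cost paid once per transfer.
    const double total_s = overhead_us * 1.0e-6 + wire_s;
    return static_cast<double>(bytes) / total_s / 1.0e9;
}
```

With an assumed 10 microseconds of overhead and a 12 GB/s peak, a 64 KiB transfer works out to about 4.2 GB/s while an 8 MiB transfer reaches about 11.8 GB/s, which is the general shape of the curve: small copies are dominated by the fixed cost, large ones approach the peak.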

Yes, NUMA issues in dual-socket systems can reduce effective throughput. You would want each GPU to talk to the “near” CPU with its attached “near” memory to maximize throughput. Use numactl to bind CPUs and memory appropriately.

fayard · January 12, 2018, 11:20pm

I wrote my own program, in which I transfer 1 GB of memory. It is basically:

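#include <cstddef>
#include <cuda_runtime.h>

const std::size_t kilo = 1024;
const std::size_t n = kilo * kilo * kilo;  // 1 GiB
// Pageable host buffer (plain new[], not pinned)
unsigned char* p = new unsigned char[n];
for (std::size_t i = 0; i < n; ++i) {
    p[i] = 0;
}
// Device buffer: cudaMalloc returns an error code and writes the
// device pointer through its first argument
unsigned char* q = nullptr;
cudaMalloc((void**)&q, n);
// StartTimer
cudaMemcpy(q, p, n, cudaMemcpyHostToDevice);
// StopTimer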

I have just found the bandwidthTest program in the samples directory. It gives me 12 GB/s from host to device and 13 GB/s from device to host. So I guess everything is fine now, and I need to learn about all that pinned memory stuff.

Thanks for your help.
