I am new to CUDA programming, and my first tests are a few simple benchmarks. I am puzzled by the ones that copy data from host to device and from device to host: I only seem to get about 3 GB/s of bandwidth.
My hardware is as follows:
- Dell Precision 7820 tower
- Dual Xeon Skylake 6150, 18 cores per socket, 6 DIMMs of 2666 MHz DDR4 memory per socket, for a total of 12 x 8 = 96 GB
- Nvidia P1000
- Nvidia Titan V
The P1000 is used as the graphics card, with 2 screens connected to it; the Titan V is used for number crunching and is the one on which I only get 3 GB/s. As I am new to PCI devices, and I am the one who plugged in the Titan V, my guess is that I did something suboptimal here. My machine has 5 PCI slots, labeled as follows:
- Slot 1: PCIe3x16 (8, 4, 1)
- Slot 2: PCIe3x16 75W <---- P1000
- Slot 3: PCIe3x16 (1)
- Slot 4: PCIe3x16 75W
- Slot 5: PCIe3x16 (4, 1) <---- Titan V
The arrows show where my GPUs are currently plugged in. What should I do to get better PCIe bandwidth on the Titan V?
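
As a rough reference point for the slot labels above, here is a minimal sketch of the theoretical one-direction PCIe 3.0 bandwidth per link width. It assumes the numbers in parentheses are the lane counts a slot can actually run at, which the labels themselves do not confirm. The function name `pcie3_bandwidth_gb_s` is made up for illustration. The figures use the PCIe 3.0 rate of 8 GT/s per lane with 128b/130b encoding and ignore protocol overhead, so measured copy rates will be lower.

```python
# Theoretical one-direction PCIe 3.0 bandwidth for a given lane count.
# Assumption: the "(4, 1)" style slot labels are lane widths.

GT_PER_S = 8.0        # PCIe 3.0 raw transfer rate per lane, in GT/s
ENCODING = 128 / 130  # 128b/130b line encoding efficiency


def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Return the theoretical bandwidth in GB/s (10^9 bytes/s)."""
    # Gigabits per second per lane, times lanes, divided by 8 bits per byte.
    return GT_PER_S * ENCODING * lanes / 8


if __name__ == "__main__":
    for lanes in (1, 4, 8, 16):
        print(f"x{lanes:<2} -> {pcie3_bandwidth_gb_s(lanes):5.2f} GB/s")
```

Under that assumption, a link running at x4 tops out just under 4 GB/s, so a measured 3 GB/s would be in the range expected for x4 rather than x16.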