I currently have what I believe to be a PCIe 3.0-compatible motherboard, running Ubuntu:
as well as 8 Titans. All PCIe transfer diagnostics indicate that I'm still running at PCIe 2.0 speeds. Is there any way to force the system to use PCIe Gen 3?
This should work:
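Something along these lines, i.e. passing NVreg_EnablePCIeGen3=1 to the nvidia kernel module (the file name below is arbitrary, and this is just one way to apply the option):

sudo su
# tell the nvidia kernel module to negotiate PCIe Gen 3
echo "options nvidia-313 NVreg_EnablePCIeGen3=1" > /etc/modprobe.d/nvidia-pcie-gen3.conf
# rebuild the initramfs so the option is picked up at boot, then reboot
update-initramfs -u
reboot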
Replace ‘nvidia-313’ with the name of the nvidia module on your system. On Ubuntu it could be:
nvidia, nvidia-current, or nvidia-xxx (where xxx is the 3-digit version number). Try modinfo followed by each of those names; the one that prints the current module parameters is the module name you want.
Sweet!
With a bit of fiddling I got that working! Thanks!!!
Interesting, though, that it's only reaching about 68% of peak (10659.7 / 15750 MB/s, where 15750 MB/s is the theoretical PCIe 3.0 x16 rate):
dwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX TITAN
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10659.7
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10649.5
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 219901.0
Result = PASS
That’s quite similar to the performance I measured, around 11300 MB/s, on a Supermicro X9DRG (C602 chipset) and a Gigabyte GA-Z87X-OC (Z87 chipset), both with TITAN.
Looking deeper into it, it seems that CUDA isn't fully utilizing all of the PCIe lanes on my motherboard (I have 8 cards, each on 16 PCIe lanes).
Trying to do a ring of transfers: 0>1 1>2 2>3 3>4 4>5 5>6 6>7 7>0 results in:
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 1.24GB/s
A partial set of transfers, 0>1 2>3 4>5 6>7, gives:
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 2.33GB/s
Then 0>1 4>5 gives
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 5.51GB/s
(these are on 2 completely separate PCIe branches AND a separate CPU controls each transfer, so they're completely independent)
And 0>1 by itself
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 11.80GB/s
In theory it should be no different from the 0>1 1>0 transfer bandwidth of ~10.5 GB/s.
Edit: I've attached a diagram of the motherboard setup. It should be obvious from it that the 0>1 and 4>5 transfers have nothing to do with each other, and so should not slow each other down at all.
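For reference, here is a rough sketch of how concurrent cudaMemcpyPeer copies like these can be timed. It is not the exact benchmark used above; the pair list, the 256 MiB transfer size, and the wall-clock timing are placeholders.

// sketch_peer_bw.cu -- time a set of independent peer copies running concurrently
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    std::printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); std::exit(1); } } while (0)

int main() {
    const size_t bytes = 256ull << 20;              // 256 MiB per transfer (placeholder size)
    const int pairs[][2] = { {0, 1}, {4, 5} };      // source -> destination GPU pairs to run at once
    const int nPairs = sizeof(pairs) / sizeof(pairs[0]);

    std::vector<void*> src(nPairs), dst(nPairs);
    std::vector<cudaStream_t> streams(nPairs);

    for (int i = 0; i < nPairs; ++i) {
        int s = pairs[i][0], d = pairs[i][1];
        CHECK(cudaSetDevice(s));
        cudaDeviceEnablePeerAccess(d, 0);           // if P2P is unavailable the copy is staged via host
        cudaGetLastError();                         // clear any "already enabled" error
        CHECK(cudaMalloc(&src[i], bytes));
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&dst[i], bytes));
    }

    // Enqueue every copy before waiting on any of them, so the transfers overlap on the bus.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < nPairs; ++i) {
        CHECK(cudaSetDevice(pairs[i][0]));
        CHECK(cudaMemcpyPeerAsync(dst[i], pairs[i][1], src[i], pairs[i][0], bytes, streams[i]));
    }
    for (int i = 0; i < nPairs; ++i) {
        CHECK(cudaSetDevice(pairs[i][0]));
        CHECK(cudaStreamSynchronize(streams[i]));
    }
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("cudaMemcpyPeer bandwidth per gpu: %.2f GB/s\n", (bytes / 1e9) / sec);
    return 0;
}

Build with nvcc sketch_peer_bw.cu -o sketch_peer_bw; since each pair gets its own stream and all copies are enqueued before any synchronization, the per-GPU figure should only drop when the transfers actually share bus or host resources.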