PCIE3 on Titans

I currently have what I believe to be a PCIe 3.0 compatible motherboard, running Ubuntu:

As well as 8 Titans. All PCIe transfer diagnostics indicate that I'm still running at PCIe 2.0 speeds. Is there any way to force the system to use PCI Gen 3?
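
(For anyone wanting to check the same thing: lspci reports the negotiated link speed in its LnkSta line, where 5 GT/s means Gen 2 and 8 GT/s means Gen 3.)

sudo lspci -vv | grep -i "LnkSta:"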

This should work:

[url]https://devtalk.nvidia.com/default/topic/533200/linux/gtx-titan-drivers-for-linux-32-64-bit-release-/post/3753244/#3753244[/url]

Replace ‘nvidia-313’ with the name of the NVIDIA module on your system. For Ubuntu it could be:
nvidia, nvidia-current, or nvidia-xxx (where xxx is the three-digit version number). Try modinfo followed by each of those names; the one that prints the module's current parameters is the right name.
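
For anyone who doesn't want to dig through the link: as far as I recall it comes down to passing the NVreg_EnablePCIeGen3=1 option when the module loads. Something like this on Ubuntu (substitute your module name for 'nvidia'; the .conf file name is arbitrary):

echo "options nvidia NVreg_EnablePCIeGen3=1" | sudo tee /etc/modprobe.d/nvidia-pcie-gen3.conf
sudo update-initramfs -u

Then reboot and re-check the link speed. Note that the card drops the link down when idle to save power, so check it while something is actually running on the GPU.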

Sweet!
With a bit of fiddling I got that working! Thanks!!!

Interesting though that it's only reaching about 68% of peak (10659.7 MB/s out of the 15750 MB/s theoretical maximum for PCIe 3.0 x16, i.e. 16 lanes × 8 GT/s with 128b/130b encoding):

dwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX TITAN
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			10659.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			10649.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			219901.0

Result = PASS
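
In case it's useful for comparison, here's roughly what bandwidthTest is doing for the pinned host-to-device number, boiled down (my own re-write, so treat it as a sketch rather than the sample's actual code):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32u << 20;   // 32 MB, the transfer size the sample reports
    const int reps = 20;

    void *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, bytes);    // pinned (page-locked) host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to Device: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}

Pinned (cudaMallocHost) memory is what lets the copy run near full PCIe speed; with ordinary pageable memory the numbers drop a lot.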

That’s quite similar to the performance I measured, around 11300 MB/s, on a Supermicro X9DRG (C602 chipset) and a Gigabyte GA-Z87X-OC (Z87 chipset), both with TITAN.

Looking deeper into it, it seems that CUDA isn't fully utilizing all of the PCIe lanes on my motherboard (I have 8 cards, each with 16 PCIe lanes).

Trying to do a ring of transfers (0>1 1>2 2>3 3>4 4>5 5>6 6>7 7>0) results in:
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 1.24GB/s

A partial transfer 0>1 2>3 4>5 6>7 gives
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 2.33GB/s

Then 0>1 4>5 gives
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 5.51GB/s
(these are on two completely separate PCIe branches, AND a separate CPU controls each transfer, so they're completely independent)

And 0>1 by itself
cudaMemcpyPeer / cudaMemcpy bandwidth per gpu: 11.80GB/s

In theory it should be no different from the 0>1 1>0 transfer bandwidth of ~10.5 GB/s.
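
For concreteness, the test is essentially the following (a stripped-down sketch, not the exact code I ran): each pair gets its own stream on the source device, every copy is issued with cudaMemcpyPeerAsync, and bandwidth is bytes per pair over the wall-clock time for the whole batch, since CUDA events can only time a single device's stream.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    // Example pattern: the "partial" set 0>1 2>3 4>5 6>7. Swap in the ring,
    // the two-pair case or a single pair to try the other patterns above.
    const int srcDev[] = {0, 2, 4, 6};
    const int dstDev[] = {1, 3, 5, 7};
    const int nPairs   = 4;
    const size_t bytes = 256u << 20;   // 256 MB per copy
    const int reps     = 10;

    void*        src[8];
    void*        dst[8];
    cudaStream_t stream[8];

    for (int i = 0; i < nPairs; ++i) {
        cudaSetDevice(srcDev[i]);
        // Real code should check cudaDeviceCanAccessPeer first; without P2P the
        // copy still works but is staged through host memory.
        cudaDeviceEnablePeerAccess(dstDev[i], 0);
        cudaMalloc(&src[i], bytes);
        cudaStreamCreate(&stream[i]);
        cudaSetDevice(dstDev[i]);
        cudaMalloc(&dst[i], bytes);
    }

    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (int i = 0; i < nPairs; ++i) {
            cudaSetDevice(srcDev[i]);
            cudaMemcpyPeerAsync(dst[i], dstDev[i], src[i], srcDev[i], bytes, stream[i]);
        }
    for (int i = 0; i < nPairs; ++i) {
        cudaSetDevice(srcDev[i]);
        cudaStreamSynchronize(stream[i]);
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("bandwidth per pair: %.2f GB/s\n", (double)bytes * reps / sec / 1e9);
    return 0;
}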

Edit: I've attached a diagram of the motherboard setup. It should make it obvious that the 0>1 and 4>5 transfers have nothing to do with each other, and so should not be slowed down by one another at all.