BIOS usage of dual cards

I’m looking into building a high-GPU-count (probably 12-16) rig.
From what I can tell, the bottleneck with these high-card-count systems revolves around BIOS address space. In a 32-bit BIOS, there are not enough resources to allocate more than 8-9 cards, as each GPU takes up about 256MB of BIOS address space. Does a dual card then take up 2×256MB, or does a dual card only take up 1×256MB - essentially halving BIOS usage?

The BIOS has no knowledge of the concept of “cards”. It only sees devices in PCI config space, and attempts to allocate resources according to the requirements of each device. A “dual card” presents two devices in PCI config space. Although this is to some degree a function of the VBIOS on the card, generally speaking cards with two GPUs will consume twice as much resource as cards with a single GPU, all other things being equal. You can get an idea of this by using the following command in Linux:

lspci -vvvv -xxxx

and studying the output (for a properly configured GPU). Here is example output from a Titan Z card:

05:00.0 VGA compatible controller: NVIDIA Corporation Device 1001 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1078
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 128
Region 0: Memory at aa000000 (32-bit, non-prefetchable)
Region 1: Memory at a0000000 (64-bit, prefetchable)
Region 3: Memory at a8000000 (64-bit, prefetchable)
Region 5: I/O ports at 6000
[virtual] Expansion ROM at ab000000 [disabled]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee007f8 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb
00: de … truncated

05:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
Subsystem: NVIDIA Corporation Device 1078
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 36
Region 0: Memory at ab080000 (32-bit, non-prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Kernel driver in use: snd_hda_intel
Kernel modules: snd-hda-intel
00: de … truncated

06:00.0 3D controller: NVIDIA Corporation Device 1001 (rev a1)
Subsystem: NVIDIA Corporation Device 1078
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 131
Region 0: Memory at ac000000 (32-bit, non-prefetchable)
Region 1: Memory at 380ff0000000 (64-bit, prefetchable)
Region 3: Memory at 380ff8000000 (64-bit, prefetchable)
Region 5: I/O ports at 5000
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00b58 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #16, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 unlimited
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb
00: de …truncated

The device at PCI bus 05 (05:00.0) is one of the two GPUs. The device at bus 06 (06:00.0) is the other. (05:00.1 is the audio device built into the card; it can be ignored.)

If we compare the resources for devices 05:00.0 and 06:00.0:

05:00.0 VGA compatible controller: NVIDIA Corporation Device 1001 (rev a1) (prog-if 00 [VGA controller])

Interrupt: pin A routed to IRQ 128
Region 0: Memory at aa000000 (32-bit, non-prefetchable)
Region 1: Memory at a0000000 (64-bit, prefetchable)
Region 3: Memory at a8000000 (64-bit, prefetchable)

06:00.0 3D controller: NVIDIA Corporation Device 1001 (rev a1)

Interrupt: pin A routed to IRQ 131
Region 0: Memory at ac000000 (32-bit, non-prefetchable)
Region 1: Memory at 380ff0000000 (64-bit, prefetchable)
Region 3: Memory at 380ff8000000 (64-bit, prefetchable)

We see that the allocated resource requirements are basically the same between the two PCI devices.
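
The same footprint can also be checked without parsing lspci output: each device’s BARs appear as start/end/flags triples in /sys/bus/pci/devices/<address>/resource. Below is a minimal sketch of the size arithmetic, run against sample lines rather than a live system; the start addresses mirror the 05:00.0 dump above, while the end addresses are illustrative assumptions chosen to match a typical Kepler 16M/128M/32M BAR layout:

```python
# Sketch: compute BAR sizes from the start/end/flags triples found in
# /sys/bus/pci/devices/<bdf>/resource. The sample mimics the 05:00.0
# regions from the lspci dump above; the end addresses are assumed,
# not read from real hardware.
def bar_sizes(resource_text):
    """Return (start, size) for each populated BAR line."""
    sizes = []
    for line in resource_text.strip().splitlines():
        start, end, flags = (int(f, 16) for f in line.split())
        if start == 0 and end == 0:
            continue  # unused BAR slot
        sizes.append((start, end - start + 1))
    return sizes

sample = """\
0x00000000aa000000 0x00000000aaffffff 0x0000000000040200
0x00000000a0000000 0x00000000a7ffffff 0x000000000014220c
0x00000000a8000000 0x00000000a9ffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""
for start, size in bar_sizes(sample):
    print(f"BAR at {start:#x}: {size // (1 << 20)} MiB")
```

Summing these sizes across every GPU function in the system gives the total MMIO space the BIOS has to find room for.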


OK, thanks.
And do you have any clue as to whether Tesla cards (i.e., no video output) carry a smaller BIOS footprint than gamer cards?

It’s likely that they are larger. Which Tesla cards are you interested in? I have access to the server versions of the M2050, M2070, M2090, K10, K20, and K40, and I can post lspci dumps for you if you want. I don’t have convenient access to the workstation versions (e.g. K20c, K40c) at this time, which are probably more what you’d be interested in if you are building your own rig. I do have access to a C2075, but that’s probably not what you’re interested in, either.

Here’s output from a K40m (server version):

84:00.0 3D controller: NVIDIA Corporation Device 1023 (rev a1)
Subsystem: NVIDIA Corporation Device 097e
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 138
Region 0: Memory at fa000000 (32-bit, non-prefetchable)
Region 1: Memory at 3c1800000000 (64-bit, prefetchable)
Region 3: Memory at 3c1c00000000 (64-bit, prefetchable)

Note that Region 1 is gigabytes in size, not megabytes (the K40 has 12GB of on-board memory). To use these devices effectively, you need a system BIOS that is capable of mapping resources above the 32-bit (4GB) address boundary. UEFI BIOSes should certainly be able to do that in principle. But I’m not saying I know how to build a 12- or 16-device rig, nor is it simply a matter of having any UEFI BIOS. I expect the K40c may look a little different in this respect; however, I don’t have one conveniently available to check.
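
The arithmetic behind that remark can be checked directly from the addresses in the dump above (a quick sanity check, not a measurement):

```python
# Addresses taken from the K40m lspci dump above.
bar1 = 0x3C1800000000  # Region 1 (64-bit, prefetchable)
bar3 = 0x3C1C00000000  # Region 3 (64-bit, prefetchable)
four_gib = 1 << 32

# BAR 1 starts well above the 4 GiB boundary, so a BIOS limited to
# 32-bit MMIO addresses cannot place it at all.
print(bar1 > four_gib)                    # True

# The spacing between the two regions bounds BAR 1 at 16 GiB --
# gigabytes, versus the ~128 MiB BAR 1 of a consumer Kepler card.
print((bar3 - bar1) // (1 << 30), "GiB")  # 16 GiB
```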

I expect to have access to a K40c in a few weeks.

For comparison purposes, below is the information from a K20c in an old HP xw8600 workstation running 64-bit RHEL. I have a C2050 that I could swap in if that information would somehow be helpful.

80:00.0 3D controller: nVidia Corporation Unknown device 1022 (rev a1)
Subsystem: nVidia Corporation Unknown device 0982
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 146
Region 0: Memory at d2000000 (32-bit, non-prefetchable)
Region 1: Memory at c0000000 (64-bit, prefetchable)
Region 3: Memory at d0000000 (64-bit, prefetchable)

Very confusing.
Why does it need 64× the memory? Does that mean this card could only possibly run under UEFI? I don’t really have much experience in this area.

njuffa, that won’t be necessary.
I have tested with a multitude of cards (including a C2070 and a 2090), and they all yield 128/256MB.

Is there anything exceptional that needs to be done when enabling/configuring UEFI to make it compatible with CUDA cards?

Current status: I have managed to get 8 K10s working (but the cards aren’t mine), but the 16 Titan Blacks won’t work - they start becoming unstable at about 10 cards. That’s why I opened this thread: I couldn’t understand why 8 K10s would work but 16 Titans wouldn’t. In any case, I can’t use the 8 K10s, because my workload for this rig will be double precision.

Would there be any reason why 8× Titan Z might work in place of the 16 Titan Blacks?
Would there happen to be anyone at NVIDIA I can get in touch with to help push through the configuration stage?

Here’s output for 8 K10s with 16 parallel copies vs. 16 serial copies (thanks, njuffa, for your previous help!!)

memtest.cpp(97):Gpu access 0 -> 1 peer enabled? True
memtest.cpp(97):Gpu access 1 -> 2 peer enabled? True
memtest.cpp(97):Gpu access 2 -> 3 peer enabled? True
memtest.cpp(97):Gpu access 3 -> 4 peer enabled? True
memtest.cpp(97):Gpu access 4 -> 5 peer enabled? True
memtest.cpp(97):Gpu access 5 -> 6 peer enabled? True
memtest.cpp(97):Gpu access 6 -> 7 peer enabled? True
memtest.cpp(97):Gpu access 7 -> 8 peer enabled? True
memtest.cpp(97):Gpu access 8 -> 9 peer enabled? True
memtest.cpp(97):Gpu access 9 -> 10 peer enabled? True
memtest.cpp(97):Gpu access 10 -> 11 peer enabled? True
memtest.cpp(97):Gpu access 11 -> 12 peer enabled? True
memtest.cpp(97):Gpu access 12 -> 13 peer enabled? True
memtest.cpp(97):Gpu access 13 -> 14 peer enabled? True
memtest.cpp(97):Gpu access 14 -> 15 peer enabled? True
memtest.cpp(97):Gpu access 15 -> 0 peer enabled? True

memtest.cpp(133):One-at-a-time

memtest.cpp(142):Gpu 0 time taken : 0.090623 seconds
memtest.cpp(142):Gpu 1 time taken : 0.091742 seconds
memtest.cpp(142):Gpu 2 time taken : 0.090622 seconds
memtest.cpp(142):Gpu 3 time taken : 0.091297 seconds
memtest.cpp(142):Gpu 4 time taken : 0.090615 seconds
memtest.cpp(142):Gpu 5 time taken : 0.096312 seconds
memtest.cpp(142):Gpu 6 time taken : 0.090643 seconds
memtest.cpp(142):Gpu 7 time taken : 0.105096 seconds
memtest.cpp(142):Gpu 8 time taken : 0.090598 seconds
memtest.cpp(142):Gpu 9 time taken : 0.099374 seconds
memtest.cpp(142):Gpu 10 time taken : 0.090636 seconds
memtest.cpp(142):Gpu 11 time taken : 0.090867 seconds
memtest.cpp(142):Gpu 12 time taken : 0.090605 seconds
memtest.cpp(142):Gpu 13 time taken : 0.093097 seconds
memtest.cpp(142):Gpu 14 time taken : 0.092471 seconds
memtest.cpp(142):Gpu 15 time taken : 0.112878 seconds

memtest.cpp(146):Total time taken : 1.508611 seconds, bandwidth = 10.605782 GB/s, bandwidth per gpu : 10.605789 GB/s

memtest.cpp(149):Parallel

memtest.cpp(154):Total time taken : 0.100144 seconds, bandwidth = 159.769931 GB/s, bandwidth per gpu : 9.985621 GB/s

To my limited knowledge, getting many GPUs to run in a single system is largely a matter of system configuration, so you may want to discuss this with your system vendor. There have been various reports in these forums of users constructing systems with a large number of GPUs. Maybe some of them will chime in here to provide advice, or you could consider contacting them by PM through the forums. The most recent record I am aware of is this report of a system with 18 GPUs:

https://devtalk.nvidia.com/default/topic/649542/18-gpus-in-a-single-rig-and-it-works/

I am curious as to what use cases can benefit from a very high GPU count in a single system.

Actually, I have been basing my build on that thread (which indeed you had linked me to a few months ago :) ). I find it very strange, though, that 8 dual cards in the form of K10s using exactly the same hardware (with 2 PCIe expansion boxes) work, but 16 Titan Blacks (using 4 PCIe expansion boxes) fail.

The motivation for the 16-GPU box is that a cross-GPU reduce+broadcast of an array (typically of length about 32×10^6), occurring twice per iteration, currently takes up about 40% of processing time (that’s using 8 Titans) - so switching to any form of clustering, whereby cross-GPU communication has to take place over InfiniBand etc., will incur a huge performance penalty, unless I’m mistaken. If this is not the case, I’d love to hear about it - it’d make my job about a hundred times easier!
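
For scale, the traffic behind that 40% figure can be ballparked. The sketch below assumes 8-byte double-precision elements (in line with the double-precision workload mentioned earlier) and a naive reduce/broadcast that moves one array copy per GPU per collective; the real communication scheme may well differ:

```python
# Rough traffic estimate for the reduce+broadcast step described above.
# The array length comes from the post (~32e6 elements); the 8-byte
# element size is an assumption based on the double-precision workload.
n_elems = 32_000_000
elem_bytes = 8
n_gpus = 16
array_bytes = n_elems * elem_bytes        # 256 MB per array copy

# A naive reduce pulls roughly one array copy from each GPU, and a
# naive broadcast pushes roughly one copy back out, so each of the two
# collectives per iteration moves on the order of n_gpus copies.
per_collective = n_gpus * array_bytes
per_iteration = 2 * per_collective
print(per_iteration / 1e9, "GB per iteration")  # 8.192 GB per iteration
```

At that volume, the gap between PCIe peer-to-peer bandwidth and an interconnect hop per transfer adds up quickly, which is consistent with wanting everything in one box.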

Sorry, I must have forgotten the previous interaction; I didn’t mean to go in circles. Based on your use case, it seems like dual-GPU modules like the K10 should generally be superior to using twice the number of single-GPU modules, due to the effects of PCIe bandwidth sharing. I have no insight into issues that may arise from the use of PCIe expansion boxes, as I have never used one.

I would use K10s except for the fact that the workload is double precision. I’m looking at possibly acquiring Titan Zs, but I’m trying to narrow down why it’s not working, and it’s hard to justify the purchase unless I know for a fact that it would work.

All I know is:
16 GPUs’ worth of K10s (i.e., 8 K10s) works.
16 GPUs’ worth of Titan Blacks doesn’t.

So, as far as I can tell, the problem is either:

  1. A difference between Teslas and gamer GPUs
  2. A difference between dual cards and single cards

Is there any difference between the Tesla line and the gamer (Titan) line that MAY explain the difference in outcomes?

Same question for dual GPUs vs. single GPUs?

I’m really at a loss here!!!

Max power consumption of a K10 is 225W; that’s up to 1800W of GPU power (8 × 225W). Max power consumption of a Titan Black is 250W; that’s up to 4000W (16 × 250W). Max power consumption of a Titan Z is 375W; that’s up to 3000W (8 × 375W).

8 K10s have a total memory of 64 GB.
16 Titan Blacks have 96 GB.

Assuming well-built systems and identical mainboards/BIOSes, address-space usage may be what makes the difference.

You may possibly be interested in this webinar, particularly at around 27 minutes in, where it starts to talk about PCI BAR management on GPUs.

http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-gpudirect-webinar.mp4

Thanks for the help, guys.
This is where I’m up to:
The BIOS reports insufficient PCI resources and suggests enabling “Above 4G Decoding”, even though that option is already enabled. Funnily enough, this error does NOT occur when the 8 K10s are plugged in.

Sounds like your motherboard BIOS is not up to the task you are asking of it. If you cannot get it to allocate resources properly, there isn’t anything that can be done on the GPU side (other than removing GPUs). You would need to pursue it with the motherboard vendor, who has control over the BIOS, or else find another motherboard.
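
One way to narrow down which device or region the allocation gives up on is the kernel log: when Linux cannot place a BAR, it prints “no space for” / “failed to assign” messages naming the device and resource. Here is a small sketch of that filtering, run against sample log lines (the devices and sizes in the sample are illustrative; on a real system you would feed it `dmesg` output):

```python
import re

# Filter kernel-log text for PCI BAR allocation failures.
FAIL_RE = re.compile(
    r"pci (\S+): BAR (\d+): (?:no space for|failed to assign) (.*)")

def bar_failures(log_text):
    """Return (device, bar, resource) tuples for each failure line."""
    return [m.groups() for line in log_text.splitlines()
            if (m := FAIL_RE.search(line))]

# Illustrative sample lines; real addresses and sizes will differ.
sample_log = """\
[    1.234567] pci 0000:05:00.0: BAR 1: assigned [mem 0xa0000000-0xa7ffffff 64bit pref]
[    1.234890] pci 0000:0b:00.0: BAR 1: no space for [mem size 0x08000000 64bit pref]
[    1.235012] pci 0000:0b:00.0: BAR 1: failed to assign [mem size 0x08000000 64bit pref]
"""
for dev, bar, res in bar_failures(sample_log):
    print(f"{dev} BAR {bar}: {res}")
```

Seeing which BARs fail (32-bit vs. 64-bit prefetchable, and at what size) can tell you whether the board is running out of 32-bit MMIO space specifically.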

Just an update:
I now have setups with 8, 12, and 16 cards booting; however, PCIe performance really starts to struggle (even in one-at-a-time transfers), as below:

Computer has 8 cards plugged in:

memtest.cpp(133):One-at-a-time
memtest.cpp(142):Gpu 0 time taken : 0.085042 seconds
memtest.cpp(142):Gpu 1 time taken : 0.085025 seconds
memtest.cpp(142):Gpu 2 time taken : 0.085029 seconds
memtest.cpp(142):Gpu 3 time taken : 0.105682 seconds
memtest.cpp(142):Gpu 4 time taken : 0.085568 seconds
memtest.cpp(142):Gpu 5 time taken : 0.085031 seconds
memtest.cpp(142):Gpu 6 time taken : 0.085027 seconds
memtest.cpp(142):Gpu 7 time taken : 0.106266 seconds
memtest.cpp(146):Total time taken : 0.722748 seconds, bandwidth = 11.068865 GB/s, bandwidth per gpu : 11.068865 GB/s
memtest.cpp(149):Parallel
memtest.cpp(154):Total time taken : 0.115923 seconds, bandwidth = 69.011326 GB/s, bandwidth per gpu : 8.626416 GB/s

12 cards plugged in:

memtest.cpp(133):One-at-a-time
memtest.cpp(142):Gpu 0 time taken : 0.160168 seconds
memtest.cpp(142):Gpu 1 time taken : 0.160743 seconds
memtest.cpp(142):Gpu 2 time taken : 0.159879 seconds
memtest.cpp(142):Gpu 3 time taken : 0.160694 seconds
memtest.cpp(142):Gpu 4 time taken : 0.160585 seconds
memtest.cpp(142):Gpu 5 time taken : 0.160118 seconds
memtest.cpp(142):Gpu 6 time taken : 0.159880 seconds
memtest.cpp(142):Gpu 7 time taken : 0.160816 seconds
memtest.cpp(142):Gpu 8 time taken : 0.084995 seconds
memtest.cpp(142):Gpu 9 time taken : 0.085006 seconds
memtest.cpp(142):Gpu 10 time taken : 0.159877 seconds
memtest.cpp(142):Gpu 11 time taken : 0.161290 seconds
memtest.cpp(146):Total time taken : 1.774759 seconds, bandwidth = 6.761481 GB/s, bandwidth per gpu : 6.761485 GB/s
memtest.cpp(149):Parallel
memtest.cpp(154):Total time taken : 0.163123 seconds, bandwidth = 73.564120 GB/s, bandwidth per gpu : 6.130343 GB/s

16 cards plugged in:

memtest.cpp(133):One-at-a-time
memtest.cpp(142):Gpu 0 time taken : 0.160282 seconds
memtest.cpp(142):Gpu 1 time taken : 0.160721 seconds
memtest.cpp(142):Gpu 2 time taken : 0.160046 seconds
memtest.cpp(142):Gpu 3 time taken : 1.504637 seconds
memtest.cpp(142):Gpu 4 time taken : 0.159973 seconds
memtest.cpp(142):Gpu 5 time taken : 0.160622 seconds
memtest.cpp(142):Gpu 6 time taken : 0.160677 seconds
memtest.cpp(142):Gpu 7 time taken : 0.160211 seconds
memtest.cpp(142):Gpu 8 time taken : 0.160718 seconds
memtest.cpp(142):Gpu 9 time taken : 0.160511 seconds
memtest.cpp(142):Gpu 10 time taken : 0.159971 seconds
memtest.cpp(142):Gpu 11 time taken : 0.185190 seconds
memtest.cpp(142):Gpu 12 time taken : 0.160282 seconds
memtest.cpp(142):Gpu 13 time taken : 0.160577 seconds
memtest.cpp(142):Gpu 14 time taken : 0.160278 seconds
memtest.cpp(142):Gpu 15 time taken : 0.754087 seconds
memtest.cpp(146):Total time taken : 4.530811 seconds, bandwidth = 3.531378 GB/s, bandwidth per gpu : 3.531381 GB/s
memtest.cpp(149):Parallel
memtest.cpp(154):Total time taken : 1.187767 seconds, bandwidth = 13.470667 GB/s, bandwidth per gpu : 0.841917 GB/s

So it definitely is possible to get 16 cards working, but something strange appears to go on in the background: the mere fact that 14 other GPUs are connected can still influence a card’s transfer time, even when nothing else is happening at that point.
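
As a sanity check on the figures above, the quoted totals are internally consistent: multiplying total bandwidth by total time recovers the same per-GPU data volume in every run, about 1 GB (inferred from the printed numbers; the test program’s actual payload isn’t shown):

```python
# Consistency check on the memtest totals quoted in this thread:
# total_bandwidth * total_time should recover the data volume moved,
# and the implied per-GPU payload should be the same in every run.
runs = [
    # (gpu_count, total_seconds, total_GB_per_second),
    # from the one-at-a-time totals quoted above
    (16, 1.508611, 10.605782),  # 8x K10 (16 GPUs)
    (8,  0.722748, 11.068865),  # 8 cards
    (12, 1.774759, 6.761481),   # 12 cards
    (16, 4.530811, 3.531378),   # 16 cards
]
for gpus, secs, bw in runs:
    total_gb = secs * bw
    # Each run works out to about 1 GB per GPU.
    print(f"{gpus} GPUs: {total_gb:.2f} GB total, "
          f"{total_gb / gpus:.2f} GB per GPU")
```

Since the payload is constant, the falling per-GPU bandwidth with 12 and 16 cards really does reflect slower individual transfers, not a change in how much data was moved.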