yes, could be the same issue … I’ll try 370 too and see what happens … nope, didn’t help here.
I installed the 370.23 driver and rebuilt with both CUDA 7.5 and CUDA 8.
The CUDA 8 build gave this:
1 Devices used for simulation
= 7440.199 single-precision GFLOP/s at 20 flops per interaction
2 Devices used for simulation
= 4721.444 single-precision GFLOP/s at 20 flops per interaction
3 Devices used for simulation
= 7935.839 single-precision GFLOP/s at 20 flops per interaction
4 Devices used for simulation
= 11814.144 single-precision GFLOP/s at 20 flops per interaction
Whenever dual-socket systems are used, make sure to carefully control memory and CPU affinity (e.g. with numactl), so each GPU talks to the “near” CPU and “near” system memory.
Note that not all CPUs can provide the >= 32 PCIe lanes that are needed to drive two GPUs from one socket at full PCIe gen3 x16 rates. What CPU(s) are you using?
Depending on the intensity of memory traffic between the GPUs and the system, system memory could become the bottleneck, so it would probably be best to use a fast DDR4 configuration. Two Titan Xs coupled to one CPU socket could generate up to 50 GB/sec of memory traffic when operating at maximum full-duplex throughput (two GPUs × two directions × roughly 12.5 GB/sec effective per direction on a gen3 x16 link). I am speaking theoretically here, as I don’t have hands-on experience with dual-socket machines with four Titan Xs.
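One way to follow the affinity advice above, sketched under the assumption that GPUs 0–1 hang off socket 0 and GPUs 2–3 off socket 1 (verify the real mapping with nvidia-smi topo -m): pin each single-GPU process to its near NUMA node with numactl. The loop below only prints the launch lines, since actually running them requires the GPUs to be present; the GPU-to-socket mapping is hypothetical.

```shell
# Sketch: bind the process driving each GPU to its "near" NUMA node.
# Assumption: GPUs 0-1 on socket 0, GPUs 2-3 on socket 1 -- check with
#   nvidia-smi topo -m
# before trusting this mapping on your board.
for gpu in 0 1 2 3; do
  node=$(( gpu / 2 ))   # hypothetical GPU -> socket mapping
  echo "CUDA_VISIBLE_DEVICES=$gpu numactl --cpunodebind=$node --membind=$node ./nbody -benchmark"
done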
…yes, … but this is an ASUS D8 WS board with full x16 support for 4 cards, 2x E5-2690 v4 CPUs, etc…
This simple sample code should work fine. I mainly wanted to see if there was some issue specific to the ASUS X99-E WS board, since it is sometimes problematic (and it is the default DIGITS box motherboard).
The biggest puzzle is this: Why is scaling so bad with 2 TitanX Pascal cards?
I have tried 1-, 2- and 4-card setups on good single- and dual-socket motherboards with GTX 1070, 1080 and Titan X (Pascal) cards. The 1070 and 1080 scale as expected, but the Titan X is very odd and inconsistent.
I’m mostly putting this out there so people can find it if they are running into this. Hopefully someone better versed and with more time than me will have some enlightening info.
I definitely will be doing more testing! There are a lot of people ordering systems right now for a variety of “machine learning” tasks. They have been waiting for the Titan X and are ordering 4-card setups … and doing important work! I am very concerned!
I will try to keep this thread alive with more info. Any comments are appreciated. Thanks! --Don
Thanks for the links! The NCCL stuff looks very interesting; I’ll try the NCCL benchmark.
I will try the p2pBandwidthLatencyTest right now …
(This system (X99-E WS) may have a problem on pciBusID 9; I have had “GPU has fallen off the bus” for this ID … doing hardware debugging. That is unfortunately complicating the scaling issue, which seems to be consistent across another X99-E WS board and the Z10PE-D9 WS.)
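For anyone debugging a similar “fallen off the bus” situation, a couple of triage commands that don’t need a reboot; this is a sketch assuming the standard NVIDIA driver tools are installed, and both lines are guarded so they degrade gracefully on a machine without them.

```shell
# PCIe topology: which GPU pairs share a switch/socket (skipped if tool absent).
if command -v nvidia-smi >/dev/null; then
  nvidia-smi topo -m
fi

# Driver Xid errors in the kernel log often accompany "GPU has fallen off the bus".
xid_count=$( (dmesg 2>/dev/null || true) | grep -ci xid || true )
echo "Xid messages in dmesg: $xid_count"
```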
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN X (Pascal), pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN X (Pascal), pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN X (Pascal), pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN X (Pascal), pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.
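For judging the bandwidth numbers this test prints, a quick back-of-the-envelope ceiling for one gen3 x16 link (8 GT/s per lane, 16 lanes, 128b/130b encoding; real transfers land noticeably lower than this):

```shell
# Theoretical per-direction ceiling of a PCIe gen3 x16 link, in GB/s:
# 8 GT/s/lane * 16 lanes / 8 bits-per-byte * 128/130 encoding efficiency
peak=$(awk 'BEGIN { printf "%.2f", 8 * 16 / 8 * 128 / 130 }')
echo "PCIe gen3 x16 theoretical ceiling: about $peak GB/s each direction"
```

P2P copies over a shared PCIe switch should approach this figure; the memcpy fallback mentioned in the note above stages through system memory and comes in lower.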