P2P error in the driver makes the kernel unstable

Hi, I’m working on an RH7 node with 8 K20s, using the SDK RC 7.5. When I run the official P2P sample (p2pBandwidthLatencyTest), the process freezes during the P2P communication. At that point (after waiting 10 min), if I kill the process, the system becomes unstable and I have to reboot. What is wrong? What can I do?

Here is the output of the sample:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla K20m, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla K20m, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla K20m, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla K20m, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla K20m, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device: 5, Tesla K20m, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 6, Tesla K20m, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 7, Tesla K20m, pciBusID: 88, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CANNOT Access Peer Device=4
Device=0 CANNOT Access Peer Device=5
Device=0 CANNOT Access Peer Device=6
Device=0 CANNOT Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CANNOT Access Peer Device=4
Device=1 CANNOT Access Peer Device=5
Device=1 CANNOT Access Peer Device=6
Device=1 CANNOT Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CANNOT Access Peer Device=4
Device=2 CANNOT Access Peer Device=5
Device=2 CANNOT Access Peer Device=6
Device=2 CANNOT Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CANNOT Access Peer Device=4
Device=3 CANNOT Access Peer Device=5
Device=3 CANNOT Access Peer Device=6
Device=3 CANNOT Access Peer Device=7
Device=4 CANNOT Access Peer Device=0
Device=4 CANNOT Access Peer Device=1
Device=4 CANNOT Access Peer Device=2
Device=4 CANNOT Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CANNOT Access Peer Device=0
Device=5 CANNOT Access Peer Device=1
Device=5 CANNOT Access Peer Device=2
Device=5 CANNOT Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CANNOT Access Peer Device=0
Device=6 CANNOT Access Peer Device=1
Device=6 CANNOT Access Peer Device=2
Device=6 CANNOT Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CANNOT Access Peer Device=0
Device=7 CANNOT Access Peer Device=1
Device=7 CANNOT Access Peer Device=2
Device=7 CANNOT Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Cliques:
[0 1 2 3]
[4 5 6 7]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0  74.90   3.53   4.46   4.51   4.65   4.75   4.78   4.78
     1   3.54  73.09   3.81   3.81   4.83   4.85   4.83   4.84
     2   6.03   6.02  74.30   6.02   5.22   5.16   5.16   5.14
     3   5.72   5.81   5.77  74.37   4.68   4.68   4.67   4.68
     4   4.96   4.92   4.95   4.90  74.39   3.32   3.59   3.55
     5   4.62   4.62   4.61   4.62   5.35  73.22   5.45   5.56
     6   4.79   4.77   4.79   4.77   3.81   3.77  74.25   2.90
     7   4.94   5.08   5.13   5.08   3.59   3.62   3.44  74.34
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)

Here is the topology

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity
GPU0     X      PIX     PHB     PHB     SOC     SOC     SOC     SOC     0-5,12-17
GPU1    PIX      X      PHB     PHB     SOC     SOC     SOC     SOC     0-5,12-17
GPU2    PHB     PHB      X      PIX     SOC     SOC     SOC     SOC     0-5,12-17
GPU3    PHB     PHB     PIX      X      SOC     SOC     SOC     SOC     0-5,12-17
GPU4    SOC     SOC     SOC     SOC      X      PIX     PHB     PHB     6-11,18-23
GPU5    SOC     SOC     SOC     SOC     PIX      X      PHB     PHB     6-11,18-23
GPU6    SOC     SOC     SOC     SOC     PHB     PHB      X      PIX     6-11,18-23
GPU7    SOC     SOC     SOC     SOC     PHB     PHB     PIX      X      6-11,18-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
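
One way to cross-check the two listings: in the sample output above, the pairs reported as CANNOT access are exactly the pairs whose path is SOC in the topology matrix. Below is a minimal Python sketch (not part of the CUDA samples) that parses both listings and returns any pair reported as CAN access even though its path traverses a socket-level link. The function names are made up for illustration, and it assumes the text formats shown in this thread.

```python
import re

# Link labels taken from the nvidia-smi legend quoted in this thread.
LINK_LABELS = {"X", "SOC", "PHB", "PXB", "PIX"}


def parse_peer_access(text):
    """Map (src, dst) -> True/False from 'Device=a CAN/CANNOT Access Peer Device=b' lines."""
    pattern = re.compile(r"Device=(\d+) (CAN|CANNOT) Access Peer Device=(\d+)")
    return {(int(a), int(b)): verdict == "CAN"
            for a, verdict, b in pattern.findall(text)}


def parse_topology(text):
    """Map (src, dst) -> link label from the rows of a topology matrix."""
    links = {}
    for line in text.splitlines():
        fields = line.split()
        match = re.fullmatch(r"GPU(\d+)", fields[0]) if fields else None
        if not match:
            continue  # legend lines, blank lines, anything that is not a GPU row
        src = int(match.group(1))
        for dst, label in enumerate(fields[1:]):
            if label not in LINK_LABELS:
                break  # header row, or the trailing CPU affinity column
            links[(src, dst)] = label
    return links


def peer_access_over_socket_link(access, links):
    """Return sorted (src, dst) pairs reported as CAN access whose path is SOC."""
    return sorted(pair for pair, ok in access.items()
                  if ok and links.get(pair) == "SOC")
```

To use it, save the output of p2pBandwidthLatencyTest and of nvidia-smi topo -m to text, pass them through the two parsers, and hand the results to peer_access_over_socket_link. An empty list means peer access is only reported between GPUs on the same socket.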

It’s possible that the PCIe switches in your system are not properly set up for P2P communication. For P2P to work correctly, specific settings are needed in the PLX bridge chips.

Try checking with your system manufacturer for a BIOS update.

How can I tell whether I have that PCIe firmware version problem?

This is weird. I have downgraded the SDK and the driver to 7.0. I have the same problem, but the sample now reports a different P2P communication capability:

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla K20m, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla K20m, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla K20m, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla K20m, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla K20m, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device: 5, Tesla K20m, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 6, Tesla K20m, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 7, Tesla K20m, pciBusID: 88, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Cliques:
[0 1 2 3 4 5 6 7]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0  74.55   5.56   5.58   5.58   5.00   5.00   5.00   4.97
     1   5.62  74.39   5.60   5.62   5.23   5.24   5.26   5.25
     2   5.60   5.60  74.38   5.61   5.19   5.33   5.21   5.26
     3   5.39   5.40   5.38  74.62   5.00   5.03   5.02   5.04
     4   4.79   4.77   4.76   4.73  74.51   3.59   4.01   4.00
     5   4.86   4.90   4.83   4.80   4.42  74.44   4.74   4.79
     6   5.06   4.98   4.90   4.93   4.19   4.22  74.23   3.81
     7   4.68   4.71   4.75   4.74   6.04   6.03   6.01  74.60
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)