I am trying to set up a Tesla K40 environment for Deep Learning. The server that I am working on has one x16 PCIe slot. I am using a riser cable to connect the GPU to the PCIe slot.
The problem is that the GPU is only getting detected at certain times. The connection is not stable.
When I run lspci | grep -i nvidia and the GPU is listed, it is no longer detected once I restart the server. Further attempts with lspci | grep -i nvidia, after many reboots and after swapping riser cables and power cables, mostly fail.
The riser cable works fine with all the other Quadro GPUs that I tested.
I checked and enabled the 4G setting in the BIOS. The GPU is still not detected reliably.
I am using a Cooler Master 750G2 to power the GPU through both the 8-pin and the 6-pin connectors. That should be enough power for the GPU, so I don’t think this is a power-related issue. The server itself is powered by an entirely separate PSU.
My initial doubts are these:
Riser cable damaged?
My guess: No. The cables work fine with the other Quadro cards.
PCIe Gen 3?
My understanding: The server’s PCIe x16 slot is definitely PCIe Gen 3. I wonder whether the riser cable plays a part in this.
A way to make sure the GPU gets enough power?
I do not know how to test whether the GPU gets enough power. Any help in this regard will be appreciated.
Is there anything that I am missing? Could you shed some light on this?
Is it a K40m or a K40c? Meaning, does the K40 have a fan on it (K40c) or not (K40m)?
If it is a K40m (no fan) then the GPU may be overheating.
In either case, the link may be attempting to train at PCIe Gen3, and the riser cable may not support that data rate. If your other Quadro cards happen to be training at Gen2 or Gen1 speeds, the cable may be OK for those but not for Gen3 speeds.
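One way to check the link-training theory on Linux is to compare the negotiated link speed with the device's maximum on a boot where the card is detected. The following is only a sketch: it assumes a kernel that exposes the current_link_speed, max_link_speed, current_link_width and max_link_width attributes under /sys/bus/pci/devices, and the function names are made up for illustration. Some GPUs lower the link speed when idle, so a reading taken under load is more meaningful.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_attr(dev_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return (Path(dev_dir) / name).read_text().strip()
    except OSError:
        return None


def nvidia_link_report(sysfs_root="/sys/bus/pci/devices"):
    """Collect current vs. maximum PCIe link speed and width per NVIDIA device."""
    report = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return report
    for dev in sorted(root.iterdir()):
        # Skip every PCI function that does not belong to NVIDIA.
        if read_attr(dev, "vendor") != NVIDIA_VENDOR_ID:
            continue
        report.append({
            "address": dev.name,
            "current_speed": read_attr(dev, "current_link_speed"),
            "max_speed": read_attr(dev, "max_link_speed"),
            "current_width": read_attr(dev, "current_link_width"),
            "max_width": read_attr(dev, "max_link_width"),
        })
    return report


if __name__ == "__main__":
    devices = nvidia_link_report()
    if not devices:
        print("No NVIDIA PCI device enumerated on this boot")
    for d in devices:
        print(d)
```

If a Quadro card behind the same riser reports a lower current speed than the K40 attempts, that would be consistent with the cable only coping with Gen1/Gen2 rates. An empty result simply means the card was not enumerated on that boot.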
Thank you for the quick reply!
It is the actively cooled version (K40c), and the GPU sits in open space at a constant 70 °F.
I agree with you on the cable; I certainly suspect it. Do you personally use or know of any riser cables that can support Tesla cards?
I have never used such a riser cable, but I have seen reports of riser cables causing unreliable operation with GPUs, including the K40 in particular.
(1) High-speed interconnects, such as PCIe gen 3, tend to suffer from signal integrity issues as physical connections get longer. Also, cables can act as antennas picking up electrical noise from electro-magnetic emitters. So you would want your cable (a) as short as possible, (b) routed as straight as possible, (c) shielded if possible.
(2) PCIe devices, by design and in compliance with the PCIe specification, draw up to 75 watts through the PCIe slot, and it is possible that a riser cable interferes with that. There are so-called “powered” riser cables designed to address that, you may want to research that (I have no practical experience and don’t know whether powered riser cables actually provide an advantage).
(3) There are enthusiasts who build jury-rigged PCs with many GPUs, for the mining of crypto coins for example. You may be able to get good advice from people with practical experience on forums frequented by that crowd (sorry, I wouldn’t know what they are).
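On the question of checking power, a rough software-side check is possible on boots where the GPU is detected and the NVIDIA driver is loaded. The sketch below queries nvidia-smi for board power draw, power limit, temperature and PCIe link state; the helper names are hypothetical. Note the limits: this reports what the board measures overall, not how much arrives through the slot versus the auxiliary connectors, and fields the board does not support come back as placeholder text rather than numbers.

```python
import csv
import io
import subprocess

# Fields passed to nvidia-smi --query-gpu
QUERY_FIELDS = [
    "name",
    "power.draw",
    "power.limit",
    "temperature.gpu",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
]


def parse_smi_csv(text, fields=QUERY_FIELDS):
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        values = [v.strip() for v in raw]
        if len(values) != len(fields):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(fields, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once; return [] if it is missing, fails, or times out."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_smi_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("nvidia-smi returned no GPUs (not detected or driver not loaded)")
    for gpu in gpus:
        print(gpu)
```

Running this repeatedly while a compute load is active would show whether the board stays up and whether the link generation matches its maximum. It cannot prove the riser delivers the slot power correctly; for that, measuring at the connector is the only direct test.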
Thank you @njuffa. I will definitely check those out.