I am attempting to create compressed Chia plots and the GPU keeps falling off the bus.
nvidia-bug-report.log.gz (215.8 KB)
Check whether (1) the power supply to the GPU is insufficient, (2) cooling is insufficient (the problem could be dust clogging the fan/heat sink assembly), or (3) the GPU is not properly seated and secured in the PCIe slot.
If this is a GPU inside a laptop, double-check power-saving settings. In my experience, some of the power-saving mechanisms can lead to system instability. Are you using an HDMI dongle with this GPU, by any chance?
Along the lines of njuffa’s reply, have you tried stressing the GPU with other programs? Run FurMark or something similar for a while and see if that also causes a crash.
The GPU is in a 4U server case with a server motherboard. I have added extra cooling fans to the front and back of the card. It falls off the bus at about 75 C.
I do not have anything attached to the card. No HDMI dongle. Running in headless mode.
Running at 75 deg C is OK. The normal operating temperature for GPUs before thermal throttling kicks in is typically up to 83 deg C.
How about the power supply to the GPU? I mentioned that first because it is the most common problem. For the GTX 1080 Ti, PCIe auxiliary power should be supplied by one 6-pin plus one 8-pin power connector. Make sure these cables are properly plugged in. We don’t know the configuration of your server; make sure the power supply is dimensioned correctly. A conservative rule of thumb for rock-solid stability over the lifetime of the system: the nominal power of all system components combined should not exceed 60% of the nominal wattage of the PSU.
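To make that rule of thumb concrete, here is a minimal Python sketch; the component wattages in it are illustrative placeholders, not measurements of this particular server:

```python
# Rough PSU sizing check per the 60% rule of thumb mentioned above.
# All component wattages below are illustrative placeholders; substitute your own.
NOMINAL_COMPONENT_WATTS = {
    "GPU": 250,                        # e.g. GTX 1080 Ti nominal power draw
    "2x CPU": 2 * 85,                  # placeholder per-socket TDP
    "memory, drives, fans, board": 120 # placeholder for everything else
}

PSU_WATTS = 920          # nominal rating of one PWS-920P-SQ
HEADROOM_FACTOR = 0.60   # keep combined nominal load at or below 60% of PSU rating

total = sum(NOMINAL_COMPONENT_WATTS.values())
budget = PSU_WATTS * HEADROOM_FACTOR

print(f"combined nominal load: {total} W, 60% budget: {budget:.0f} W")
print("within conservative budget" if total <= budget else "consider a larger PSU")
```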
I have not tried stress testing yet. I just reseated the card and verified that the 6-pin and 8-pin connectors are properly seated.
I am using dual Supermicro PWS-920P-SQ 920 W power supplies and am splitting the load by putting the 6-pin on one power supply and the 8-pin on the other.
I just tested again with the plotting software and it went off the bus at 95 W and 76 C. I am monitoring with nvidia-smi dmon -o DT.
I will try stress testing with FurMark or a similar product and let you know the results.
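In the meantime, here is a minimal logging sketch that could run alongside the plotter to capture the last power/temperature reading before the card drops off. It assumes Python 3 and that nvidia-smi is on the PATH; the query fields used are standard nvidia-smi ones:

```python
#!/usr/bin/env python3
# Minimal logger: record GPU power draw and temperature once per second so the
# last lines of the CSV show the state just before a "fell off the bus" event.
# Assumes nvidia-smi is on the PATH. Stop with Ctrl-C.
import csv, subprocess, time
from datetime import datetime

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,power.draw,temperature.gpu",
         "--format=csv,noheader,nounits"]

with open("gpu_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wallclock", "gpu_timestamp", "power_w", "temp_c"])
    while True:
        try:
            out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        except subprocess.CalledProcessError:
            break  # nvidia-smi itself fails once the GPU has fallen off the bus
        for line in out.stdout.strip().splitlines():
            writer.writerow([datetime.now().isoformat()] +
                            [field.strip() for field in line.split(",")])
        f.flush()
        time.sleep(1)
```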
My gut feeling is that this is not a good idea, but I don’t have practical experience with dual PSUs, so cannot authoritatively opine on this one way or the other.
The nominal power draw of the GTX 1080 Ti is 250W, of which the 8-pin connector delivers 150W, the 6-pin connector delivers 75W, with the balance supplied by the PCIe slot (less than 75W per the PCIe spec). Drawing 95W therefore should not be problematic per se.
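As a quick back-of-the-envelope check of those figures (a small illustrative sketch, nothing more):

```python
# Back-of-the-envelope power budget for a GTX 1080 Ti, using the figures above.
SLOT_W = 75    # PCIe slot ceiling per the spec
PIN6_W = 75    # 6-pin auxiliary connector
PIN8_W = 150   # 8-pin auxiliary connector
TDP_W  = 250   # nominal power draw of the GTX 1080 Ti

available = SLOT_W + PIN6_W + PIN8_W
print(f"available: {available} W, nominal draw: {TDP_W} W, observed: 95 W")
# 300 W available vs. 250 W nominal -- an observed 95 W draw is nowhere near the limit.
```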
If it is not a power issue and not a thermal issue, it could be one of PCIe signal integrity. You mentioned that you re-seated the GPU in the PCIe slot. Is the GPU also mechanically secured? PCIe connectors are not super robust, so any kind of mechanical stress on them needs to be avoided. Heavy GPUs in particular must be fastened at the bracket, usually with screws or clamps. Vibrations (in the past these could be caused by rotational mass storage) need to be avoided.
I have it secured with the bracket at the back of the server with two screws. It is also held in place with the clip on the front of the PCIe slot.
I downloaded hashcat to stress test and received an error about “Unsupported ptxas version 8.3; current version is 8.0”. I did some research and I may have an incompatible combination of NVIDIA driver and CUDA versions.
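One way to spot that kind of mismatch is to compare the CUDA version the driver supports against the installed toolkit. Below is a rough Python sketch; it assumes nvidia-smi and nvcc are both on the PATH and parses their human-readable output on a best-effort basis:

```python
# Compare the CUDA version the installed driver supports (reported in the
# nvidia-smi header) with the installed CUDA toolkit version (reported by nvcc).
# Assumes both tools are on the PATH; the parsing is best-effort only.
import re, subprocess

def cuda_version_from(cmd, pattern):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    m = re.search(pattern, out)
    return m.group(1) if m else None

driver_cuda  = cuda_version_from(["nvidia-smi"], r"CUDA Version:\s*([\d.]+)")
toolkit_cuda = cuda_version_from(["nvcc", "--version"], r"release\s*([\d.]+)")

print(f"driver supports up to CUDA {driver_cuda}, toolkit is CUDA {toolkit_cuda}")
if driver_cuda and toolkit_cuda and float(toolkit_cuda) > float(driver_cuda):
    print("toolkit is newer than the driver supports -- update the driver")
```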
I am going to research that tomorrow. Going to bed now. Thanks for all your help. Will let you know how it turns out.
I think I may have it working. It was failing with CUDA 12.3 and NVIDIA driver 525.
I am now running CUDA 12.3 and NVIDIA driver 545. I have successfully completed a plot. I will let it run through the night and hope to have 10 new plots in the morning.
Unfortunately, it still fell off the bus, but it was able to make a few plots before that happened. I am thinking it may be bad hardware. I have a similar setup with a Titan X and it is working without any issues.
Defective hardware is definitely a possibility, but quite rare in my recollection. One interesting case that came up in this forum in recent years involved a defective auxiliary power connector on the GPU. Other rare cases have involved damaged or dirty PCIe connectors.
How would I test the auxiliary power connector? Also, any suggestions on how to clean the PCIe connectors? I could use an eraser on the board, but what about the connector on the motherboard?
I am not an electronics technician. My knowledge is based on experience with building systems from components in the olden days before switching to ready-to-run workstations from one of the major US system integrators, as that turned out to be less of a hassle.
The poster with the faulty auxiliary power connector (I am guessing a bad solder joint, but they did not specify) eventually found it because the failures were intermittent. One might find that kind of problem by gently wiggling the plugged-in cable while the system is running.
Damage to PCIe connections typically happens to the component soldered to the motherboard, i.e. the slot. There are springy metal “tongues” in there that are rather delicate. These tongues can lose their springiness if a heavy board constantly pulls the card to one side (such as a GPU might do in a tower enclosure if not secured at the bracket). Mechanical damage can also happen when boards are jammed into the slot with force without being properly aligned. Sometimes there is just loose dust accumulated in the slot which can be vacuumed out or blown out with compressed air (the stuff that comes out of a can, not an automotive air compressor).
Damage to semiconductors can happen due to static discharge when handling electronic components. Last time I did work in an industry lab (during the initial bring-up for a new graphics chip), boards still came in anti-static bags, there was a grounding mat at the entrance to the lab, and we were required to put on a grounded wrist strap before handling semiconductors. I am guessing all that still applies, but modern parts may have better ESD protection.
Electronic components physically age. The useful life of consumer electronics parts is typically five to ten years. The underlying deterioration processes accelerate with higher operating temperatures. Electrolytic capacitors are the most common passive part to fail, e.g. in power supply units. Processors can malfunction due to hot-carrier injection (electric charge becomes trapped in the CMOS gates), which reduces the switching speed of transistors, or metal migration, which thins out the wiring inside the chip, slowing down signal transmission, and can in the worst case short out wires. Other failure modes exist; I only learned about these things very tangentially when I was involved with building x86 processors in the past.
I have some good news. I was able to get it working reliably. I did the following:
- Fixed a memory configuration error on the motherboard. I had mismatched DIMM modules mixed between the two CPUs. I reorganized them so each physical CPU has DIMMs of the same type, rank, and speed.
- Fixed a loose ground contact on the 6-pin PCIe power connector.
- Cleaned the PCIe connector on the GPU with an eraser.
I am pretty sure the first fix resolved the issue. I am now plotting at full speed again.
I spoke too soon. When I tried mining and plotting at the same time, it went offline when it reached 82 C.
Not sure if it has a safeguard or not.
I have never seen a GPU shut down due to thermals. The feature exists, though. You can query the various thermal limits with nvidia-smi. Exceeding the lowest limit triggers throttling, that is, a reduction in the clock frequency of the GPU, which in turn should lead to an immediate temperature drop. This lowest limit is typically around 83 deg C. Per your previous observations, the GPU was already falling off the bus at around 75 deg C (a perfectly normal operating temperature), so in my thinking thermals are not a problem here.
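For reference, a minimal sketch for pulling those limits out of nvidia-smi (it assumes nvidia-smi is on the PATH and relies on the text labels printed by nvidia-smi -q -d TEMPERATURE):

```python
# Print the GPU's thermal limits as reported by nvidia-smi.
# Assumes nvidia-smi is on the PATH; relies on the labels printed by
# "nvidia-smi -q -d TEMPERATURE" (e.g. "GPU Shutdown Temp", "GPU Slowdown Temp").
import subprocess

out = subprocess.run(["nvidia-smi", "-q", "-d", "TEMPERATURE"],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    if "Temp" in line:          # keep only the temperature-related lines
        print(line.strip())
```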
I enumerated issues I am aware of that could trigger the “falling off the bus” scenario. My understanding is that this is an event detected by HW when a breakdown in communication occurs between host and device fairly low in the PCIe protocol stack.
Have you tried putting the GPU into a different slot? Is a PCIe riser card being used in this machine, or are the PCIe cards plugged directly into slots on the motherboard? Use of a riser card increases the number of physical connectors in the signal path and can negatively impact signal integrity.
You may want to have somebody familiar with DIY system assembly look over your system configuration. Maybe something jumps out visually that has not been discussed here.
I put the server in mining-only mode and it has been running for about 12 hours without error. It is using around 70 to 80 watts and running at 58-60 C.
I am not using riser cards. Unfortunately, I cannot put the card in another slot because of physical space limitations in the server case.
Updating the drivers allowed it to fail more gracefully, and it is definitely more stable than it was. For now I will run it as either a plotter or a miner, but not both.
I will also query the settings and let you know what I find.
I don’t know what “mining-only mode” means, and was not aware that there are servers that come with a predefined mode of this nature.
It is my understanding that crypto-currency miners often reduce GPU clocks and operating voltage to maximize performance per watt and to reduce operating temperatures so electronic components age more slowly (see Arrhenius Law). Usage profiles common in high-performance computing, on the other hand, often emphasize minimizing time to solution, which tends to maximize power draw and thermal load. In this context it is not unusual for GPUs to run close to the throttling limits for extended periods of time.
Since various failure mechanisms in semiconductors cause a slowdown (both in terms of switching speed and signal transmission), it seems plausible that a seven-year-old GPU such as the GTX 1080 Ti has physically deteriorated to the point where it can no longer operate reliably at the frequencies it was designed for. Depending on the age of the server platform, the same may apply to host components. Under this hypothesis, what you are observing is an indication that this GTX 1080 Ti is on its last legs; it has reached the end of its useful life and it is time to replace it.