Problems after inserting a P100

I inserted a P100 into my server, and now I can’t run my CUDA samples anymore. For example, deviceQuery dies with:

CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 999
→ unknown error
Result = FAIL

I deleted the binary and recompiled; the compilation goes through, but when I run it, same problem. Other samples fail differently, for example BlackScholes complains with:

CUDA error at …/…/common/inc/helper_cuda.h:779 code=999(cudaErrorUnknown) “cudaGetDeviceCount(&device_count)”

Same thing here: I delete the binary, recompile, the compilation succeeds, but when I run it, same error.

So, by inserting another GPU, something about device enumeration has broken.

I have three generations of cards in this machine, but in a previous thread njuffa confirmed to me that this shouldn’t matter. I have:

device 0: GTX 980
devices 1 through 4: K80
device 5: P100 (the 16 GB PCIe version)

I use the 470.57.02 driver, which is the latest available that supports Tesla cards (and apparently also the last to support Kepler; it seems that after this Tesla-supporting driver I won’t be able to use my K80s anymore, can anyone confirm?).

CUDA 11.4.2 on Fedora 34.

The P100 shows up just fine in nvtop, nvidia-smi, and nvidia-settings (in fact, nvidia-settings now shows me an additional disabled virtual device, “NVIDIA VGX”. I’ll have to learn how to use that, given that the P100 has no video output).

Thanks!

Are any errors logged at system level? Check dmesg. In particular, are there any notifications RmInitAdapter failed!?
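For example, something along these lines scans a log stream for the usual failure signatures (the helper function name is made up purely for illustration; the grep patterns are the messages to look for):

```shell
# check_nvrm_errors: filter a kernel-log stream for common NVIDIA
# driver failure signatures. The function name is ours, for
# illustration only.
check_nvrm_errors() {
    grep -iE "RmInitAdapter failed|NVRM: Xid|fallen off the bus" "$@"
}

# On a live system, pipe the kernel ring buffer through it:
#   dmesg | check_nvrm_errors
```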

All Tesla cards require large BAR0 apertures. The system BIOS of your machine may provide limited usable address space for these apertures, insufficient for adding a fifth Tesla card. This may cause apertures for GPUs to be mapped on top of each other or not mapped at all, making these GPUs non-operational. The system BIOS may or may not have relevant configuration settings; something to check. An interesting experiment would be to remove all K80s to see whether the system can function properly with just the GTX 980 and the P100 present.
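The apertures the firmware actually assigned can be inspected from user space; a sketch (10de is NVIDIA’s PCI vendor ID, and the little filter function is just a convenience I made up):

```shell
# show_nvidia_bars: keep only the device headers and memory BAR
# lines from verbose lspci output (illustrative helper, not a
# standard tool).
show_nvidia_bars() {
    grep -E "NVIDIA|Memory at" "$@"
}

# On a live system:
#   lspci -d 10de: -v | show_nvidia_bars
# A BAR the BIOS failed to map may show "Memory at <ignored>" or
# "<unassigned>" instead of a real address range.
```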

Given that you have been running with multiple K80s, I assume you are aware of the airflow requirements for passively-cooled GPUs like the K80 and the P100, and there are no messages similar to this in the system log: “NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.”

I am not aware of a discontinuation of K80 support in recent NVIDIA drivers, although I am expecting that to happen fairly soon. I would expect there to be an explicit notice in driver release notes when K80 support is discontinued and cannot find any such note.

What is the provenance of the Tesla P100? Is the source reputable? There is a certain amount of fraud in the used-GPU market (as evidenced by earlier questions in this forum), and this includes the sale of defective as well as counterfeit hardware.

[Later:]

It seems the K80 may have a longer remaining life than I expected, per this tidbit from Microsoft:

The first group of GPU products to be retired in Azure are the original NC, NC v2 and ND-series VMs, powered by NVIDIA Tesla K80, P100, and P40 datacenter GPU accelerators respectively. These products will be retired on August 31st 2022, and the oldest VMs in this series launched in 2016.


Check dmesg. In particular, are there any notifications RmInitAdapter failed!?

None of that in it.

Re the BAR0 apertures: I am aware of that, but the P100 is now in a slot that previously held another (a third) K80. So wouldn’t removing 24 GB and adding only 16 GB actually “free” address space? Unless there is some per-GPU limit: each K80 GPU has only 12 GB, and the P100’s 16 GB could now exceed it. But then why would that disable things for all GPUs? Even the GTX 980 can’t run CUDA anymore (./BlackScholes -device=0 fails with: CUDA error at …/…/common/inc/helper_cuda.h:724 code=999(cudaErrorUnknown) “cudaGetDeviceCount(&device_count)”). Hrimpf, I guess I’ll have to remove K80s; I was hoping I could avoid that.

Re cooling: yes, I am aware of the cooling requirements, but heat doesn’t seem to be a problem here. I watch it all day in nvtop and with “watch nvidia-smi”, and the P100 always idles in the 38–45 °C range. I don’t find the error you mentioned in any of the files in /var/log (did I look in the right place? Is that what you mean by “system log”?). But heat doesn’t seem to be the problem at all.

Re provenance: the ebay seller has over 4600 transactions, has a red star (I think that means “hot seller”), and has a positive feedback rating of 99.8%. I’m willing to trust him, and I’d find it hard to believe that a faulty P100 would cripple CUDA for ALL the GPUs, which otherwise seem to be working.

The BAR0 aperture has nothing to do with mapping GPU onboard memory into a unified virtual address space which is something that CUDA does.

It is a low-level PCIe feature for mapping interconnect-specific addresses to expose a bunch of memory-mapped I/O that is needed for communication between the GPU hardware and the lowest levels of the NVIDIA driver stack (below the CUDA driver). BAR stands for Base Address Register, I think, but I am not a PCIe expert.

I don’t know the specific size of this aperture for various NVIDIA GPUs; I only recall that the BAR0 apertures required by Tesla cards are generally significantly larger than those required by consumer cards, and that this is a fairly frequent issue with DIY-integration of Tesla GPUs into systems. There may be error messages like this in the system logs when it happens: “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid”

The configuration of PCIe resources may be configurable in the system BIOS, but this will differ by system, so I am unable to give more detailed advice.

As it turns out, all I needed to do was disable virtualization and Intel’s VT-d in the BIOS. I don’t need virtualization technology on this server anyway. Your specific text “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid” prompted me to search the system logs for the string “NVRM”. That yielded nothing, but I noticed an error pointing to DMAR and RMRR. So I looked for that with dmesg | grep DMAR, which is supposed to end with the line “DMAR-IR: Enabled IRQ remapping in x2apic mode” and show no errors. Well, it did end with that line, but it also complained about a BIOS error and told me to contact the BIOS vendor for a fix. After more web research I found in a forum that HP isn’t going to fix that bug in the BIOS. I rebooted with Intel virtualization and VT-d turned off, and bingo.
Strange, however, that deviceQuery has the P100 as device0, and both nvidia-smi and nvtop have it as device1. I think it kinda should be device1, because the right-most card, as installed in the mobo, is the GTX 980 (with display), and then the Kepler and Pascal cards come to the left of that. And nvidia-smi -i 0 -q and nvidia-smi -i 1 -q confirm that the GTX as device0 has BusID 04:00.0 and the P100 has BusID 0A:00.0. So I am irreducibly confused as to why deviceQuery has them the other way around. Deleting the file and recompiling (in CUDA 11.4.2) produces the same output.
Also, nvidia-settings still has the P100 with a disabled virtual screen.
Thanks for getting me going on the right track. I’m hopeful I can re-insert the other K80 tomorrow for another two K80 GPUs.


Enumeration of GPUs in PCIe-space and in CUDA-space works very differently. nvidia-smi is a driver-level tool that uses the former, while deviceQuery is a CUDA runtime tool that uses the latter.

As I recall (and this should be mentioned in the documentation somewhere), CUDA’s enumeration uses a heuristic that assigns device 0 to the most powerful GPU in the system. The most powerful GPU in your system is clearly the Tesla P100.

I don’t know the details of how GPUs are enumerated in PCIe-space, but presumably this is by PCIe root complex and increasing BusID for each root complex.
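If it helps, the driver-level (PCIe-order) view can be dumped directly; a minimal sketch using the standard nvidia-smi query interface, guarded so it degrades gracefully on a machine without the driver installed:

```shell
# List index, name, and PCI bus id for each GPU as the driver
# enumerates them; compare this against deviceQuery's ordering.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_list=$(nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv)
else
    gpu_list="nvidia-smi not found on this machine"
fi
printf '%s\n' "$gpu_list"
```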

first para: thanks, that explains it
second para: if it is in the documentation, I’m sure someone will find it for me :). There seem to be people around who believe I don’t read documentation. But what constitutes “most powerful” must be a matter of debate (perhaps that’s documented too :) ). With more and more features being added to NVIDIA cards, the notion of “power” must clearly diverge among users/programmers: raw CUDA speed, ray-tracing performance, availability of tensor cores, etc.
third para: you’re probably right (as always)

This makes it appear that my BIOS does not have an aperture size limitation that rules out GPUs with more than 12 GB. So I’m hopeful I can incrementally replace the K80s with V100s and more recent cards that have larger memory sizes.

Do all “post-P100” Tesla cards have virtual displays now? I was surprised to see the disabled virtual display in nvidia-settings for the P100. Perhaps that can be turned off somehow (I didn’t find it in the nvidia-smi documentation). Not sure I like it; at the very least it adds another layer of complexity that is better turned off than left to confuse the system.

A piece of ancient advice to programmers states: If all else fails read the documentation.

As I said, CUDA’s determination of which GPU is most suitable to take the lead as device 0 is done by a heuristic. The details of the heuristic are not documented, best I know, just like the Fair Isaac Corporation doesn’t explain how they compute your FICO score (details of which may change at any time, etc., etc.). Presumably the number of CUDA cores and the compute capability feature prominently in the formula.

I know nothing about virtual displays, and have used none of P100, V100, A100.

This Q&A on Stackoverflow states that there is a way to make CUDA’s GPU enumeration identical to the PCIe enumeration:

export CUDA_DEVICE_ORDER=PCI_BUS_ID

Might be worth a try. In the CUDA Programming Guide this information is in Appendix M, “CUDA Environment Variables”.
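A minimal usage sketch (variable name and value exactly as in that appendix; deviceQuery just stands in for any CUDA application run from the same shell):

```shell
# Ask the CUDA runtime to enumerate GPUs in PCI bus-id order so
# its device indices match nvidia-smi's.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Then re-run an already-built sample in this shell, e.g.:
#   ./deviceQuery
```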

I think I’ll try swapping device0 and device1 around in the mobo. Right now, the P100 blocks the airflow to the fans of the GTX 980, and there is not enough space to attach a cooler / fan assembly to the P100. If I swap them, nothing blocks the airflow to the GTX, and I can attach a cooler / fan assembly to the P100, so I can actually put some load on it. I think I’ll need to shuffle some interrupts or “PCI-E boot order” in the BIOS for that.
Thanks for your guidance!

Abstracting from any P100-specific cooling issues, the dynamic clock boosting on GPUs based on Pascal and later architectures highly incentivizes aggressive cooling. If you manage to keep GPU temperature below 60 °C, boosting to ~1800 MHz is not out of the question. With stock coolers, under sustained load, I only manage to run at that speed for a few minutes. From what I can see around the internet, water cooling may be necessary to keep NVIDIA’s modern GPUs at <= 60 °C at all times. I am too scared of leaks to try it.

It’s incredible how fast the Tesla cards get hot. I have a ProLiant in a 4U design, so it has the fans between the drives and “everything else”, but they don’t run at full speed. So in principle I have the cooling that NVIDIA envisaged for the Tesla cards: passive cooling from the fans that sit behind the drives in a typical rack-mount server design. But I guess that only works at full fan speed. I’ve already fried one of the K80s with Ethereum mining: the temp was in the 80s, and suddenly my whole system froze after just a few minutes. Yesterday I tried it with the P100; it shot up within 1–2 minutes, and at 75 °C I interrupted ethminer. Got an impressive hashrate though, 31.x MH/s. So, lessons learnt:

  • test with cheap stuff; it’s OK to fry a used K80 for testing and then place it in your at-home museum
  • Tesla card passive cooling at normal or adaptive fan speed in rack mounts is not enough. You need active cooling for the Tesla cards as well, perhaps unless you run the rack fans at full speed (not tested; I’d go deaf).
  • it does raise the question, though, of what the slow-down temperature actually does. Shouldn’t reaching the slow-down temp throttle the card enough that overheating is impossible? What is it for, if not preventing overheating?
  • the P100 gives an attractive Ether hash rate of 31.x MH/s, but in $ terms that’s some $3.80/day, so not worth it, even before PoS / Ether 2.0. I’m done mining and will focus on improving my CUDA programming (I only ever used Ether mining for performance testing, never did it seriously).

I wonder how the data centers cool their P100s and V100s. I know AWS EC2 has V100s. They must run the rack fans at full speed, and that must be deafening. And they want to be carbon-neutral with an ambitious time goal, while running their cooling at full speed? Perplexing.

Frying a GPU should not be possible. In addition to the thermal throttling provided by software (which kicks in around 83 to 85 degrees Celsius depending on the GPU), there is also a thermal shutdown implemented in hardware, which kicks in at around 90 to 92 degrees Celsius to prevent permanent damage. The temperature limits can be queried with nvidia-smi. In my observation, the software-controlled thermal and power management of GPUs keeps them reliably at the throttling limit, with maximum short-time excursions beyond that of maybe a degree or so.
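For reference, those limits can be read out like so (standard nvidia-smi query, guarded so it degrades gracefully on a machine without the driver installed):

```shell
# Dump the driver-enforced slowdown/shutdown temperatures together
# with the current GPU temperature reading.
if command -v nvidia-smi >/dev/null 2>&1; then
    temp_report=$(nvidia-smi -q -d TEMPERATURE)
else
    temp_report="nvidia-smi not found on this machine"
fi
printf '%s\n' "$temp_report"
```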

While high operating temperatures do accelerate the aging of electronic components (Arrhenius law), in my experience it is perfectly fine to operate professional GPUs at the thermal throttling limit 24/7, assuming a typical useful life of five years. The first component to fail is often the memory on the card (which, based on IR images on the internet, can at times run hotter than the GPU itself).

With hundreds of watts being dissipated through a square inch of silicon, it is not surprising that a high-end GPU heats up very quickly even with a pound of heatsink attached. That’s why passively cooled GPUs need a defined airflow past the heatsink fins (and those fins had better be largely free of dust). An NVIDIA-approved integrator has the know-how to guarantee adequate airflow. Your server may have been designed without this many GPUs in mind and may not provide it.

Yes, the noise in a room full of servers (GPU accelerated or not), much of it fan noise, can be quite literally deafening and hearing protection might be necessary depending on exposure time.

first para: indeed, I expected overheating damage to be impossible; hence my consternation that it happened. This cannot have been “old age fatigue”, because I saw strict causality: the K80 sat in the system, running but mostly idle, for several months while I worked on my CUDA toy problems; then, when I ran ethminer on it, I saw the temps shoot up into the 80s, the whole system froze, and the card never worked again. And the fault is in the card, not the mobo or the PCIe bus: I took it out and put another K80 in the same slot, and that one worked; I then put the dead card in a different slot and it still didn’t work. Clearly, it’s the card. And causality is established by the observation that it worked fine for about half a year, and then under ethminer it took a mere 2–3 minutes to fail.
I agree with you that in principle a Tesla card should be runnable near max temps forever (OK, five years); that’s what data center cards are supposed to be designed for. But after this K80 fry-up I’m reluctant to go that high with my P100, or in the future a V100, perhaps an A100 at some point. Which is really frustrating, because if you run it at full load for something other than mining, it should present the same power profile: heavy compute load for an extended period of time.
After your post I wonder even more why the internal overheat protection didn’t work and damaged my card.

There’s this:
https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-470-57-02/index.html

"Attention: Release 470 will be the last driver branch to support Data Center GPUs based on the Kepler architecture. This includes discontinued support for the following compute capabilities:

sm_30 (Kepler)
sm_32 (Kepler)
sm_35 (Kepler)
sm_37 (Kepler)"

Yepp, that’s what I wrote in my original post (“I use the 470.57.02 driver, that is the latest available that supports Tesla cards (and apparently also the latest that supports Kepler, it seems after this Tesla-supporting driver I can’t use my K80s anymore, can anyone confirm?).”)
So, 470.57.02 will be the last driver and 11.4.2 the last CUDA that supports the K80. As far as I understand it, it should be possible to have separate CUDA installations with different versions, so I could install 11.5 next to 11.4.2 (I guess I’ll need to shuffle env vars, because PATH and LD_LIBRARY_PATH are set to version-specific directories). There seems to be good stuff in 11.5; it should be possible to use that as long as I retain 11.4.2.
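The env-var shuffling can be as simple as this sketch (the /usr/local/cuda-* paths are the default Linux toolkit layout; adjust for your install):

```shell
# Point the version-specific variables at the toolkit this shell
# should use; repeat with cuda-11.5 in another shell to compare.
CUDA_HOME=/usr/local/cuda-11.4
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```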

@rs277 Thanks for the pointer and correction. I admit I overlooked this statement squirreled away 75% down the release notes.

It would be advantageous to put deprecation/discontinuation notices at the top of such documents.

It is certainly possible to keep and use multiple CUDA versions (I had up to three at one point), as long as the installed driver package supports all of them. Does the driver package 470.57.02 support CUDA 11.5? Based on table 2 in the CUDA 11.5 release notes, the answer is yes, but only in terms of “CUDA minor version compatibility”.

As long as you take note of the CUDA minor version compatibility comments referenced in the CUDA 11.5 Release Notes (you did read those…?), you should be able to use the K80 with 11.5 in many cases.

Yes, for some reason the doc writers seem to place dropped/deprecated feature details toward the end, rather than feature them prominently at the beginning.

I am not disputing that your K80 is a brick now. But I don’t see how old-age fatigue can be excluded as the cause of death. Conversely, no evidence has been presented that “the internal overheat protection didn’t work and damaged my card”. I like car analogies. Your observations are akin to “I have this vintage car that was doing fine as long as I was just puttering around town, but when I took it out on the autobahn to drive at top speed it threw a rod within minutes”.

How old is this particular K80? Six years, seven years? Active components age, passive components age (capacitors in particular), solder joints can develop cracks due to thermal expansion and contraction, etc. There are other potential contributing damaging factors such as static electricity during handling.

Obviously the lifetime of electronics doesn’t have a sharp cliff (e.g. “dies the day after the warranty expired”), but follows a statistical distribution, with some parts dying earlier than others. Your dead K80 could be one in the left tail of that distribution.

When I worked at AMD back in the day, they had a department that did post-mortem analysis on products returned because they died prematurely. The goal was to improve engineering and manufacturing processes based on the findings. I don’t know whether NVIDIA has a department like that. If your GPU is still covered by a warranty (unlikely) it might be worthwhile finding out.

I see what you mean. Old-age fatigue may have accumulated internal, incremental damage, like micro-fissures, and then when put to max load the card cracked.
Those were sold on ebay as “used”, but two of the three appeared to be in pristine, never-used condition. I guess I’ll always have that problem when I buy used P100 or V100 cards on ebay. Even cards sold as “new” or “refurbished” can carry internal damage (e.g. corrosion) that accumulates through the mere passage of time.