GPU Memory Less Than Promised

It is my understanding that the A10 has 24 GB of memory. However, nvidia-smi shows that only around 22 GB is available:

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:01:00.0 Off |                    0 |
|  0%   49C    P0    61W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   46C    P0    60W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          Off  | 00000000:61:00.0 Off |                    0 |
|  0%   47C    P0    60W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10          Off  | 00000000:C1:00.0 Off |                    0 |
|  0%   45C    P0    57W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A10          Off  | 00000000:E1:00.0 Off |                    0 |
|  0%   46C    P0    59W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Could anyone help me understand what happened?

Is this a Linux system? Based on the output of nvidia-smi, almost exactly 92.5% of the raw GPU memory is reported as available to user applications.
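To make that concrete (taking the nominal 24 GB as 24576 MiB, since nvidia-smi reports MiB):

22731 MiB / 24576 MiB ≈ 0.925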

(1) CUDA requires some GPU memory for its own data structures
(2) If this GPU supports ECC, some memory may be reserved for storing the ECC check bits
(3) Something seems to be running on these GPUs, because the power state is P0 and power usage is reported as ~60W, which is well above typical idling power (I would estimate around 8W at idle). Whatever is running on these GPUs is likely using some memory. What is confusing, though, is that it says “no running processes found” and that GPU utilization is reported as 0%. I cannot explain that; a per-GPU query like the one sketched below may help narrow it down.
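If it helps, the relevant fields can be pulled up side by side with a scripted nvidia-smi query. This is just a sketch; the field names assume a reasonably recent driver (nvidia-smi --help-query-gpu lists what your version supports):

  nvidia-smi --query-gpu=index,name,memory.total,memory.used,ecc.mode.current,power.draw,utilization.gpu --format=csv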

The A10 supports ECC, and it is enabled by default. ECC on GDDR-based GPUs reduces the available memory.

But in our case, ECC is zero, so I am assuming it is disabled?

You mean “Volatile Uncorrectable ECC” is zero? That is a report of ECC errors detected, not whether ECC is enabled or not. If ECC were disabled that field would report N/A, not 0.

nvidia-smi has command-line help available; you can use that to learn how to turn ECC on or off. The longer-form nvidia-smi output (-a) will also show additional ECC info.
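For example (just a sketch; the exact options are listed in nvidia-smi --help, changing the ECC mode requires root privileges, and the new mode only takes effect after a GPU reset or reboot):

  nvidia-smi -q -d ECC    # detailed ECC status and error counters
  nvidia-smi -e 0         # disable ECC (pending until the next reset/reboot)
  nvidia-smi -e 1         # re-enable ECC

Add -i <gpu-index> to target a single GPU.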

Great! Thanks!

Actually, is ECC really useful? I am facing a situation where turning off ECC would give us enough memory to assign two processes to a single A10, though at an unknown risk. Without turning ECC off, we have to keep optimizing our code. Is it a reasonable choice to turn it off?

I don’t know what the risk would be either.

Speaking for myself, I consider ECC to be a useful feature and would never turn it off.

Weighing the risk depends on your use case, so it is something only you can decide. There is a huge difference between merely creating nice pictures of fractals for recreational purposes and working with, say, medical imaging. Personally, I would never want to have to wonder whether what is shown in an image is a cancerous lesion or an artifact caused by a flipped bit in memory.

GPU memory error rates are typically quite low in ordinary circumstances. In a cluster with about 400 machines running CUDA-accelerated applications close to 24/7, I recall seeing a simple memory error (a single flipped bit) once every few days. However, published data [I cannot find a reference offhand] from much larger supercomputer clusters shows that error rates are not distributed evenly: most GPU memory errors are attributable to a fairly small group of nodes, and they also depend on environmental conditions. For example, higher ambient temperature correlates with higher error rates.

If your production system comprises workstation-class or server-class machines with ECC-protected system memory, that would be a pretty good indication that it is also a good idea to keep ECC enabled on your GPUs.

Personally, I tend to err on the side of caution, so I have been using Xeon-class CPUs with ECC-protected system memory and Quadro GPUs for the past 20 years (not all of these Quadro GPUs have offered ECC, as this is limited to high-end GPUs).


You could also just execute your calculations twice or, better, on two devices.

But this depends on how long your calculations take: in a simulation running for several months, multiple errors may happen, and a simple repetition of the execution may (possibly) yield a different value each time. So you would either have to save intermediate data (to narrow down and repeat the affected parts) or do error correction (-> ECC!) instead of mere detection.

There are also different classes of errors: a high-energy cosmic ray (the most common cause!) may flip a bit once as it passes through, whereas a faulty memory module may produce frequent errors. Neither comes at a predictable frequency. Cosmic-ray rates may depend on solar activity (whether the particles come directly from the sun or are modulated by the sun’s influence on the earth’s magnetic field), or they may rise suddenly when a supernova occurs. A hardware defect can stem from a bad batch of hardware or from how the parts are handled (e.g. spikes in the supply voltage).

GPUs with ECC have SECDED (single error correct, double error detect) capability.

Jim Rogers, “GPU Errors on HPC Systems: Characterization, Quantification, and Implications for Architects and Operations”, GTC’15 session S5566 (slide deck online), presented some statistics on single-bit and double-bit errors logged on Titan, an early GPU-based supercomputer. Errors were logged on 18,000 GPUs over 22 months.

Interesting findings reported:

(1) “98% of the single bit errors were confined to 10 cards”, which were traced to test escapes for the L2 cache, not GPU on-board memory. From my experience with CPUs, test escapes tend to be an issue early in a product’s life cycle, when production screening is not yet as tight as it should be. Supercomputers are often the first customers of certain GPU products.

(2) The total number of (uncorrectable) double-bit errors recorded was just 91, which corresponds to roughly one per 3 million GPU hours.
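As a rough sanity check on that rate, assuming the GPUs were powered essentially around the clock:

18,000 GPUs x 22 months x ~730 hours/month ≈ 2.9 x 10^8 GPU hours
2.9 x 10^8 GPU hours / 91 errors ≈ 3.2 x 10^6 GPU hours per error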

Very interesting finding. It seems that a newer driver reserves less memory for this ECC feature. Using a newer driver (5**), we saw nearly 24 GB of available memory on the A10. Is this behavior driver-dependent?

Whether ECC requires reserving part of the memory for storing the check bits is a function of (1) the DRAM technology used and (2) whether ECC is turned on. Generally speaking, a memory subsystem based on GDDR will reserve memory for check bits, as there are no dedicated resources to store them (“in-band storage”), while an HBM-based memory subsystem provides dedicated storage for the check bits (“out-of-band storage”). To my knowledge, the A10 uses GDDR6 memory.

So to my understanding, unless installing a new driver changes the default ECC state (on/off), it should not affect this overhead. I would suggest double-checking the ECC state currently in effect with nvidia-smi.
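For a quick check of the current (and pending) ECC mode across all GPUs, something along these lines should work on reasonably recent drivers (a sketch, not the only way to query it):

  nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv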

I found that running K80s in a Linux environment had them using about 100 watts while idling… Here a 3060 Ti is using 12 watts in a headless configuration.
[screenshot]

But I agree; when I coded a utility to inspect this, I got the following:
[screenshot]

6225002496 / 1024 / 1024 / 1024 gives me 5.7 GB? Or are they using 1000 for the metric prefix?

Not sure why you are bringing a power discussion into this thread. It’s quite possible that different GPUs have different power profiles and different power management strategies. It’s also possible that NVIDIA’s power strategy has changed over the years.

  • 3060Ti is a relatively recent GPU. K80 is about 10 years old.
  • 3060Ti is a GeForce GPU, designed to go into a system that may only consume in total a few hundred watts. The target systems for K80 typically consumed a few thousand watts.
  • The product strategy for the 3060 Ti probably did not involve maximum CUDA performance in a datacenter role. The impact on CUDA run times of pulling the GPU out of a low-power idle state was probably considered an acceptable tradeoff for lower idle power. The product strategy for the K80 may be different. The K80 measurement will also surely be affected by persistence mode, so you may be comparing apples and oranges.

It’s not surprising to me that the power behavior of these two very different GPUs is different.

The 6225002496 number is in units of bytes. The 8192MiB number is evidently in units of MiB = 1024 x 1024 bytes. I don’t know why your utility is reporting ~6GB on an 8GB GPU. Possible reasons include bugs in your utility code, or some other consumer of GPU memory running concurrently with your utility. It would appear that ~2GB of GPU memory is being used up, and it is not ECC as discussed in this thread - the 3060 Ti doesn’t support ECC. (The 3060 Ti has 8GB of GPU memory.)

It’s not. The output shown says it is for a GTX 1660 Ti, which according to the TechPowerUP database is a GPU with 6 GB of memory.

The difference between the number shown and the full 6 GB is about 220KB, so the number displayed looks like it is the amount of memory available to the user, rather than total physical memory. 220KB used for the GUI and CUDA’s own needs seems entirely plausible.


I think it is 220MB.

6144x1024x1024 = 6,442,450,944
op reported:     6,225,002,496
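The difference is 6,442,450,944 - 6,225,002,496 = 217,448,448 bytes, i.e. about 207 MiB (roughly 217 MB).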

Still, I think ~220 MB would be fairly typical for CUDA context creation, and the runtime probably creates a CUDA context for some of the APIs used to retrieve that info.

Agreed. Brain fart on my part. I’ll increase my caffeine infusion …

Power was brought up:

(3)“Something seems to be running on these GPUs, because power state is P0 and power usage is reported as ~60W, which is much above typical idling power.”

I was only confirming that I would see significant power usage on K80s even when idling…

You, sir, are correct - the K80s are behemoths.

In order to run them safely in a non-server environment, I ended up giving them a 1000W power supply and placing a high-volume AC fan literally inches from them. Then, because I was running two K80s inches apart, I needed to power-limit them to 100 watts per core block (each K80 is two blocks of 2496 cores apiece). But it worked very nicely for a home environment without requiring turbine-blower server fans; the setup was near silent.
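For reference, a cap like that can be set per device with nvidia-smi (a sketch; it assumes the two GPUs of each K80 show up as separate device indices, requires root, and only works if 100 W lies within the board's supported power-limit range):

  sudo nvidia-smi -i 0 -pl 100    # cap the first GPU at 100 W
  sudo nvidia-smi -i 1 -pl 100    # cap the second GPU at 100 W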

However, given the 470-driver lock-downs due to the older compute capability, and the fact that I was getting about 4 TFLOPS per card when a single 3060 Ti can give me 16.2 TFLOPS, I just upgraded - less headache in driver terms.

The magical thing about the K80s is that in FP64 they still outperform a 4090ti today, and that is going to make them a big collector’s item for data centres trying to maintain their 64-bit farms. AFAIK some electronic simulations require 64-bit and are migrating to 128-bit now.

Well, I brought up the power consumption of the OP’s A10 as potentially indicative of something running on the GPU and using some of the GPU memory. So the point was germane to the discussion of GPU memory usage. I have since learned that the use of clock locking can lead to significant power consumption on idling GPUs, making the observation somewhat of a red herring.

I don’t quite recall when NVIDIA first added power management to their GPUs; it might well have been with the Kepler generation. Initially, GPU power management was quite crude and not nearly as sophisticated as today, where GPUs provide fine-grained voltage and frequency stepping, internal functional blocks that can be turned on or off as needed, and dynamically configured PCIe interfaces, all adapting to closely monitored power usage, thermals, and voltage stability.

The K80 was designed as a high performance SKU and is also a dual GPU design. The emphasis on performance / watt that drives NVIDIA’s newer architectures was at best present in rudimentary form. So the K80 is a power hog. I would not recommend using one today. But even with the latest hardware and power management, a hypothetical high-end dual-GPU design (like the K80) might well be idling at 30+W.

As I understand it, the focus on power efficiency in modern GPUs is driven by the needs of the supercomputing market where the professional GPU lines are concerned, and by government regulations (in particular the EU, but also places like California) as it pertains to consumer GPUs. The dynamic clocking schemes that are part of the power management also provide some incremental performance gains by allowing for the exploitation of engineering margin (which traditionally was around 20% and was the target of overclockers).