GPU Memory Less Than Promised

It is my understanding that the A10 has 24 GB of memory. However, nvidia-smi shows that only around 22 GB is available:

```
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:01:00.0 Off |                    0 |
|  0%   49C    P0    61W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   46C    P0    60W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          Off  | 00000000:61:00.0 Off |                    0 |
|  0%   47C    P0    60W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10          Off  | 00000000:C1:00.0 Off |                    0 |
|  0%   45C    P0    57W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A10          Off  | 00000000:E1:00.0 Off |                    0 |
|  0%   46C    P0    59W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Could anyone help understand what happened?

Is this a Linux system? Based on the output of nvidia-smi, almost exactly 92.5% of the raw GPU memory is reported as available to user apps.

(1) CUDA requires some GPU memory for its own data structures
(2) If this GPU supports ECC, some memory may be reserved for storing the ECC check bits
(3) Something seems to be running on these GPUs, because power state is P0 and power usage is reported as ~60W, which is much above typical idling power (I would estimate around 8W at idle). Whatever is running on these GPUs is likely using some memory. What is confusing though is that it says “no running processes found” and that GPU utilization is reported as 0%. I cannot explain that.
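For what it's worth, the 92.5% figure quoted above is easy to check with a few lines of arithmetic (a quick sketch; 24 GB is taken as 24 × 1024 MiB, matching nvidia-smi's MiB units):

```python
# Sanity check: the A10's nominal 24 GB (24 * 1024 MiB) versus the
# 22731 MiB that nvidia-smi reports as usable.
nominal_mib = 24 * 1024            # 24576 MiB raw capacity
reported_mib = 22731               # from the nvidia-smi output above

ratio = reported_mib / nominal_mib
print(f"available: {ratio:.1%}")                       # -> available: 92.5%
print(f"reserved:  {nominal_mib - reported_mib} MiB")  # -> reserved:  1845 MiB
```

So roughly 1.8 GiB is set aside, consistent with ECC check-bit storage plus driver/CUDA overhead.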

A10 supports ECC and it is enabled by default. ECC on GDDR GPUs reduces the available memory.

But in our case, ECC is zero, so I am assuming it is disabled?

You mean “Volatile Uncorrectable ECC” is zero? That is a count of ECC errors detected, not an indicator of whether ECC is enabled. If ECC were disabled, that field would report N/A, not 0.

nvidia-smi has command line help available, you can use that to learn how to turn ECC on or off. The longer form nvidia-smi output (-a) will also show additional ECC info.
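For reference, the relevant invocations look roughly like this (a sketch; `-i` selects the GPU index, and changing the ECC mode requires root privileges plus a reboot or GPU reset before it takes effect):

```shell
# Show ECC mode (current and pending) plus error counters
nvidia-smi -q -d ECC

# Turn ECC off (0) or on (1) for GPU 0; takes effect after the
# next reboot / GPU reset
sudo nvidia-smi -i 0 -e 0
```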

Great! Thanks!

Actually, is ECC really useful? I am facing a situation where, by turning off ECC, we would have adequate memory to assign two processes to a single A10, at an unknown risk though. Without turning ECC off, we have to keep optimizing our code. Is it a reasonable choice to turn it off?

I don’t know what the risk would be either.

Speaking for myself, I consider ECC to be a useful feature and would never turn it off.

Weighing the risk depends on your use case, so is something only you can decide. There is a huge difference whether you are merely creating nice pictures with fractals for recreational purposes, or whether you are working with, say, medical imaging. Personally, I would never want to have to wonder whether what is shown in an image is a cancerous lesion or an artifact caused by a flipped bit in memory.

GPU memory error rates are typically quite low in ordinary circumstances. In a cluster of about 400 machines running CUDA-accelerated applications close to 24/7, I recall seeing a simple memory error (a single flipped bit) once every few days. However, published data [can’t find a reference on the quick] from much larger clusters in supercomputers shows that error rates are not distributed evenly: most GPU memory errors are attributable to a fairly small group of nodes, and they also depend on environmental conditions. For example, higher ambient temperature correlates with higher error rates.

If your production system comprises workstation-class or server-class machines with ECC-protected system memory, that would be a pretty good indication that it is also a good idea to keep ECC enabled on your GPUs.

Personally, I tend to err on the side of caution, so I have been using Xeon-class CPUs with ECC-protected system memory and Quadro GPUs for the past 20 years (not all of these Quadro GPUs have offered ECC, as this is limited to high-end GPUs).

You could also just execute your calculations twice, or better, on two devices.

But this depends on how long your calculations take: in a simulation running for several months, multiple errors may happen, and simply repeating the execution may (possibly) yield a different value each time. So you would either have to save intermediate data (to narrow down and repeat the affected parts) or do error correction (-> ECC!) instead of mere detection.

There are also different classes of errors: a cosmic high-energy ray (the most common cause!) may occasionally flip bits as it traverses the chip, whereas a faulty memory module may produce frequent errors. Neither type of error comes at predictable frequencies. Cosmic rays may depend on solar activity (whether arriving directly from the sun or via the sun’s influence on the Earth’s magnetic field), or they occur suddenly when a supernova happens. A hardware defect can come from a bad batch of hardware or from how the parts are handled (e.g. spikes in supply voltage).

GPUs with ECC have SECDED (single error correct, double error detect) capability.
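As a toy illustration of how SECDED works, here is a sketch using an extended Hamming(8,4) code. This is not the actual GPU implementation; hardware ECC protects much wider words (e.g. 64 data bits with 8 check bits), but the principle of one correctable or two detectable flipped bits per protected word is the same:

```python
# Toy SECDED demo: extended Hamming(8,4) code.
# Codeword layout: [p0, c1..c7], with data bits at positions 3, 5, 6, 7.

def encode(d):
    """Encode 4 data bits into an 8-bit SECDED codeword [p0, c1..c7]."""
    c = [0] * 8                      # c[1..7] is a Hamming(7,4) codeword
    c[3], c[5], c[6], c[7] = d       # data bits go to positions 3, 5, 6, 7
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    p0 = 0
    for i in range(1, 8):
        p0 ^= c[i]                   # overall parity enables double detect
    return [p0] + c[1:]

def decode(w):
    """Return (data_bits, status); corrects 1-bit errors, detects 2-bit."""
    p0, c = w[0], [0] + w[1:]
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7]) * 1
         + (c[2] ^ c[3] ^ c[6] ^ c[7]) * 2
         + (c[4] ^ c[5] ^ c[6] ^ c[7]) * 4)   # syndrome = error position
    overall = p0
    for i in range(1, 8):
        overall ^= c[i]
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips -> single error
        if s:
            c[s] ^= 1                # flip the erroneous bit back
        status = "corrected"
    else:                            # syndrome set but overall parity matches
        status = "double-bit error detected"
    return [c[3], c[5], c[6], c[7]], status

word = encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[5] ^= 1                    # simulate a single flipped bit
print(decode(corrupted))             # -> ([1, 0, 1, 1], 'corrected')
```

Flipping two bits of `word` makes `decode` report "double-bit error detected" instead of silently correcting, which is exactly the behavior behind the single-bit vs. double-bit error counts in the Titan statistics below.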

Jim Rogers, “GPU Errors on HPC Systems: Characterization, Quantification, and Implications for Architects and Operations”, GTC’15 session S5566 (slide deck online), presented statistics on single-bit and double-bit errors logged on Titan, an early GPU-based supercomputer. Errors were logged on 18,000 GPUs over 22 months.

Interesting findings reported:

(1) “98% of the single bit errors were confined to 10 cards”, which were traced to test escapes for the L2 cache, not GPU on-board memory. From my experience with CPUs, test escapes tend to be an issue early in a product’s life cycle, when production screening is not yet as tight as it should be. Supercomputers are often among the first customers of certain GPU products.

(2) The total number of (uncorrectable) double-bit errors recorded was just 91, which corresponds to roughly one per 3 million GPU hours.
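The arithmetic behind that rate is easy to reproduce (a rough sketch, assuming roughly 730 hours per month):

```python
# Back-of-the-envelope check of "one double-bit error per ~3 million
# GPU hours" from the Titan statistics quoted above.
gpus = 18_000
months = 22
hours = gpus * months * 730        # ~730 hours in an average month
errors = 91                        # uncorrectable double-bit errors logged

print(hours / errors)              # roughly 3.2 million GPU hours per error
```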

Very interesting finding. It seems that a newer driver reserves less memory for this ECC feature. Using a newer driver (5**), we saw nearly 24 GB of available memory on the A10. Is this feature driver-dependent?

Whether ECC requires reserving parts of the memory for the storage of check bits is a function of (1) the DRAM technology used and (2) whether ECC is turned on. Generally speaking, a memory subsystem based on GDDR will reserve memory for check bits as there are no dedicated resources to store them (“in-band storage”), while an HBM-based memory subsystem provides dedicated storage for the check bits (“out-of-band storage”). To my knowledge, the A10 uses GDDR6 memory.

So to my understanding, unless installing a new driver changes the default ECC state (on/off), it should not affect this overhead. I would suggest double-checking the ECC state currently in effect with nvidia-smi.

I found that K80s running in a Linux environment used about 100 W when idling… Here, a 3060 Ti uses 12 W in a headless configuration.

But I agree. When I coded a memory-inspection utility, I got the following:

6225002496 / 1024 / 1024 / 1024 gives me 5.7 GB? Or are they using 1000 for the metric prefix?

Not sure why you are bringing the power discussion into this thread. It’s quite possible that different GPUs have different power profiles and different power management strategies. It’s also possible that NVIDIA’s power strategy has changed over the years.

  • 3060Ti is a relatively recent GPU. K80 is about 10 years old.
  • 3060Ti is a GeForce GPU, designed to go into a system that may only consume in total a few hundred watts. The target systems for K80 typically consumed a few thousand watts.
  • The product strategy for the 3060 Ti probably did not involve maximum CUDA performance in a datacenter role. The impact on CUDA run times of pulling the GPU out of a lower-power idle state was probably considered an acceptable tradeoff to achieve lower idle power. The product strategy for the K80 may be different from that. The K80 measurement will also surely be impacted by persistence mode, so you may be comparing apples and oranges.

It’s not surprising to me that the power behavior of these two very different GPUs is different.

The 6225002496 number is in units of bytes. The 8192MiB number is evidently in units of MiB = 1024 x 1024 bytes. I don’t know why your utility is reporting ~6GB on a 8GB GPU. Possible reasons may include bugs in your utility code, or some other consumer of GPU memory happening concurrently with the run of your utility. It would appear that ~2GB of GPU memory is being used up, and it is not ECC as being discussed in this thread - 3060Ti doesn’t support ECC. (3060 Ti has 8GB of GPU memory.)

It’s not. The output shown says it is for a GTX 1660 Ti, which according to the TechPowerUP database is a GPU with 6 GB of memory.

The difference between the number shown and the full 6 GB is about 220KB, so the number displayed looks like it is the amount of memory available to the user, rather than total physical memory. 220KB used for the GUI and CUDA’s own needs seems entirely plausible.


I think it is 220MB.

6144x1024x1024 = 6,442,450,944
op reported:     6,225,002,496

Still I think 220MB would be fairly typical for a CUDA context creation, and the runtime probably creates a CUDA context for some of the APIs used to retrieve that info.
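The figures in this sub-thread are easy to verify (a quick sketch using the byte counts quoted above):

```python
# Check the GTX 1660 Ti numbers from the discussion above.
total_bytes = 6144 * 1024 * 1024   # 6 GiB of physical memory
avail_bytes = 6_225_002_496        # amount reported by the utility

diff = total_bytes - avail_bytes
print(diff)                 # 217448448 bytes
print(diff / 2**20)         # ~207.4 MiB
print(diff / 10**6)         # ~217.4 MB, i.e. "roughly 220 MB"
print(avail_bytes / 2**30)  # ~5.8 GiB, matching the 5.7-ish GB reading
```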

Agreed. Brain fart on my part. I’ll increase my caffeine infusion …

Power was brought up:

(3)“Something seems to be running on these GPUs, because power state is P0 and power usage is reported as ~60W, which is much above typical idling power.”

I was only confirming I would get some significant power usage on K80’s even when idling…

You sir are correct - the K80’s are Behemoths

In order to get them to run (safely) in a non-server environment, I ended up giving them a 1000 W PSU and a high-volume AC fan literally inches from them. Then, because I was running two K80s inches apart, I needed to power-limit them to 100 W per GPU (each K80 is two blocks of 2496 cores apiece). But it worked very nicely for a home environment without requiring turbine-blower server fans; the setup was near silent.

However, given all the 470-branch driver lock-downs due to the older compute capability, and the fact that I was getting about 4 TFLOPS/card when a single 3060 Ti can give me 16.2 TFLOPS, I just upgraded. Less headache in driver terms.

The magical thing about the K80s is that in FP64 they still outperform a 4090ti today, and that is going to make them a big collector’s item for data centres trying to maintain their 64-bit farms. AFAIK some electronic simulations require 64-bit and are migrating to 128-bit now.

Well, I brought up the power consumption of OP’s A10 as potentially indicative of something running on the GPU that uses some of the GPU memory. So the point was germane to the discussion of GPU memory usage. I have since learned that use of clock locking can lead to significant power consumption of idling GPUs, making the observation somewhat of a red herring.

I don’t quite recall when NVIDIA first added power management to their GPUs, it might well have been with the Kepler generation. Initially, GPU power management was quite crude and not nearly as sophisticated as today, where GPUs provide fine-grained voltage and frequency stepping, internal functional blocks that can be turned on or off as needed, and dynamically configured PCIe interfaces, all adapting to closely monitored power usage, thermals, and voltage stability.

The K80 was designed as a high performance SKU and is also a dual GPU design. The emphasis on performance / watt that drives NVIDIA’s newer architectures was at best present in rudimentary form. So the K80 is a power hog. I would not recommend using one today. But even with the latest hardware and power management, a hypothetical high-end dual-GPU design (like the K80) might well be idling at 30+W.

As I understand it, the focus on power efficiency in modern GPUs is driven by the needs of the supercomputing market where the professional GPU lines are concerned, and by government regulations (in particular the EU, but also places like California) as it pertains to consumer GPUs. The dynamic clocking schemes that are part of the power management also provide some incremental performance gains by allowing for the exploitation of engineering margin (which traditionally was around 20% and was the target of overclockers).