Tesla K40 1TB RAM problem

Duno · July 6, 2017, 8:06pm

Hello, I have a custom server that I built with 3 Tesla K40s for some high-intensity simulation processing. The system blue-screens any time I have the full 1TB of RAM installed, and I have determined this to be the fault of the K40 driver.

The problem is that the system is a quad-channel system. If I remove a stick of RAM and bring the system down to 960GB, memory access degrades to a single-channel configuration, which hinders performance noticeably (about 20%). I can bring the system down to 512GB of RAM and everything runs optimally, but then I can’t run the larger simulations that need the 1TB of memory space, which is what I built the system for. 20% doesn’t seem like much, but in this instance it is a measure of days, and some of these simulations are time-critical.

I was wondering if there is an environment variable (or something like it) for the K40 that I could manually set so that it only has access to 960GB of the system RAM. That way I could have the full 1TB plugged into the system, the motherboard/processors would still operate in the quad-channel configuration, and the system wouldn’t blue-screen due to the NVIDIA driver limitation.

Any assistance would be appreciated, thanks in advance!

tera · July 6, 2017, 8:19pm

Have you thought about trying Linux instead?

njuffa · July 6, 2017, 8:41pm

[s]If you have not done so yet, I would highly recommend reporting this as a bug to NVIDIA. Even companies the size of NVIDIA do not routinely have Windows systems with 1 TB of system memory sitting around in their QA departments (or anywhere for that matter).

I am curious how you determined conclusively that the GPU driver is at fault. It seems possible for a TCC driver to cause a system panic, but I would have considered an OS component a more likely source of that, or instability of the hardware.[/s]

Robert_Crovella · July 6, 2017, 9:18pm

It’s not (primarily) a Windows or Linux issue. It’s a limitation of the K40 and all pre-Pascal GPUs that have a 40-bit TLB map (and is, to some extent, system BIOS dependent).

You’ll be limited to 1TB of memory (or 512GB for Fermi GPUs, not in view here), and that is only achievable in special situations. For most typical situations and the way most server BIOSes work, you are limited to less than 1TB. Here’s some documentation of this issue:

[url]https://us.download.nvidia.com/XFree86/Linux-x86/331.20/README/addressingcapabilities.html[/url]

AFAIK there are no environment variables that can work around this or modify it. It’s a function of where the system assigns resources that the K40 needs, above or below the 1 TB barrier. You may be able to find some system BIOS entries that affect mapping, and/or OS config parameters, and it may be worth a try, but I’ve not personally worked thru the process.

Pascal (and future) GPUs should not have this limitation. They have something like 49 bits of TLB map range:

[url]https://devblogs.nvidia.com/parallelforall/inside-pascal/[/url]

"GP100 extends GPU addressing capabilities to enable 49-bit (512 TB) virtual memory addressing (note that GP100 also supports 47-bit (128 TB) physical memory addressing). "

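For reference, the sizes quoted above follow directly from the bit widths. Here is a minimal Python sketch of that arithmetic. The helper name `addressable_tib` is made up for illustration, and the script only does the math; it does not query the GPU, the driver, or the BIOS memory map:

```python
def addressable_tib(bits: int) -> float:
    """Size of a 2**bits byte address space, expressed in TiB (2**40 bytes)."""
    return 2 ** bits / 2 ** 40


# Bit widths taken from the posts and the quoted Pascal article above.
cases = [
    ("K40 / pre-Pascal TLB map", 40),
    ("GP100 physical addressing", 47),
    ("GP100 virtual addressing", 49),
]

for label, bits in cases:
    print(f"{label}: {bits}-bit -> {addressable_tib(bits):g} TiB")

# K40 / pre-Pascal TLB map: 40-bit -> 1 TiB
# GP100 physical addressing: 47-bit -> 128 TiB
# GP100 virtual addressing: 49-bit -> 512 TiB
```

A 40-bit range tops out at exactly 1 TiB, so with a full 1TB of RAM installed there is no room left under that barrier for anything else the system maps there. That fits the point above that the usable limit is normally less than 1TB.
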
Duno · July 6, 2017, 9:36pm

Thanks for the input, everyone. Linux isn’t an option, but it’s not a Windows issue. The K40 driver limitation is known to NVIDIA.

Thx TxBob, yes, the P100s are the end solution to my issue, but I have a great number of sims that need to run between now and when I get funding for a new $60,000 server, so I was hoping there was a stop-gap measure I could institute in the meantime. The open-source community is usually better at these things than the company itself, so here I am. lol.

njuffa · July 6, 2017, 9:48pm

Are there steps you can take in the configuration of your simulation software to reduce memory footprint? Most sophisticated simulation environments I have seen come with myriad configuration switches covering all kinds of potential tradeoffs, memory footprint often being one of them (not many people have a 1 TB system at their disposal).
