I have a GeForce RTX 2070 Super on a Linux Certified laptop running
CentOS Linux release 8.2.2004 (Core)
Kernel 4.18.0-193.14.2.el8_2.x86_64
NVIDIA Driver Version: 450.57
The kernel leaks memory, maybe 1 GB a day. After a few days,
I have to reboot. I'm suspicious the nvidia device driver
may be involved although I am not doing any heavy graphics.
As far as I can tell, my graphics/display is working fine.
I've been looking for release notes or bug lists that might
discuss this, but haven't found anything. Maybe Google is
letting me down. Could this be a known problem, with a
fix in the works? Or maybe I've unintentionally ended up
test-piloting a new hardware/software combination.
Memory reported as 'used' by top keeps ratcheting upwards.
Ditto the 'used' column reported by free.
Likewise, MemAvailable reported by /proc/meminfo keeps ratcheting downwards.
Stopping all applications, or logging out, of course reduces 'used' and
increases MemAvailable, but never quite back to where it was the day before.
I've been wondering about this for a couple of months and think I've
eliminated tmpfs, slab memory, shared memory, and the memory reported for
individual user processes. They all look reasonable.
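In case it helps, the checks I used looked roughly like this (a sketch; all
standard tools on CentOS 8):

```shell
# Slab and shared memory, straight from the kernel (values in kB):
grep -E '^(Slab|SReclaimable|SUnreclaim|Shmem):' /proc/meminfo

# tmpfs usage (covers /dev/shm, /run, and friends):
df -h -t tmpfs

# Total resident memory across all processes (kB), to compare against "used".
# Note RSS double-counts shared pages, so this overestimates somewhat.
ps -eo rss= | awk '{sum += $1} END {print "total RSS:", sum, "kB"}'
```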
The "Mem:" line from free (values in MiB), reported every hour, even at night
when the laptop is basically idle, looks like

        total   used  free  shared  buff/cache  available
Mem:    31828  22778   619     338        8431       8252
Mem:    31828  22947   443     338        8437       8082
Mem:    31828  23049   343     338        8435       7981
Mem:    31828  23000   408     338        8420       8030
Mem:    31828  23103   377     338        8348       7927
Mem:    31828  23300   308     338        8220       7730
Mem:    31828  23147   500     338        8180       7883
Mem:    31828  23285   400     338        8143       7745
Mem:    31828  23489   275     338        8063       7541
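For anyone who wants to script the hourly sampling, roughly the same figures
can be read straight out of /proc/meminfo:

```shell
# One sample: total / free / available in MiB, computed from /proc/meminfo.
# Run hourly from cron (e.g. "0 * * * *") to reproduce the log above.
printf '%s  ' "$(date '+%F %T')"
awk '/^MemTotal:/     {t = $2}
     /^MemFree:/      {f = $2}
     /^MemAvailable:/ {a = $2}
     END {printf "total %d  free %d  available %d  (MiB)\n",
          t/1024, f/1024, a/1024}' /proc/meminfo
```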
Do you know of anybody who might have tried CentOS 8, kernel
4.18.0-193.14.2.el8_2.x86_64, and nvidia driver 450.57? Would you
expect a combination like this to work? I'm worried I've accidentally
gotten out on the bleeding edge. Is anybody else seeing similar
behavior?
If there are specific statistics that might shed light I can try to
collect them.
RHEL and clones like CentOS/Alibaba Linux/Scientific Linux plus nvidia is a very common setup in science and on compute clouds, so there should be no problems to expect from that combination.
Please first check whether a kernel update is available by running a system update (dnf update on CentOS 8).
To start on kernel memory analysis, look into turning on kmem tracing and using the kmemleak facility. Those should give you a hint about where to look.
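For reference, kmemleak is only available when the kernel was built with
CONFIG_DEBUG_KMEMLEAK=y; if it is, the usage is roughly this (a sketch; needs
root and debugfs mounted):

```shell
# Check whether the running kernel has kmemleak compiled in:
grep CONFIG_DEBUG_KMEMLEAK "/boot/config-$(uname -r)"

# If it reports =y, trigger a scan and read back any suspected leaks:
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
```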
Follow-up for anybody who finds this thread: I was unsuccessful in turning
on kmemleak (I think it has to be compiled into the kernel), and I ran into
problems updating the kernel as well.
With the input that I had a standard configuration and there were no
known large nvidia memory leaks, we looked further afield. We
installed acpid.x86_64 to clean up an nvidia warning, noticed a
hyperactive kworker/kacpid process, and started looking for an acpi
memory-leak angle. That led us to try

echo "disable" > /sys/firmware/acpi/interrupts/gpe6F

as a workaround for a possible acpi memory leak, and lo and behold,
the system has continued to run fine and the leak seems to have stopped.
Likely no nvidia angle to this at all! Will continue to monitor.
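For anyone hunting the same thing on their own hardware: the runaway GPE shows
up as a rapidly climbing counter under /sys/firmware/acpi/interrupts, and the
echo above does not survive a reboot. A sketch of both (gpe6F is specific to
this laptop; yours will likely differ):

```shell
# List the GPE counters sorted by count; run this twice a few seconds apart
# and the one climbing by thousands stands out.
grep . /sys/firmware/acpi/interrupts/gpe[0-9A-F]* 2>/dev/null \
    | sort -t: -k2 -rn | head

# One way to persist the disable across reboots is an @reboot cron entry
# (crontab -e as root); a systemd oneshot unit would work equally well:
#   @reboot echo disable > /sys/firmware/acpi/interrupts/gpe6F
```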