Hi,
I have a strange memory leak on Linux (4.4.0-31-generic), driver 352.93, Tesla K20m.
A hello-world-like program causes the OS to allocate about 70 MB of memory that is not freed after the program exits.
Here is the code:
#include <cuda.h>
int main(int argc, char* argv[])
{
    //cudaSetDevice(0);
    int *XiXj_d;
    cudaMalloc(&XiXj_d, 1 * sizeof(int));  // allocate a single int on the device
    cudaFree(XiXj_d);                      // and free it again
}
Run:
user@cuda2:~/bug$ free -m
total used free shared buffers cached
Mem: 15924 380 15544 1 29 118
-/+ buffers/cache: 232 15692
Swap: 0 0 0
user@cuda2:~/bug$ ./bug
user@cuda2:~/bug$ free -m
total used free shared buffers cached
Mem: 15924 459 15464 1 29 125
-/+ buffers/cache: 305 15619
Swap: 0 0 0
It seems that the driver or the Linux kernel doesn't free the memory. Any idea what is going on here?
Based on your driver version (352.93), I imagine you are using CUDA 7.5.
CUDA 7.5 is not compatible with kernel 4.4.
The official support matrix for CUDA 7.5 is listed here:
Installation Guide Linux :: CUDA Toolkit Documentation
I would recommend that you switch to an officially supported setup.
CUDA 8 RC1 advertises support for kernel version 4.4 on Ubuntu 16.04.
Hi,
I am experiencing the same problem when I use Caffe.
I have a Tesla K80, Ubuntu 14.04.
txbob,
Unfortunately, switching to the official kernel (3.13.0-92-generic #139-Ubuntu) for Ubuntu 14.04 doesn't help. The problem still exists.
Does the free memory decrease by 70 MB each time you run the program, or does this only happen once?
I wasn’t able to observe it on CUDA 7.5 on Ubuntu 14.04:
$ cat t1.cu
//#include <cuda.h>
int main(int argc, char* argv[])
{
//cudaSetDevice(0);
int *XiXj_d;
cudaMalloc(&XiXj_d, 1 * sizeof(int));
cudaFree(XiXj_d);
}
$ nvcc t1.cu -o t1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
$ free -m
total used free shared buffers cached
Mem: 24105 2156 21949 6 95 1567
-/+ buffers/cache: 493 23612
Swap: 24571 0 24571
bob@03c212a19ace:~/misc$ ./t1
bob@03c212a19ace:~/misc$ free -m
total used free shared buffers cached
Mem: 24105 2154 21951 6 95 1567
-/+ buffers/cache: 491 23614
Swap: 24571 0 24571
$ uname -a
Linux 03c212a19ace 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$
I just compiled and ran the test program, and I'm seeing the same thing with CUDA 6.5, NVIDIA driver 346.35, and Ubuntu 14.04 (3.13.0-92-generic). My system seems to lose ~20 MB every time I run the program:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
$ nvcc t1.cu -o t1
$ free -m
total used free shared buffers cached
Mem: 15039 9796 5243 0 183 169
-/+ buffers/cache: 9442 5597
Swap: 0 0 0
$ ./t1
$ free -m
total used free shared buffers cached
Mem: 15039 9815 5224 0 183 169
-/+ buffers/cache: 9461 5578
Swap: 0 0 0
I wonder if this is the cause of the memory leak I’ve been dealing with:
"Baffling memory leak. What is using ~10GB of memory on this system?" on Server Fault
We have other (older) machines that seem to be running fine, but maybe something changed recently on Ubuntu?
I’m not a CUDA developer, but FWIW, I just added “cudaDeviceReset();” to the end of the test program and recompiled so I could test with cuda-memcheck, and that seems to have made the leak go away:
$ cuda-memcheck --tool memcheck --leak-check full ./t1
========= CUDA-MEMCHECK
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors
$ free -m
total used free shared buffers cached
Mem: 15039 9933 5106 0 184 169
-/+ buffers/cache: 9579 5460
Swap: 0 0 0
$ ./t1
$ free -m
total used free shared buffers cached
Mem: 15039 9933 5106 0 184 169
-/+ buffers/cache: 9579 5460
Swap: 0 0 0
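For reference, this is roughly what the modified test program looks like with cudaDeviceReset() added at the end (a sketch of what I described; the rest is unchanged from t1.cu above):
// t1.cu with cudaDeviceReset() added; compile with: nvcc t1.cu -o t1
#include <cuda_runtime.h>   // nvcc includes this automatically, shown here for clarity
int main(int argc, char* argv[])
{
    int *XiXj_d;
    cudaMalloc(&XiXj_d, 1 * sizeof(int));  // allocate a single int on the device
    cudaFree(XiXj_d);                      // free it again
    cudaDeviceReset();                     // explicitly destroy the CUDA context before exit
}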
If you kill the process without letting it finish, the leak will still be there, right?
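As I understand it (an assumption on my part, not something tested in this thread), the reset only helps if the program actually reaches that call. A guard like this sketch would cover a normal exit and a plain SIGTERM, but never kill -9:
#include <csignal>
#include <cstdlib>
#include <unistd.h>
#include <cuda_runtime.h>

// Illustrative guard only: CUDA calls are not documented as async-signal-safe,
// and SIGKILL (kill -9) can never be caught, so this cannot cover every case.
static void reset_on_exit() { cudaDeviceReset(); }
static void handle_term(int) { cudaDeviceReset(); _exit(0); }

int main()
{
    std::atexit(reset_on_exit);        // runs on a normal return from main
    std::signal(SIGTERM, handle_term); // runs on a default "kill <pid>"

    int *XiXj_d;
    cudaMalloc(&XiXj_d, 1 * sizeof(int));
    cudaFree(XiXj_d);
    return 0;  // the atexit handler calls cudaDeviceReset() here
}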
I just came across this: "Out of Memory Issue" on Ask Ubuntu
After upgrading to version 367.35 of the NVIDIA driver, I can’t reproduce the problem with the test program anymore. Now I just have to wait and see if this fixes the memory leak on my production servers…
Hi @mconigliaro, what GPU do you have?
This is a g2.2xlarge instance on EC2.
# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Tue Jul 26 20:21:38 2016
Driver Version : 367.35
Attached GPUs : 1
GPU 0000:00:03.0
Product Name : GRID K520
Product Brand : Grid
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321314043755
GPU UUID : GPU-4f723e2d-a35f-51cc-bda4-5c1192b8c968
Minor Number : 0
VBIOS Version : 80.04.D4.00.03
MultiGPU Board : No
Board ID : 0x3
GPU Part Number : 900-12055-0020-000
Inforom Version
Image Version : 2055.0052.00.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : Pass-Through
PCI
Bus : 0x00
Device : 0x03
Domain : 0x0000
Device Id : 0x118A10DE
Bus Id : 0000:00:03.0
Sub System Id : 0x101410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 4036 MiB
Used : 0 MiB
Free : 4036 MiB
BAR1 Memory Usage
Total : 128 MiB
Used : 2 MiB
Free : 126 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 52 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 92 C
Power Readings
Power Management : Supported
Power Draw : 39.44 W
Power Limit : 125.00 W
Default Power Limit : 125.00 W
Enforced Power Limit : 125.00 W
Min Power Limit : 85.00 W
Max Power Limit : 130.00 W
Clocks
Graphics : 797 MHz
SM : 797 MHz
Memory : 2500 MHz
Video : 810 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 797 MHz
SM : 797 MHz
Memory : 2500 MHz
Video : 810 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
Thanks, is it the same as on your servers?
Yeah, I was testing on an instance that looks exactly like my production servers.
Yes, each time. This is a serious issue, because it is very easy to hang a server from an unprivileged user account. The Linux OOM killer doesn't help, and the server has to be rebooted.
I wasn’t able to reproduce it with CUDA 7.5, Ubuntu 14.04, and driver 352.93, which seems to match your setup, so there appears to be some other missing piece to the puzzle. Others here have reported different behavior with different drivers, so you might try some newer drivers besides 352.93.
Beyond that, you can always file a bug at developer.nvidia.com.
It looks like the same situation. I also checked slabtop, but found no answer. I believe the bug is somewhere in the Linux kernel or the NVIDIA driver.
It is surprising that cudaDeviceReset() resolves this issue, but it seems to be a workaround. Thank you, mconigliaro, for the solution!
I have the problem on a server without root access. I usually run the process with nohup, but sometimes I just need to kill it.
If I make a call to cudaDeviceReset() (in a new process) after killing the process, will the lost memory be recovered? (Something like the sketch below is what I have in mind.) Thanks
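A minimal standalone helper, for illustration only (the file name, and the assumption that device 0 is the GPU the killed process was using, are mine):
// reset.cu -- hypothetical standalone helper; compile with: nvcc reset.cu -o reset
#include <cuda_runtime.h>
int main()
{
    cudaSetDevice(0);    // attach to the GPU the killed process was using
    cudaDeviceReset();   // note: per the CUDA docs, this resets state for the current process only
    return 0;
}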