Huge memory leak

Hi,

I have a strange memory leak on Linux (4.4.0-31-generic) with driver 352.93 and a Tesla K20m.
A hello-world-style program allocates about 70 MB of memory in the OS and doesn't free it after the program exits.

Here is the code:
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    //cudaSetDevice(0);
    int *XiXj_d;

    // Allocate and immediately free a single int on the device.
    cudaMalloc(&XiXj_d, 1 * sizeof(int));
    cudaFree(XiXj_d);
}

Run:
user@cuda2:~/bug$ free -m
             total       used       free     shared    buffers     cached
Mem:         15924        380      15544          1         29        118
-/+ buffers/cache:         232      15692
Swap:            0          0          0

user@cuda2:~/bug$ ./bug

user@cuda2:~/bug$ free -m
             total       used       free     shared    buffers     cached
Mem:         15924        459      15464          1         29        125
-/+ buffers/cache:         305      15619
Swap:            0          0          0

It seems that the driver or the Linux kernel doesn't free the memory. Any idea what is going on here?

Based on your driver version (352.93), I imagine you are using CUDA 7.5.

CUDA 7.5 is not compatible with kernel 4.4.

The official support matrix for CUDA 7.5 is listed here:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

I would recommend that you switch to an officially supported setup.

CUDA 8 RC1 advertises support for kernel version 4.4 on Ubuntu 16.04.

Hi,
I am experiencing the same problem when I use Caffe.
I have a Tesla K80, Ubuntu 14.04.

txbob,

Unfortunately switching to the official kernel (3.13.0-92-generic #139-Ubuntu) for Ubuntu 14.04 doesn’t help. The problem still exists.

Does the memory decrease by 70MB each time you run the program? Or does this only happen once?

I wasn't able to observe it with CUDA 7.5 on Ubuntu 14.04:

$ cat t1.cu
//#include <cuda.h>

int main(int argc, char* argv[])
{
//cudaSetDevice(0);
int *XiXj_d;

cudaMalloc(&XiXj_d, 1 * sizeof(int));
cudaFree(XiXj_d);
}
$ nvcc t1.cu -o t1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
$ free -m
             total       used       free     shared    buffers     cached
Mem:         24105       2156      21949          6         95       1567
-/+ buffers/cache:        493      23612
Swap:        24571          0      24571
bob@03c212a19ace:~/misc$ ./t1
bob@03c212a19ace:~/misc$ free -m
             total       used       free     shared    buffers     cached
Mem:         24105       2154      21951          6         95       1567
-/+ buffers/cache:        491      23614
Swap:        24571          0      24571
$ uname -a
Linux 03c212a19ace 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$

I just compiled and ran the test program, and I'm seeing the same thing with CUDA 6.5, NVIDIA driver 346.35, and Ubuntu 14.04 (3.13.0-92-generic). My system seems to lose ~20 MB every time I run the program:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
$ nvcc t1.cu -o t1
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15039       9796       5243          0        183        169
-/+ buffers/cache:       9442       5597
Swap:            0          0          0
$ ./t1
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15039       9815       5224          0        183        169
-/+ buffers/cache:       9461       5578
Swap:            0          0          0

I wonder if this is the cause of the memory leak I’ve been dealing with:

http://serverfault.com/questions/791838/baffling-memory-leak-what-is-using-10gb-of-memory-on-this-system?

We have other (older) machines that seem to be running fine, but maybe something changed recently on Ubuntu?

I'm not a CUDA developer, but FWIW, I just added "cudaDeviceReset();" to the end of the test program and recompiled so I could test with cuda-memcheck, and that seems to have made the leak go away:

$ cuda-memcheck --tool memcheck --leak-check full ./t1
========= CUDA-MEMCHECK
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15039       9933       5106          0        184        169
-/+ buffers/cache:       9579       5460
Swap:            0          0          0
$ ./t1
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15039       9933       5106          0        184        169
-/+ buffers/cache:       9579       5460
Swap:            0          0          0
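
In case it's useful, here's roughly what my modified test program looks like (a minimal sketch; the only change from t1.cu above is the cudaDeviceReset() call at the end):

#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    int *XiXj_d;

    cudaMalloc(&XiXj_d, 1 * sizeof(int));
    cudaFree(XiXj_d);

    // Tear down the CUDA context before the process exits; this is
    // what appears to prevent the host memory from being lost.
    cudaDeviceReset();
}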

If you kill the process without letting it finish, the error will still be there, right?

I just came across this: http://askubuntu.com/questions/731677/out-of-memory-issue

After upgrading to version 367.35 of the NVIDIA driver, I can’t reproduce the problem with the test program anymore. Now I just have to wait and see if this fixes the memory leak on my production servers…

Hi @mconigliaro, what GPU do you have?

This is a g2.2xlarge instance on EC2.

# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Tue Jul 26 20:21:38 2016
Driver Version                      : 367.35

Attached GPUs                       : 1
GPU 0000:00:03.0
    Product Name                    : GRID K520
    Product Brand                   : Grid
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0321314043755
    GPU UUID                        : GPU-4f723e2d-a35f-51cc-bda4-5c1192b8c968
    Minor Number                    : 0
    VBIOS Version                   : 80.04.D4.00.03
    MultiGPU Board                  : No
    Board ID                        : 0x3
    GPU Part Number                 : 900-12055-0020-000
    Inforom Version
        Image Version               : 2055.0052.00.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : Pass-Through
    PCI
        Bus                         : 0x00
        Device                      : 0x03
        Domain                      : 0x0000
        Device Id                   : 0x118A10DE
        Bus Id                      : 0000:00:03.0
        Sub System Id               : 0x101410DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 4036 MiB
        Used                        : 0 MiB
        Free                        : 4036 MiB
    BAR1 Memory Usage
        Total                       : 128 MiB
        Used                        : 2 MiB
        Free                        : 126 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 52 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 39.44 W
        Power Limit                 : 125.00 W
        Default Power Limit         : 125.00 W
        Enforced Power Limit        : 125.00 W
        Min Power Limit             : 85.00 W
        Max Power Limit             : 130.00 W
    Clocks
        Graphics                    : 797 MHz
        SM                          : 797 MHz
        Memory                      : 2500 MHz
        Video                       : 810 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 797 MHz
        SM                          : 797 MHz
        Memory                      : 2500 MHz
        Video                       : 810 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

Thanks. Is it the same as on your servers?

Yeah, I was testing on an instance that looks exactly like my production servers.

Yes, each time. This is a serious issue because it is very easy to hang a server from an unprivileged user account. The Linux OOM killer doesn't help, and the server has to be rebooted.

I wasn't able to reproduce it with CUDA 7.5, Ubuntu 14.04, and driver 352.93, which seems to match your setup, so there must be some other missing piece to the puzzle. Anyway, others here have reported different behavior with different drivers, so you might try some newer drivers besides 352.93.

Beyond that, you can always file a bug at developer.nvidia.com.

It looks like the same situation. I also checked slabtop, but found no answer there. I believe the bug is somewhere in the Linux kernel or the NVIDIA driver.

Surprisingly, cudaDeviceReset() resolves this issue, but it seems to be a workaround. Thank you, mconigliaro, for the solution!

I have the problem on a server without root access. I usually run a process with nohup, but sometimes I just need to kill the process.
If I make a call to cudaDeviceReset() (in a new process) after killing the original process, will the lost memory be recovered?
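
To make the question concrete, the new process I have in mind would be something like this (just a sketch; whether it actually reclaims the memory lost by a killed process is exactly what I'm asking):

#include <cuda_runtime.h>

int main()
{
    // Establish a context on device 0, then reset it. The open question
    // is whether this also releases the resources left behind by the
    // process that was killed earlier.
    cudaSetDevice(0);
    cudaDeviceReset();
    return 0;
}

Thanks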