how to effectively free large memory allocation

When working with large memory allocation, I face the situation that it ran out of memory even if I used cudaFree() function for
each cudaMalloc( ). Usually, I would reboot my desktop to free all the memory allocated, it works fine but it bothers me to do the reboot. I guess other people also have this kind of situation like mine. e.g., the bottom comment in this post

I found a related post on stack overflow which provides some evidences why cudaFree() doesn’t work.

My question is:
How to effectively free large memory allocation other than rebooting?

Thank you.

Can you post a minimal, self-contained, buildable program that reproduces the behavior you are reporting? It is not at all clear what sequence of events leads up to your system “running out of memory”. What does “reboot the desktop” mean? What operating system are you on? What GPU? What version of CUDA? What driver?

It shouldn’t be necessary to restart the computer. If you terminate an application, all of the memory should be freed, and the pool should return to its normal, unallocated state.

If you have to reboot the machine, it likely means that you have a CPU process that was using the GPU that has not properly been terminated. nvidia-smi may give some clue as to which process is still “holding on” to GPU memory. In linux you can use variations of the ps command to also identify unexpected processes.

Other alternatives to rebooting might be the use of cudaDeviceReset() (assuming you call it before application exit) and/or the nvidia-smi -r command (use nvidia-smi --help for exact command usage)

@njuffa thank you for your reply and sorry for any confusion.

Below is the configuration of my system:
OS: Linux
CUDA version: CUDA 7.0
Driver version: 346.96
GPU: Nvidia NVS 315

Reboot the desktop means restart the computer.

Here is a little program which leads to my system “running out of memory”.
cudaMalloc a size of 20M floating point array, and then deallocate the array using cudaFree(). The total size of the array is 20M*4 = 80MB. The total memory in Nvidia NVS 315 is 1 GB.

Run this simple program 10 consecutive times, it gets to the situation where run out of memory.

This post also has a simple vector addition example to illustrate this behavior.

From the post, the system fails to allocate device vector (running out of memory) when 84 of vector addition already running

@txbob Thank you for the information.

I also thought of cudaDeviceReset(), probably I put it in a wrong place rather than before application exit.

nvidia-smi is a good tool to control my GPU which I barely use. Thank you.

It’s not clear how you are running this application. Your description seems to indicate that you are running 84 instances concurrently, in which case we would obviously see 84 times the GPU memory use of a single instance, and depending on how much memory your GPU has and a single instance of the program uses the N+1 instance of the the app would encounter an out-of-memory condition.

If, on the other hand, you run the app 84 times consecutively, the free GPU memory at the end of that exercise should equal the free GPU memory before the first run, unless the driver has a memory leak, but I see no evidence of that when running on my system (which is very different from yours).

I don’t witness any such behavior, if you allow the application to exit.

I have a NVS310 (has 512MB of memory instead of 1GB like NVS315) on CUDA 7 on Centos 6.2:

$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVS 310"
  CUDA Driver Version / Runtime Version          7.0 / 7.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 512 MBytes (536543232 bytes)
  ( 1) Multiprocessors, ( 48) CUDA Cores/MP:     48 CUDA Cores
  GPU Max Clock rate:                            1046 MHz (1.05 GHz)
  Memory Clock rate:                             875 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 65536 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = NVS 310
Result = PASS

I created the following program according to your description:

#include <stdio.h>
#include <stdlib.h>
#define DSIZE_MAX 100000000
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(int argc, char *argv[]){
  if (argc < 2) {printf("must specify allocation size on command line\n"); return 1;}
  const int dsize = atoi(argv[1]);
  if ((dsize < 1)||(dsize > DSIZE_MAX)) {printf("invalid size %d\n", dsize); return 1;}

  int *data;
  cudaMalloc(&data, dsize*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaCheckErrors("cudaFree fail");
  return 0;

I compiled with:

nvcc -o t906

I can run it “many” times without any issues:

$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000
$ ./t906 20000000

and at the completion of the above, nvidia-smi still reports approximately zero memory used:

$ nvidia-smi
Wed Nov  4 10:31:35 2015
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  NVS 310             On   | 0000:04:00.0     N/A |                  N/A |
| 30%   38C    P8    N/A /  N/A |      3MiB /   511MiB |     N/A      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0              C   Not Supported                                         |

Note that the thread you linked for the vector add app is acknowledging that the problem does not occur if you allow the application to exit (or use cudaDeviceReset()). The problem being discussed there is likely some sort of memory fragmentation issue during application runtime. When an application exits properly, all of its resources are freed properly and there should be no residual fragmentation issues. And I don’t witness any with my test case according to your description.

I think I am catching on. So here is what I see (Windows 7, CUDA 7.5, driver 354.13):

Even after cudaFree() has been called on all allocations and cudaDeviceReset() has been called, but while the application is waiting for a key press to terminate, nvidia-smi shows the allocated GPU memory still in use. Only when the app exits after the keypress does nvidia-smi show the memory as free. I tried de-allocating the GPU memory blocks in various orders, and with and without the call to cudaDeviceReset(), the results are consistent. So I don’t think there is an issue with fragmentation. Since all allocated memory blocks are freed, fragmentation should not come into play?

While I was aware that CUDA may physically free memory somewhat lazily, which has been the case from the earliest days of CUDA (presumably since there are sub-allocators stacked upon sub-allocators throughout the entire driver and run-time stack), I am surprised that calling cudaDeviceReset() does not cause the GPU memory allocated by the app to be returned for good at that point. That might mean an allocator below the CUDA driver level is holding on to the memory.

This analysis assumes that one can trust the output of nvidia-smi with regard to memory use reported. But since memory is freed when the application terminates, I do not see evidence of a GPU memory leak that would require a reboot of the machine.

[Later:] Checking the thread on Stackoverflow referenced above, I see that it nicely addresses the reason why sub-allocators generally do not return blocks allocated from a parent allocator immediately after all sub-allocations in that block have been freed: allocating blocks from the parent allocator is expensive, and so the sub-allocator hangs on the block to be able to satisfy future allocation requests made to it at low cost.

Your comments and these threads I linked allows me to connect them together to be able to understand them.

One important lesson is: It isn’t required to reboot the machine if the process is properly terminated.

Thank you.