GPU memory cannot be released

Hello.

Recently I ran into a weird problem with PyTorch multi-GPU training. The machine has 8 GPU cards. After running a PyTorch training program for a while, I stopped it with Ctrl+C and then checked the cards with nvidia-smi. Everything looked good.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.46                 Driver Version: 390.46                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1F:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:20:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:21:00.0 Off |                    0 |
| N/A   33C    P0    23W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
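Incidentally, the same per-card numbers can be pulled in machine-readable form with nvidia-smi's query mode (the `--query-gpu`/`--format=csv` flags are standard nvidia-smi options; the small parsing helper below is just my own sketch):

```python
import csv
import io
import subprocess

def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV output into a list of {column: value} dicts."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (c.strip() for c in row))) for row in rows[1:]]

def query_gpu_memory():
    """Query per-GPU memory usage (requires nvidia-smi on the PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.free",
         "--format=csv"],
        universal_newlines=True)  # text-mode output
    return parse_gpu_memory(out)
```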

I also wrote a small check.cu program to check the GPU memory.

#include <iostream>
#include "cuda.h"
#include "cuda_runtime_api.h"
  
using namespace std;
  
int main( void ) {
    int num_gpus = 0;
    size_t free_mem, total_mem;
    if ( cudaGetDeviceCount( &num_gpus ) != cudaSuccess ) {
        cerr << "cudaGetDeviceCount failed" << endl;
        return 1;
    }
    for ( int gpu_id = 0; gpu_id < num_gpus; gpu_id++ ) {
        cudaSetDevice( gpu_id );
        int id;
        cudaGetDevice( &id );
        // cudaMemGetInfo reports the current device's free/total memory in bytes.
        cudaMemGetInfo( &free_mem, &total_mem );
        cout << "GPU " << id << " memory: free=" << free_mem << ", total=" << total_mem << endl;
    }
    return 0;
}

The output also looked good.

GPU 0 memory: free=16488464384, total=16945512448
GPU 1 memory: free=16488464384, total=16945512448
GPU 2 memory: free=16488464384, total=16945512448
GPU 3 memory: free=16488464384, total=16945512448
GPU 4 memory: free=16488464384, total=16945512448
GPU 5 memory: free=16488464384, total=16945512448
GPU 6 memory: free=16488464384, total=16945512448
GPU 7 memory: free=16488464384, total=16945512448
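As a sanity check, those byte counts line up with the 16160MiB capacity that nvidia-smi reports, once converted (quick arithmetic):

```python
# Values copied from the cudaMemGetInfo output above (bytes).
total_bytes = 16945512448
free_bytes = 16488464384

# Convert bytes -> MiB to compare with nvidia-smi, which reports MiB.
print(total_bytes / 2**20)  # 16160.5 -- nvidia-smi rounds this down to "16160MiB"
print(free_bytes / 2**20)   # 15724.625 -- nearly the whole card is free
```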

Then I tried to create a one-element CUDA tensor with the script below (saved as check.py), and an out-of-memory (OOM) error occurred.

import torch
import numpy as np
 
if __name__ == '__main__':
    x = np.random.randn(1)
    try:
        t = torch.cuda.FloatTensor(x)
        print('Success!')
    except Exception as e:
        print(e)

Only GPU 2 hit the OOM error.

$ CUDA_VISIBLE_DEVICES=0 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=1 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=2 python3 check.py 
CUDA error: out of memory
$ CUDA_VISIBLE_DEVICES=3 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=4 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=5 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=6 python3 check.py 
Success!
$ CUDA_VISIBLE_DEVICES=7 python3 check.py 
Success!

I also tried to initialize some data on the GPU cards using TensorFlow, and GPU 2 again complained about OOM. So I believe the memory on GPU 2 was never actually released. I tried running

killall python

to kill all the Python processes, but the problem persisted. Reinstalling PyTorch did not fix it either. Restarting the machine did work once, but restarting every time is not feasible since I was running it on a server…
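One more check I can think of is whether some process still holds the /dev/nvidia* device files open — a process stuck in uninterruptible sleep would survive killall and might not show up in nvidia-smi's process list. `fuser -v /dev/nvidia*` reports this; the stdlib-only sketch below (my own helper, not from any library) does the same by scanning /proc:

```python
import os

def find_device_holders(device_prefix="/dev/nvidia"):
    """Map pid -> process name for processes with an open fd on the GPU
    device files. A leftover process stuck here can pin GPU memory even
    when nvidia-smi lists no running processes."""
    holders = {}
    if not os.path.isdir("/proc"):  # Linux-only check
        return holders
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        fd_dir = "/proc/" + pid + "/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or not ours to inspect
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(device_prefix):
                try:
                    with open("/proc/" + pid + "/comm") as f:
                        holders[int(pid)] = f.read().strip()
                except OSError:
                    holders[int(pid)] = "?"
                break
    return holders

if __name__ == "__main__":
    for pid, name in sorted(find_device_holders().items()):
        print(pid, name)
```

If this turned up a PID that killall missed, `kill -9` on it (or `fuser -k` on the device file, as root) would presumably be the next thing to try.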

Could anyone please shed some light on this problem? By the way, I was using PyTorch 0.4.1 and CUDA 9.0, and the program I ran was https://github.com/CSAILVision/semantic-segmentation-pytorch.