Make_default_context() wasn't able to create a context

Description

I am facing the following error:

"RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices"

This process runs in Kubernetes inside a pod. Sometimes restarting the pod resolves the issue, and sometimes we need to move the process to a different GPU.


Traceback
from FR.src.RetinaFace_trt import Retinaface_trt
  File "/data/JF/FR/src/RetinaFace_trt.py", line 14, in <module>
    import pycuda.autoinit
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 9, in <module>
    context = make_default_context()
  File "/usr/local/lib/python3.6/dist-packages/pycuda/tools.py", line 204, in make_default_context
    "on any of the %d detected devices" % ndevices)
RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices

Environment

TensorRT Version: tensorrt==7.1.3.4
GPU Type: T4
Nvidia Driver Version: 460.32.03
CUDA Version: 11.2
CUDNN Version: 8.0.4
Operating System + Version: Ubuntu 18.04.6 LTS
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.4.0
Baremetal or Container (if container which image + tag):

import ctypes
import os
import random
import sys
import threading
import time
import dlib
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import torchvision

The error is raised by the import of pycuda.autoinit itself, before any of our own code runs.

It seems that PyCUDA detected the device but was unable to create a context on it, and so it cannot initialize. Check that PyCUDA is properly installed and configured on your system.
You can try reinstalling PyCUDA and making sure that your environment variables are set up correctly.
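
One way to narrow this down: make_default_context() in pycuda/tools.py catches the CUDA error raised for each device and then raises only the generic RuntimeError shown in the traceback. A minimal sketch that creates the context by hand and reports the underlying error could look like the following. The function name probe_contexts is made up for illustration, and the sketch assumes it is run inside the affected pod.

```python
def probe_contexts():
    """Try to create a CUDA context on every device and report the outcome.

    Returns a list of (device_index_or_stage, error_or_None) tuples.
    probe_contexts is a hypothetical helper, not part of PyCUDA.
    """
    try:
        import pycuda.driver as cuda
    except (ImportError, OSError) as exc:
        # PyCUDA or the CUDA driver library could not be loaded at all.
        return [("import", repr(exc))]

    try:
        cuda.init()
    except cuda.Error as exc:
        return [("init", repr(exc))]

    results = []
    for index in range(cuda.Device.count()):
        try:
            ctx = cuda.Device(index).make_context()
        except cuda.Error as exc:
            # This is the error that make_default_context() hides.
            results.append((index, repr(exc)))
        else:
            ctx.pop()  # release the context again
            results.append((index, None))
    return results


if __name__ == "__main__":
    for where, error in probe_contexts():
        print(where, "OK" if error is None else error)
```

If the reported error is an out-of-memory error, the GPU is probably shared with another process; other error types point toward the device or driver state rather than the PyCUDA installation.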
Alternatively, you can try using CUDA through PyTorch (torch.cuda) instead of PyCUDA.
https://pytorch.org/docs/stable/cuda.html
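
As a rough sketch of that check, the snippet below asks PyTorch whether it can see the GPU and then allocates a tiny tensor, which forces PyTorch to create its own CUDA context. The helper name torch_cuda_status is an assumption for illustration; the torch.cuda calls are the ones documented at the link above.

```python
def torch_cuda_status():
    """Report whether PyTorch can see and actually use the GPU.

    torch_cuda_status is a hypothetical helper name.
    """
    status = {
        "torch_installed": False,
        "available": False,
        "device_count": 0,
        "devices": [],
        "context_error": None,
    }
    try:
        import torch
    except ImportError:
        return status

    status["torch_installed"] = True
    status["available"] = torch.cuda.is_available()
    if not status["available"]:
        return status

    status["device_count"] = torch.cuda.device_count()
    status["devices"] = [
        torch.cuda.get_device_name(i) for i in range(status["device_count"])
    ]
    try:
        # is_available() only says a device is visible; allocating a
        # tensor forces context creation on device 0.
        torch.zeros(1, device="cuda:0")
    except RuntimeError as exc:
        status["context_error"] = str(exc)
    return status


if __name__ == "__main__":
    print(torch_cuda_status())
```

If PyTorch also fails at the allocation step, the problem is likely not specific to PyCUDA.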

The issue is not the CUDA driver, because simply restarting the pod resolves it temporarily.

Could the NVIDIA team kindly suggest a permanent solution for this?
Reinstalling the drivers is not the way forward.

Regards,
Amit Dube

This could be due to not having enough free GPU memory available to create a new context.
Check the GPU's usage to see whether it is overloaded or whether too little free memory is left.
You can also check for other processes using the GPU and terminate them if necessary.
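
A minimal sketch of such a check, assuming nvidia-smi is available inside the pod: it queries used and total memory per GPU and parses the CSV output. The helper names parse_memory_csv and query_gpu_memory are made up for illustration, and the code sticks to Python 3.6 compatible subprocess arguments.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse 'index, used, total' CSV lines (MiB) into integer tuples."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used), int(total)))
    return rows


def query_gpu_memory():
    """Return [(gpu_index, used_mib, total_mib), ...] or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=index,memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_memory_csv(completed.stdout)


if __name__ == "__main__":
    memory = query_gpu_memory()
    if memory is None:
        print("nvidia-smi is not available in this environment")
    else:
        for index, used, total in memory:
            print("GPU %d: %d / %d MiB used" % (index, used, total))
```

To list the processes holding memory, nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv can be run as well. Inside a pod it may not show processes that belong to other containers, so running it on the node itself is likely more informative.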

Thank you.