HELP: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

I recently installed an NVIDIA driver on a server machine. The driver was installed through Ubuntu's package system, not the “.run” file downloaded from the official website. But I ran into a troublesome problem, shown in the title:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

My environment is:

GPU: A100
Secure Boot: Disabled
OS: Ubuntu 20.04
Driver Version: nvidia-535-server (recommended by Ubuntu)

When I tried the driver installer from the official website instead, the following issue occurred:

An alternate method of installing the NVIDIA driver was detected. (This is usually a package provided by your distributor.) A driver installed via that method may integrate better with your system than a driver installed by nvidia-installer.

In addition, the following issue occurred:

Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
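One of the causes listed there is the nouveau driver being loaded. For what it's worth, checking that looks roughly like this (standard commands, shown only as a sketch):

lsmod | grep nouveau                  # is nouveau currently loaded?
grep -ri nouveau /etc/modprobe.d/     # is a blacklist entry already in place?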
 

I then grabbed '/var/log/nvidia-installer.log' and have uploaded it here:
nvidia-installer.log (35.7 KB)

I have also tried methods from other forum posts, such as:

Many other posts say that if you update the Linux kernel, an NVIDIA driver installed under the older kernel will no longer be compatible with the updated kernel. The suggested solution is to use DKMS (Dynamic Kernel Module Support) to build new kernel-compatible module files for the driver, like this:

sudo apt-get install dkms
sudo dkms install -m nvidia -v <driver_version>

The driver version can be found with the command whereis nvidia.
However, this solution didn't work for me either, since my driver was installed after the kernel update anyway.
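For reference, whether DKMS actually built and registered the NVIDIA module for the running kernel can be checked with something like the following (the nvidia-dkms-530 package name is an assumption based on the dpkg output further down):

dkms status                                        # which module versions are built, and for which kernels
sudo dkms autoinstall                              # rebuild any registered modules missing for the running kernel
sudo apt-get install --reinstall nvidia-dkms-530   # reinstalling the packaged DKMS module also forces a rebuild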

So, do you have any ideas?

What does dpkg -l | grep nvidia yield? It may be that you have more than one driver package version installed.

Thank you for your answer, but my A100 is uninstalled right now. When the A100 is installed again, I will try it!

How should I fix it?


$  dpkg -l | grep nvidia
ii  gpustat                                    0.6.0-1                                 all          pretty nvidia device monitor
ii  libnvidia-cfg1-530:amd64                   530.30.02-0ubuntu1                      amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-515                       515.105.01-0ubuntu1                     all          Shared files used by the NVIDIA libraries
ii  libnvidia-common-530                       530.30.02-0ubuntu1                      all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-515:amd64                515.105.01-0ubuntu0.22.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-compute-530:amd64                530.30.02-0ubuntu1                      amd64        NVIDIA libcompute package
ii  libnvidia-compute-530:i386                 530.30.02-0ubuntu1                      i386         NVIDIA libcompute package
iU  libnvidia-container-tools                  1.14.5-1                                amd64        NVIDIA container runtime library (command-line tools)
iU  libnvidia-container1:amd64                 1.14.5-1                                amd64        NVIDIA container runtime library
ii  libnvidia-decode-530:amd64                 530.30.02-0ubuntu1                      amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-530:i386                  530.30.02-0ubuntu1                      i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-530:amd64                 530.30.02-0ubuntu1                      amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-530:i386                  530.30.02-0ubuntu1                      i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-530:amd64                  530.30.02-0ubuntu1                      amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-530:amd64                   530.30.02-0ubuntu1                      amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-530:i386                    530.30.02-0ubuntu1                      i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-530:amd64                     530.30.02-0ubuntu1                      amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-530:i386                      530.30.02-0ubuntu1                      i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
rc  nvidia-compute-utils-515                   515.105.01-0ubuntu0.22.04.1             amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-530                   530.30.02-0ubuntu1                      amd64        NVIDIA compute utilities
ii  nvidia-container-toolkit                   1.13.1-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base              1.13.1-1                                amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-515                            515.105.01-0ubuntu0.22.04.1             amd64        NVIDIA DKMS package
ii  nvidia-dkms-530                            530.30.02-0ubuntu1                      amd64        NVIDIA DKMS package
ii  nvidia-docker2                             2.13.0-1                                all          nvidia-docker CLI wrapper
ii  nvidia-driver-530                          530.30.02-0ubuntu1                      amd64        NVIDIA driver metapackage
rc  nvidia-kernel-common-515                   515.105.01-0ubuntu0.22.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-530                   530.30.02-0ubuntu1                      amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-530                   530.30.02-0ubuntu1                      amd64        NVIDIA kernel source package
ii  nvidia-modprobe                            530.30.02-0ubuntu1                      amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-prime                               0.8.17.1                                all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            530.30.02-0ubuntu1                      amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-530                           530.30.02-0ubuntu1                      amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                    0.18.2                                  all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-530              530.30.02-0ubuntu1                      amd64        NVIDIA binary Xorg driver

(base) mona@DOS:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and

(base) mona@DOS:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
(base) mona@DOS:~$ uname -a
Linux DOS 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
(base) mona@DOS:~$ lsb_release -a
LSB Version:	core-11.1.0ubuntu4-noarch:security-11.1.0ubuntu4-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
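The following checks should show whether the 530 kernel module was actually built for this 6.5.0-21-generic kernel and, if not, why it fails to load. They are standard commands and are offered here only as a sketch:

lsmod | grep nvidia                  # is any nvidia module loaded at all?
dkms status                          # was the module built for the running kernel?
sudo modprobe nvidia                 # try loading it by hand
sudo dmesg | grep -iE 'nvidia|nvrm'  # kernel log messages explaining a load failure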

This problem happened after the training phase of NVIDIA TAO for CenterPose finished.

Feel free to let me know if further information may be needed.

My end goal is to run this notebook in offline mode and train, evaluate, and visualize CenterPose. Since training couldn't be done in the notebook (it keeps crashing), I converted it to the offline Python script listed below.


(base) mona@DOS:~/tao_tutorials/notebooks/tao_launcher_starter_kit/centerpose$ cat train_centerpose.py 
import os
import numpy as np
import cv2
import glob
import tqdm
import json
import requests
import shutil
import tensorflow as tf
import warnings
from scipy.spatial.transform import Rotation as R
import subprocess
import matplotlib.pyplot as plt
from math import ceil

os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("LOCAL_PROJECT_DIR", ""), os.getenv("PATH", ""))

os.environ["LOCAL_PROJECT_DIR"] = "/hdd/tao-experiments"
os.environ["HOST_DATA_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "data", "centerpose")
os.environ["HOST_RESULTS_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "centerpose", "results")

# Set this path if you don't run the notebook from the samples directory.
# %env NOTEBOOK_ROOT=~/tao-samples/centerpose

# The sample spec files are present in the same path as the downloaded samples.
os.environ["HOST_SPECS_DIR"] = os.path.join(
    os.getenv("NOTEBOOK_ROOT", os.getcwd()),
    "specs"
)


print('host specs dir: ', os.environ["HOST_SPECS_DIR"])
# The data is saved here

DATA_DIR = '/data'
MODEL_DIR = '/model'
SPECS_DIR = '/specs'
RESULTS_DIR = '/results'

mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
         # Mapping the Local project directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         },
        "user": "{}:{}".format(os.getuid(), os.getgid()),
        "network": "host"
   }
}

with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)


cmd = "cat ~/.tao_mounts.json"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))

cmd = "tao info --verbose"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))



def get_image(feature, shape=None):
    """Decode the tensorflow image example."""
    image = cv2.imdecode(
        np.asarray(bytearray(feature.bytes_list.value[0]), dtype=np.uint8),
        cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    if len(image.shape) > 2 and image.shape[2] > 1:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    if shape is not None:
        image = cv2.resize(image, shape)
    return image

def parse_plane(example):
    """Parses plane from a tensorflow example."""
    fm = example.features.feature
    if "plane/center" in fm and "plane/normal" in fm:
        center = fm["plane/center"].float_list.value
        center = np.asarray(center)
        normal = fm["plane/normal"].float_list.value
        normal = np.asarray(normal)
        return center, normal
    else:
        return None
    
def parse_example(example):
    """Parse the image example data"""
    fm = example.features.feature

    # Extract images, setting the input shape for Objectron Dataset
    image = get_image(fm["image/encoded"], shape=(600, 800))
    filename = fm["image/filename"].bytes_list.value[0].decode("utf-8")
    filename = filename.replace('/', '_')
    image_id = np.asarray(fm["image/id"].int64_list.value)[0]

    label = {}
    visibilities = fm["object/visibility"].float_list.value
    visibilities = np.asarray(visibilities)
    index = visibilities > 0.1

    if "point_2d" in fm:
        points_2d = fm["point_2d"].float_list.value
        points_2d = np.asarray(points_2d).reshape((-1, 9, 3))[..., :2]

    if "point_3d" in fm:
        points_3d = fm["point_3d"].float_list.value
        points_3d = np.asarray(points_3d).reshape((-1, 9, 3))

    if "object/scale" in fm:
        obj_scale = fm["object/scale"].float_list.value
        obj_scale = np.asarray(obj_scale).reshape((-1, 3))

    if "object/translation" in fm:
        obj_trans = fm["object/translation"].float_list.value
        obj_trans = np.asarray(obj_trans).reshape((-1, 3))

    if  "object/orientation" in fm:
        obj_ori = fm["object/orientation"].float_list.value
        obj_ori = np.asarray(obj_ori).reshape((-1, 3, 3))

    label["2d_instance"] = points_2d[index]
    label["3d_instance"] = points_3d[index]
    label["scale_instance"] = obj_scale[index]
    label["translation"] = obj_trans[index]
    label["orientation"] = obj_ori[index]
    label["image_id"] = image_id
    label["visibility"] = visibilities[index]
    label['ORI_INDEX'] = np.argwhere(index).flatten()
    label['ORI_NUM_INSTANCE'] = len(index)
    return image, label, filename

def parse_camera(example):
    """Parse the camera calibration data"""
    fm = example.features.feature
    if "camera/projection" in fm:
        proj = fm["camera/projection"].float_list.value
        proj = np.asarray(proj).reshape((4, 4))
    else:
        proj = None
        
    if "camera/view" in fm:
        view = fm["camera/view"].float_list.value
        view = np.asarray(view).reshape((4, 4))
    else:
        view = None
    
    if "camera/intrinsics" in fm:
        intrinsic = fm["camera/intrinsics"].float_list.value
        intrinsic = np.asarray(intrinsic).reshape((3, 3))
    else:
        intrinsic = None
    return proj, view, intrinsic

def partition(lst, n):
    """Equally split the video lists."""
    division = len(lst) / float(n) if n else len(lst)
    return [lst[int(np.round(division * i)): int(np.round(division * (i + 1)))] for i in range(n)]


OBJECTRON_BUCKET = "gs://objectron/v1/records_shuffled"
PUBLIC_URL = "https://storage.googleapis.com/objectron"
SAVE_DIR = os.getenv("HOST_DATA_DIR", os.getcwd())

# Please add the "test" into the array if you want to evaluate the whole testing set. It requires at least 30GB to download the bike category. 
# DATA_DISTRIBUTION = ['train', 'val', 'test']
DATA_DISTRIBUTION = ['train', 'val']

# Note that the sample spec is not meant to produce SOTA accuracy on Objectron dataset. 
# To reproduce SOTA, you should set `TRAIN_FR` as 15 and `DATA_DOWNLOAD` as -1 to match the original parameters.
TRAIN_FR = 30
VAL_FR = 60
TEST_FR = 1
DATA_DOWNLOAD = 10000

# Please select the specific categories that you want to train the CenterPose model. 
# CATEGORIES = ['bike', 'book', 'bottle', 'camera', 'cereal_box', 'chair', 'laptop', 'shoe']
CATEGORIES = ['bike']

memory_free = shutil.disk_usage(SAVE_DIR).free
if len(CATEGORIES) >= 8 and memory_free < 4.4E12:
    warnings.warn("No enough space for downloading all 8 categories.")
'''
for c in CATEGORIES:
    for dist in DATA_DISTRIBUTION:
        # Download the tfrecord files
        if dist in ['test', 'val']:
            eval_data = f'/{c}/{c}_test*'
            blob_path = PUBLIC_URL + f"/v1/index/{c}_annotations_test"
        elif dist in ['train']:
            eval_data = f'/{c}/{c}_train*'
            blob_path = PUBLIC_URL + f"/v1/index/{c}_annotations_train"
        else:
            raise ValueError("No specific data distribution settings.")

        eval_shards = tf.io.gfile.glob(OBJECTRON_BUCKET + eval_data)
        ds = tf.data.TFRecordDataset(eval_shards).take(DATA_DOWNLOAD)

        with tf.io.TFRecordWriter(f'{SAVE_DIR}/{c}_{dist}.tfrecord') as file_writer:
            for serialized in tqdm.tqdm(ds): 
                example = tf.train.Example.FromString(serialized.numpy())
                record_bytes = example.SerializeToString()
                file_writer.write(record_bytes)

        # Get the video ids
        video_ids = requests.get(blob_path).text
        video_ids = [i.replace('/', '_') for i in video_ids.split('\n')]
        
        # Work on a subset of the videos for each round, where the subset is equally split
        video_ids_split = partition(video_ids, int(np.floor(len(video_ids) / int(len(video_ids) / 2))))

        # Decode the tfrecord files
        tfdata = f'{SAVE_DIR}/{c}_{dist}*'
        eval_shards = tf.io.gfile.glob(tfdata)

        new_ds = tf.data.TFRecordDataset(eval_shards).take(-1)

        for subset in video_ids_split:
            videos = {}
            for serialized in tqdm.tqdm(new_ds):

                example = tf.train.Example.FromString(serialized.numpy())

                # Group according to video_id & image_id
                fm = example.features.feature
                filename = fm["image/filename"].bytes_list.value[0].decode("utf-8")
                video_id = filename.replace('/', '_')
                image_id = np.asarray(fm["image/id"].int64_list.value)[0]
                
                # Sometimes, data is too big to save, so we only focus on a small subset instead.
                if video_id not in subset:
                    continue
                
                if video_id in videos:
                    videos[video_id].append((image_id, example))
                else:
                    videos[video_id] = []
                    videos[video_id].append((image_id, example))
            
            # Saved the decoded tfrecord files. 
            save_tfrecords = f'{SAVE_DIR}/{c}/tfrecords/{dist}'
            if not os.path.exists(save_tfrecords):
                os.makedirs(save_tfrecords)
            for video_id in tqdm.tqdm(videos):
                with tf.io.TFRecordWriter(f'{save_tfrecords}/{video_id}.tfrecord') as file_writer:
                    for image_data in videos[video_id]:
                        record_bytes = image_data[1].SerializeToString()
                        file_writer.write(record_bytes)

        # Extract the images and ground truth.
        videos = [os.path.splitext(os.path.basename(i))[0] for i in glob.glob(f'{save_tfrecords}/*.tfrecord')]
        if dist in ['train']:
            frame_rate = TRAIN_FR
        elif dist in ['val']:
            frame_rate = VAL_FR
        elif dist in ['test']:
            frame_rate = TEST_FR
        else:
            raise ValueError("No specific data distribution settings.")
        
        for idx, key in enumerate(videos):
            print(f'Video {idx}, {key}:')
            ds = tf.data.TFRecordDataset(f'{save_tfrecords}/{key}.tfrecord').take(-1)

            for serialized in tqdm.tqdm(ds):
                example = tf.train.Example.FromString(serialized.numpy())

                image, label, prefix = parse_example(example)
                frame_id = label['image_id']

                if int(frame_id) % frame_rate == 0:
                    
                    proj, view, cam_intrinsic = parse_camera(example)
                    plane = parse_plane(example)

                    cam_intrinsic[:2, :3] = cam_intrinsic[:2, :3] / 2.4
                    center, normal = plane
                    height, width, _ = image.shape

                    im_bgr = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
                    
                    dict_out = {
                        "camera_data" : {
                            "width" : width,
                            'height' : height,
                            'camera_view_matrix':view.tolist(),
                            'camera_projection_matrix':proj.tolist(),
                            'intrinsics':{
                                'fx':cam_intrinsic[1][1],
                                'fy':cam_intrinsic[0][0],
                                'cx':cam_intrinsic[1][2],
                                'cy':cam_intrinsic[0][2]
                            }
                        }, 
                        "objects" : [],
                        "AR_data":{
                            'plane_center':[center[0],
                                            center[1],
                                            center[2]],
                            'plane_normal':[normal[0],
                                            normal[1],
                                            normal[2]]
                        }
                    }
                    
                    for object_id in range(len(label['2d_instance'])):
                        object_categories = c
                        quaternion = R.from_matrix(label['orientation'][object_id]).as_quat()
                        trans = label['translation'][object_id]

                        projected_keypoints = label['2d_instance'][object_id]
                        projected_keypoints[:, 0] *= width
                        projected_keypoints[:, 1] *= height

                        object_scale = label['scale_instance'][object_id]
                        keypoints_3d = label['3d_instance'][object_id]
                        visibility = label['visibility'][object_id]

                        dict_obj={
                            'class': object_categories,
                            'name': object_categories+'_'+str(object_id),
                            'provenance': 'objectron',
                            'location': trans.tolist(),
                            'quaternion_xyzw': quaternion.tolist(),
                            'projected_cuboid': projected_keypoints.tolist(),
                            'scale': object_scale.tolist(),
                            'keypoints_3d': keypoints_3d.tolist(),
                            'visibility': visibility.tolist()
                        }
                        # Final export
                        dict_out['objects'].append(dict_obj)

                    save_path = f"{SAVE_DIR}/{c}/{dist}/{prefix}/"
                    if not os.path.exists(save_path):
                        os.makedirs(save_path)

                    filename = f"{save_path}/{str(frame_id).zfill(5)}.json"
                    with open(filename, 'w+') as fp:
                        json.dump(dict_out, fp, indent=4, sort_keys=True)
                
                    cv2.imwrite(f"{save_path}/{str(frame_id).zfill(5)}.png", im_bgr)

'''
cmd = "echo $HOST_DATA_DIR"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))


# Pull pretrained model from NGC

print("Check if model is downloaded into dir.")


cmd= "ls -l $HOST_RESULTS_DIR/pretrained_models/pretrained_fan_classification_nvimagenet_vfan_small_hybrid_nvimagenet"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))

print("For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.")
# If you face an out-of-memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
#!tao model centerpose train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/
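# A subprocess equivalent of the commented-out notebook cell above would look roughly like
# this (hedged sketch only; the batch-size override is optional, per the note above):
# cmd = "tao model centerpose train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/ dataset.batch_size=2"
# result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# print(result.stdout.decode("utf-8"))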




# You can set NUM_EPOCH to the epoch corresponding to any saved checkpoint
#%env NUM_EPOCH=39
os.environ["NUM_EPOCH"] = "39"  # export so the shell command below can expand $NUM_EPOCH

# Get the name of the checkpoint corresponding to your set epoch
#tmp=!ls $HOST_RESULTS_DIR/train/*.pth | grep epoch=0$NUM_EPOCH
#%env CHECKPOINT={tmp[0]}


cmd = 'ls $HOST_RESULTS_DIR/train/*.pth | grep epoch=0$NUM_EPOCH'
tmp = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(tmp.stdout.decode("utf-8"))

#CHECKPOINT = {tmp[0]}
CHECKPOINT = "/hdd/tao-experiments/centerpose/results/pretrained_models/pretrained_fan_classification_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth" #MONA fix this
os.environ["CHECKPOINT"] = CHECKPOINT  # export so the cp command below can expand $CHECKPOINT
print('Rename a trained model: ')
print('---------------------')

cmd = "cp $CHECKPOINT $HOST_RESULTS_DIR/train/centerpose_model.pth"

result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))


cmd = "ls -ltrh $HOST_RESULTS_DIR/train/centerpose_model.pth"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))





#!cp $CHECKPOINT $HOST_RESULTS_DIR/train/centerpose_model.pth
#!ls -ltrh $HOST_RESULTS_DIR/train/centerpose_model.pth


# Evaluate on TAO model
cmd = "tao model centerpose evaluate -e $SPECS_DIR/evaluate.yaml evaluate.checkpoint=$RESULTS_DIR/train/centerpose_model.pth results_dir=$RESULTS_DIR"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))




cmd = "tao model centerpose inference -e $SPECS_DIR/infer.yaml inference.checkpoint=$RESULTS_DIR/train/centerpose_model.pth results_dir=$RESULTS_DIR"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))

valid_image_ext = ['.png']

def visualize_images(output_path, num_cols=4, num_images=10):
    num_rows = int(ceil(float(num_images) / float(num_cols)))
    f, axarr = plt.subplots(num_rows, num_cols, figsize=[40,30])
    f.tight_layout()
    a = [os.path.join(output_path, image) for image in os.listdir(output_path)
         if os.path.splitext(image)[1].lower() in valid_image_ext]
    for idx, img_path in enumerate(a[:num_images]):
        col_id = idx % num_cols
        row_id = idx // num_cols
        img = plt.imread(img_path)
        axarr[row_id, col_id].imshow(img)





# Visualizing the sample images.
# Note that the sample spec is not meant to produce SOTA (state-of-the-art) accuracy on Objectron dataset.
IMAGE_DIR = os.path.join(os.environ['HOST_RESULTS_DIR'], "inference")
COLS = 2 # number of columns in the visualizer grid.
IMAGES = 4 # number of images to visualize.

visualize_images(IMAGE_DIR, num_cols=COLS, num_images=IMAGES)
plt.show()  # in a standalone script the figure is not displayed without this

cmd = "mkdir -p $HOST_RESULTS_DIR/export"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))


# Export the RGB model to ONNX model
cmd = "tao model centerpose export -e $SPECS_DIR/export.yaml export.checkpoint=$RESULTS_DIR/train/centerpose_model.pth export.onnx_file=$RESULTS_DIR/export/centerpose_model.onnx"
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode("utf-8"))