Clara Train SDK Installation

Does Clara Train SDK v3.0 run only on AGX, EGX, DGX, or Cloud (AWS, GCP, Azure, or other cloud instances), as described at the URL below? Or can I install Clara Train SDK on my local GPU servers or workstations?

URL: https://developer.nvidia.com/clara-medical-imaging

I read the NVIDIA Clara Train SDK documentation at the URL below. It only lists a GPU hardware requirement; it does not say that the supported hardware is limited to AGX, EGX, DGX, and Cloud. I have a local GPU server with P100s and tried to install Clara Train SDK there. I was able to pull the Clara Train SDK container, but I always get the error message below after running it. According to the system description on the Clara medical imaging website above, the Clara Train SDK's hardware requirements appear to be limited to AGX, EGX, DGX, or Cloud. If the NVIDIA Clara Train SDK really runs only on AGX, EGX, DGX, or Cloud, please let me know what I am doing wrong.

URL: https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v3.0/nvmidl/installation.html
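For reference, this is roughly how I pull and start the container (image tag taken from the installation page; the run flags are paraphrased from memory and my exact mount paths are omitted):

export dockerImage=nvcr.io/nvidia/clara-train-sdk:v3.0
docker pull $dockerImage
docker run -it --rm --gpus all --shm-size=1g $dockerImage /bin/bash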

Error Message:

NVIDIA Release 19.10 (build 8471601)
TensorFlow Version 1.14.0

Container image Copyright © 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications © NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

Hi Peita,

Thanks for your interest in Clara Train v3.0. The P100s in your server have CUDA Compute Capability 6.0, which is sufficient for Clara Train v3.0.

The other requirement is the NVIDIA Container Toolkit. Based on the error about detecting the GPU inside the container, it seems this may not be installed and configured properly.

Please be sure this is installed and configured with a supported version of Docker, and that you’re launching the container as described in the Clara Train installation instructions.
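For reference, on Ubuntu the toolkit setup and a quick GPU check usually look something like the following (a sketch based on the generic nvidia-docker instructions; the repository setup lines and the CUDA image tag may differ for your distribution and driver):

# Add the nvidia-docker package repository (Ubuntu/Debian example)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart the Docker daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify that the GPUs are visible inside a container
docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi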

Thanks,
Kris

Hi, kkersten

I have installed the NVIDIA Container Toolkit on the machine. I can run the docker command below and get the expected response (see the screenshot below). I think NVIDIA Clara requires an NGC-Ready machine: https://docs.nvidia.com/ngc/ngc-ready-systems/index.html. My other machine, which is listed on the NGC-Ready systems page at that URL, can run the NVIDIA Clara container and detect the GPU, and I am running AIAA on it as well. Can you confirm that the NVIDIA Clara SDK runs only on NGC-Ready machines? If not, please tell me why my first machine does not detect the P100s.

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

[screenshot: nvidia-smi output from the command above]

Even so, I get the “ERROR: No supported GPU(s) detected to run this container” error, yet the container can detect one of the 2 GPUs. Is this a software bug, given that nvidia-smi can find my GPU after the container has started?

================
== TensorFlow ==
================

NVIDIA Release 19.10 (build 8471601)
TensorFlow Version 1.14.0

Container image Copyright © 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications © NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

root@d53027394af5:/opt/nvidia# nvidia-smi
Tue Sep 29 21:14:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:81:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hi Peita,

Thanks for digging into this further. Your config looks good based on the output in your previous posts. This could be a bug in the framework. I’m working with engineering on a workaround and will follow up here. Stay tuned, and thanks for your patience!

-Kris

Hey Peita,

There shouldn’t be a conflict between the framework and your R440 driver on the P100s.

A couple of follow-up questions: are you running the nvidia/cuda:11.0-base and Clara containers with the same docker arguments? Are there any differences in the environment that’s passed to the containers?

Are you able to see both GPUs in the standard TF container?

docker run --rm --gpus=all nvcr.io/nvidia/tensorflow:19.10-py3 nvidia-smi
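If it helps narrow things down, one quick way to compare what the two containers actually see is to dump and diff their environments (just a sketch; substitute whatever images and arguments you’re actually launching with):

docker run --rm --gpus=all nvidia/cuda:11.0-base env | sort > cuda.env
docker run --rm --gpus=all nvcr.io/nvidia/tensorflow:19.10-py3 env | sort > tf.env
diff cuda.env tf.env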

I’d also recommend adding your user to the docker group rather than running with sudo.
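For example (assuming the default docker group exists; you’ll need to log out and back in for the change to take effect):

sudo usermod -aG docker $USER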

Thanks,
Kris

Hi, Kris

I could not run the standard TF container. I got the error below.

Unable to find image ‘nvcr.io/nvidia/tensorflow:19.10-py3’ locally
docker: Error response from daemon: Get https://nvcr.io/v2/nvidia/tensorflow/manifests/19.10-py3: received unexpected HTTP status: 502 Bad Gateway.
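I will retry the pull later in case it was a temporary problem on the registry side, and I will also try logging in to nvcr.io with my NGC API key first, roughly like this, in case that makes a difference (I am not sure it is related to the 502):

docker login nvcr.io    # username: $oauthtoken, password: <NGC API key>
docker pull nvcr.io/nvidia/tensorflow:19.10-py3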