Hello,
I am following the instructions above to install CNS on an H100 HGX server. Validation #1, nvidia-smi, works fine; see the output below. However, I cannot pass validation #2, cuda-vector-add. I ran into two issues: I resolved the first one, but I am stuck on the second.
Issue #1 - runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all: unknown
solution #1 - add the following to the container spec in the yaml file to explicitly configure IMEX:
env:
  - name: NVIDIA_IMEX_CHANNELS
    value: "0"
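For context, here is roughly how that env entry sits in the full pod spec I used for the cuda-vector-add test (a sketch; the pod name matches my logs below, and the image tag is whatever your CNS validation instructions specify):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-imex
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      # use the cuda-vector-add image from your CNS validation instructions
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      env:
        # explicit IMEX channel config; works around the
        # "unsupported IMEX channel value: all" hook error
        - name: NVIDIA_IMEX_CHANNELS
          value: "0"
```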
Issue #2 - Failed to allocate device vector A (error code forward compatibility was attempted on non supported HW)!
[Vector addition of 50000 elements]
When I exec into the container and run the vectorAdd binary, it exits with code 137, which according to my search indicates an OOM-style kill. Can someone help?
root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# ls
Makefile NsightEclipse.xml readme.txt vectorAdd vectorAdd.cu vectorAdd.o
root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# ./vectorAdd
[Vector addition of 50000 elements]
command terminated with exit code 137
Li
(base) user@h100:~$ kubectl logs nvidia-smi-test-12.4-ubuntu22.04
Tue Feb 4 02:05:46 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:21:00.0 Off |                    0 |
| N/A   28C    P0             79W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:35:00.0 Off |                    0 |
| N/A   27C    P0             78W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5E:00.0 Off |                    0 |
| N/A   27C    P0             78W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9E:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:B5:00.0 Off |                    0 |
| N/A   26C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   28C    P0             75W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DD:00.0 Off |                    0 |
| N/A   27C    P0             75W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(base) user@h100:~$ cat k8-pod-nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test-12.4-ubuntu22.04
spec:
  restartPolicy: Never # Add this line
  runtimeClassName: nvidia
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]