Validation #2 - CUDA-Vector-Add keeps failing on cloud native stack v14.0

cloud-native-stack/install-guides/Ubuntu-22-04_Server_Developer-x86-arm64_v14.0.md at master · NVIDIA/cloud-native-stack

Hello,

I am following the instructions above to install CNS on an H100 HGX server. Validation #1, nvidia-smi, works fine; see the output below.

However, I cannot pass validation #2, the cuda-vector-add test. I ran into two issues; I resolved the first one, but I am stuck on the second.

Issue #1 - runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all: unknown

Solution #1 - add the following to the pod YAML to configure the IMEX channels explicitly:

  env:
    - name: NVIDIA_IMEX_CHANNELS
      value: "0"
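
For context, this is roughly where that env block sits in my cuda-vector-add pod spec (a minimal sketch; I believe the guide uses the classic k8s.gcr.io/cuda-vector-add:v0.1 image, which would match the cuda-8.0 paths I see inside the container, but treat the image tag and the resources block as assumptions):

  apiVersion: v1
  kind: Pod
  metadata:
    name: cuda-vector-add-imex
  spec:
    restartPolicy: OnFailure
    runtimeClassName: nvidia
    containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"  # assumption: use whatever image the guide specifies
        env:
          - name: NVIDIA_IMEX_CHANNELS  # workaround for the IMEX parsing error in Issue #1
            value: "0"
        resources:
          limits:
            nvidia.com/gpu: 1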

Issue #2 - the cuda-vector-add pod fails with:

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code forward compatibility was attempted on non supported HW)!

When I exec into the container and run the vectorAdd binary by hand, the command is killed with exit code 137 (128 + SIGKILL), which according to my search is typically an OOM kill. Can someone help?

root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# ls
Makefile NsightEclipse.xml readme.txt vectorAdd vectorAdd.cu vectorAdd.o
root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# ./vectorAdd
[Vector addition of 50000 elements]

command terminated with exit code 137
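
To check whether the 137 is actually an OOM kill, my next step is to inspect the container's termination state and the node's kernel log, along these lines (standard kubectl/dmesg; the pod name is taken from my shell prompt above):

  kubectl describe pod cuda-vector-add-imex
  kubectl get pod cuda-vector-add-imex -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
  # on the node itself, check for OOM-killer activity:
  sudo dmesg | grep -i 'out of memory'

If nothing points at the OOM killer, I suspect the 137 is a red herring and the real problem is the "forward compatibility" error from cudaMalloc, possibly because the sample image is built on CUDA 8.0, which predates the H100.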

Li

(base) user@h100:~$ kubectl logs nvidia-smi-test-12.4-ubuntu22.04
Tue Feb  4 02:05:46 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:21:00.0 Off |                    0 |
| N/A   28C    P0             79W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:35:00.0 Off |                    0 |
| N/A   27C    P0             78W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5E:00.0 Off |                    0 |
| N/A   27C    P0             78W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9E:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:B5:00.0 Off |                    0 |
| N/A   26C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   28C    P0             75W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DD:00.0 Off |                    0 |
| N/A   27C    P0             75W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(base) user@h100:~$ cat k8-pod-nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test-12.4-ubuntu22.04
spec:
  restartPolicy: Never # Add this line
  runtimeClassName: nvidia
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]
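
For completeness, I create and check this test pod with plain kubectl, roughly:

  kubectl apply -f k8-pod-nvidia-smi.yaml
  kubectl logs nvidia-smi-test-12.4-ubuntu22.04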

The 3rd validation suggested in the CNS installation guide, "the DeepStream - Intelligent Video Analytics Demo Application on your NVIDIA Cloud Native Stack", also works fine on my server. I used the no-camera option with the integrated video file, and I can see objects being flagged in the web interface.

So only the cuda-vector-add test is failing.