/opt/nim/start-server.sh: line 61: 32 Killed python3 -m vllm_nvext.entrypoints.openai.api_server

Hi, I get the error below when running the NVIDIA NIM Docker container nvcr.io/nim/meta/llama3-8b-instruct:1.0.0:

/opt/nim/start-server.sh: line 61: 32 Killed python3 -m vllm_nvext.entrypoints.openai.api_server

AWS Instance Type
g5.xlarge

NVIDIA GPU-Optimized AMI

uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="XXX"
SUPPORT_URL="XXX"
BUG_REPORT_URL="XXX"
PRIVACY_POLICY_URL="XXX"
UBUNTU_CODENAME=jammy

lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)

nvidia-smi
Tue Jul 9 20:51:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   32C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
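
A bare `Killed` with no Python traceback usually means the kernel OOM killer terminated the process because host RAM (not GPU memory) ran out; a g5.xlarge has 16 GiB of system RAM, while the startup log below reports an engine size of roughly 16 GB. This is a quick sketch to compare the two numbers and look for OOM-killer traces (standard Linux commands, nothing NIM-specific; the engine size is copied from the log):

```shell
# Host RAM from /proc/meminfo (reported in kB) vs. the engine size from the NIM log
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
engine_bytes=16067779716   # "Engine size in bytes" from the startup log below
echo "Host RAM:    $((total_kb / 1024)) MiB"
echo "Engine size: $((engine_bytes / 1024 / 1024)) MiB"

# After a kill, the OOM killer leaves a trace in the kernel ring buffer:
sudo dmesg | grep -iE "out of memory|oom-kill|killed process" | tail -n 5
```

If `dmesg` shows an `oom-kill` entry naming the python3 process, the fix is more host RAM (a larger instance type or swap), not a GPU-side change.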

Complete docker run command and output:
docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY -v ~/.cache/nim:/opt/nim/.cache -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
XXXX.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: XXXX

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2024-07-09 20:34:26,742 [INFO] PyTorch version 2.2.2 available.
2024-07-09 20:34:27,768 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-07-09 20:34:27,768 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-07-09 20:34:27,935 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 07-09 20:34:29.400 api_server.py:489] NIM LLM API version 1.0.0
INFO 07-09 20:34:29.406 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 07-09 20:34:29.406 ngc_profile.py:219] Detected 2 compatible profile(s).
INFO 07-09 20:34:29.406 ngc_injector.py:106] Valid profile: c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput) on GPUs [0]
INFO 07-09 20:34:29.406 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 07-09 20:34:29.407 ngc_injector.py:141] Selected profile: c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: gpu_device: 2237:10de
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: profile: throughput
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: pp: 1
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: gpu: A10G
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: llm_engine: tensorrt_llm
INFO 07-09 20:34:30.223 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-09 20:34:30.223 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-09 20:34:33.320 ngc_injector.py:172] Model workspace is now ready. It took 3.097 seconds
INFO 07-09 20:34:33.325 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/meta--llama3-8b-instruct-ba3010gf', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-ba3010gf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 07-09 20:34:33.683 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-09 20:34:33.699 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 07-09 20:34:33.700 utils.py:207] Engine size in bytes 16067779716
INFO 07-09 20:34:33.700 utils.py:211] available device memory 23606263808
INFO 07-09 20:34:33.700 utils.py:218] Setting free_gpu_memory_fraction to 0.9
/opt/nim/start-server.sh: line 61: 32 Killed python3 -m vllm_nvext.entrypoints.openai.api_server
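
For reference, the log shows two valid profiles, and the TensorRT-LLM one is auto-selected. Based on the NIM documentation, a specific profile can be pinned with the `NIM_MODEL_PROFILE` environment variable; the sketch below pins the vLLM profile ID copied from the "Valid profile" lines above. This is an untested variant of my run command, not a confirmed fix:

```shell
# Pin the vLLM profile instead of the auto-selected TensorRT-LLM one
# (profile ID taken from the "Valid profile" lines in the startup log)
docker run -it --rm --gpus all --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```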