When I try to deploy the llama3-8b NIM, the server process is killed during startup. This is the log:
INFO 08-24 19:39:57.168 ngc_injector.py:172] Model workspace is now ready. It took 2.855 seconds
INFO 08-24 19:39:57.173 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/meta--llama3-8b-instruct-75evn1pg', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-75evn1pg', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 08-24 19:39:57.539 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-24 19:39:57.554 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 08-24 19:39:57.554 utils.py:207] Engine size in bytes 16067779716
INFO 08-24 19:39:57.554 utils.py:211] available device memory 23606329344
INFO 08-24 19:39:57.554 utils.py:218] Setting free_gpu_memory_fraction to 0.9
/opt/nim/start-server.sh: line 61: 32 Killed python3 -m vllm_nvext.entrypoints.openai.api_server
I’m running it on an AWS EC2 g5.large instance with the image "Deep Learning OSS Nvidia Driver AMI GPU TensorFlow 2.16 (Ubuntu 20.04)".
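To put the byte counts from the log in perspective, here is a rough sketch that converts them to GiB and compares the engine size against the reported GPU memory and against the host RAM of whatever machine it runs on. The helper names (`to_gib`, `engine_fits`, `host_ram_bytes`) are made up for illustration and are not part of NIM. The idea that the "Killed" line points at memory pressure is an assumption: a shell normally prints "Killed" when a process receives SIGKILL, which on Linux is often (but not always) the kernel out-of-memory killer, and nothing in the log confirms it.

```python
import os

GIB = 1024 ** 3

# Values copied from the log above.
ENGINE_SIZE_BYTES = 16_067_779_716
AVAILABLE_GPU_BYTES = 23_606_329_344


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def engine_fits(engine_bytes: int, capacity_bytes: int) -> bool:
    """True if the engine file is no larger than the given capacity."""
    return engine_bytes <= capacity_bytes


def host_ram_bytes() -> int:
    """Total physical RAM of the machine this runs on (Linux only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    print(f"engine size:    {to_gib(ENGINE_SIZE_BYTES):.2f} GiB")
    print(f"GPU memory:     {to_gib(AVAILABLE_GPU_BYTES):.2f} GiB")
    print(f"fits on GPU:    {engine_fits(ENGINE_SIZE_BYTES, AVAILABLE_GPU_BYTES)}")

    # Assumption: the engine may need to pass through host RAM while loading,
    # so a host with less RAM than the engine size could be a problem.
    ram = host_ram_bytes()
    print(f"host RAM:       {to_gib(ram):.2f} GiB")
    print(f"fits in host:   {engine_fits(ENGINE_SIZE_BYTES, ram)}")
```

Running this on the instance itself, together with checking `dmesg` for out-of-memory messages after the crash, should show whether host RAM rather than GPU memory is the limiting factor.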
Could someone please help me?
Thank you.