Also, I am running the following command:
docker run -it --rm \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:latest
Does it prepare for data-parallel or model-parallel computation? The log excerpt below contains the line "Profile metadata: tp: 2". Does tp stand for tensor parallelism, meaning that the layers of the model are split across the detected GPUs? If so, how can I run this Docker image data-parallel instead? (My own untested idea is sketched right after the excerpt.)
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: tp: 2
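If tp: 2 really does mean tensor parallelism over two GPUs, my current (untested) idea for data parallelism is to pin the single-GPU profile that the log reports as valid (vllm-fp16-tp1) and start one container per GPU, each on its own host port. I am assuming here that the NIM_MODEL_PROFILE environment variable is the right way to override the automatically selected profile, and I am reusing the vllm-fp16-tp1 profile ID from the full log below:

# Untested sketch: one NIM container per GPU, both pinned to the tp1 profile.
# NIM_MODEL_PROFILE and the profile ID (copied from the vllm-fp16-tp1 line in
# the log below) are my assumptions; requests would then be spread across the
# two host ports by the client or a simple load balancer.

# Instance 1 on GPU 0, served on host port 8000
docker run -d --rm \
  --gpus '"device=0"' \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest

# Instance 2 on GPU 1, served on host port 8001
docker run -d --rm \
  --gpus '"device=1"' \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8001:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest

Is something like this the intended way, or is there a built-in option for data parallelism within a single container?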
Full log:
docker run -it --rm \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:latest
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2025-01-26 07:39:03,788 [INFO] PyTorch version 2.2.2 available.
2025-01-26 07:39:04,403 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-26 07:39:04,403 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-26 07:39:04,458 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-26 07:39:05.15 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-26 07:39:05.17 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-26 07:39:05.17 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-26 07:39:05.17 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-26 07:39:05.17 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-26 07:39:05.17 ngc_injector.py:142] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: tp: 2
INFO 01-26 07:39:05.22 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-26 07:39:05.23 ngc_injector.py:173] Model workspace is now ready. It took 0.001 seconds
2025-01-26 07:39:06,909 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-26 07:39:07.954 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-yba7l2iw', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-yba7l2iw', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-26 07:39:08.133 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:10.963 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:10 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-26 07:39:11 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:11 selector.py:28] Using FlashAttention backend.
INFO 01-26 07:39:12 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:12 pynccl_utils.py:43] vLLM is using nccl==2.19.3
INFO 01-26 07:39:12.187 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 01-26 07:39:12 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:12 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=7440) WARNING 01-26 07:39:12 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-26 07:39:18 model_runner.py:173] Loading model weights took 7.4829 GB
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:18 model_runner.py:173] Loading model weights took 7.4829 GB
INFO 01-26 07:39:19 ray_gpu_executor.py:217] # GPU blocks: 33915, # CPU blocks: 4096
INFO 01-26 07:39:21 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-26 07:39:21 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:21 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:21 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:26 model_runner.py:1054] Graph capturing finished in 6 secs.
INFO 01-26 07:39:26 model_runner.py:1054] Graph capturing finished in 6 secs.
WARNING 01-26 07:39:26.880 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:26.887 serving_chat.py:347] Using default chat template:
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}{% endif %}
WARNING 01-26 07:39:27.40 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:27.48 api_server.py:456] Serving endpoints:
0.0.0.0:8000/openapi.json
0.0.0.0:8000/docs
0.0.0.0:8000/docs/oauth2-redirect
0.0.0.0:8000/metrics
0.0.0.0:8000/v1/health/ready
0.0.0.0:8000/v1/health/live
0.0.0.0:8000/v1/models
0.0.0.0:8000/v1/version
0.0.0.0:8000/v1/chat/completions
0.0.0.0:8000/v1/completions
INFO 01-26 07:39:27.48 api_server.py:460] An example cURL request:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 15,
"stream": true,
"frequency_penalty": 1.0,
"stop": ["hello"]
}'
INFO 01-26 07:39:27.80 server.py:82] Started server process [31]
INFO 01-26 07:39:27.80 on.py:48] Waiting for application startup.
INFO 01-26 07:39:27.85 on.py:62] Application startup complete.
INFO 01-26 07:39:27.87 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
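With the two-container sketch above, I assume I would simply check each instance separately via the readiness endpoint listed in the log (ports 8000 and 8001 in my sketch) and then have the client alternate between the two base URLs:

# Hypothetical check for the two-instance setup sketched above
curl -s http://0.0.0.0:8000/v1/health/ready
curl -s http://0.0.0.0:8001/v1/health/ready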