Why is the NIM container setting the datatype to bfloat16 when I pick an fp16 profile?
How do I force a data type that works on my hardware when I pick a compatible profile?
Hardware: Titan RTX 24GB
O/S: Ubuntu
The Mistral 7B Instruct v0.3 NIM reports a compatible profile:
(base) joe@hp-z820:~$ docker run --gpus all nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).
SYSTEM INFO
- Free GPUs:
- [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
- 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
- With LoRA support:
- 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
- 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
- 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
- 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
- cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
- dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
- 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
- ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
- c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
- 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
- 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
- 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
- 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)
When I start the container with the compatible profile, it fails:
docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY=MY_APIKEY -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest
...
...
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).
2024-08-13 02:46:38,253 [INFO] PyTorch version 2.2.2 available.
2024-08-13 02:46:39,051 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-08-13 02:46:39,051 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-08-13 02:46:39,083 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 08-13 02:46:40.238 api_server.py:489] NIM LLM API version 1.0.0
INFO 08-13 02:46:40.240 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 08-13 02:46:40.240 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 08-13 02:46:40.240 ngc_injector.py:106] Valid profile: 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1) on GPUs [0]
INFO 08-13 02:46:40.240 ngc_injector.py:141] Selected profile: 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
INFO 08-13 02:46:40.649 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 08-13 02:46:40.649 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 08-13 02:46:40.649 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 08-13 02:46:40.649 ngc_injector.py:146] Profile metadata: tp: 1
...
It then fails with the dreaded data type problem, even though the profile said it was fp16:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA TITAN RTX GPU has compute capability 7.5. You can use float16 instead by explicitly setting the dtype flag in CLI, for example: --dtype=half.
Thanks. That worked for me with a little additional tweaking.
This feels like a defect: the runtime doesn't pick the correct data type for the hardware even when the model already ships in a data type that works on the card.
- The model profile is compatible, and its data type works on this GPU.
- The runtime still defaults to the more modern bfloat16.
- I have to override the data type by hand even though the data type reported by the profile matches the hardware.
Then we have to more or less guess at the other tuning parameters to get it to fit. The tooling looks like it defaults to the model's full 32K sequence length.
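For anyone else trying to size this by hand: plain nvidia-smi (nothing NIM-specific, just a standard query I use as a sanity check) shows how much VRAM is actually free before and while the engine loads:
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv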
When I run the model with an explicit entrypoint, it errors out because of model size. Note that the model weights were 13.5GB.
I have to run the container for a supported profile with a data type override and a max model length: --dtype half --max-model-len 26000
Command that will fail because the model doesn't fit:
(base) joe@hp-z820:~$ docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY=MY_API_KEY -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).
2024-08-13 12:02:44,557 [INFO] PyTorch version 2.2.2 available.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 08-13 12:02:46.689 api_server.py:489] NIM LLM API version 1.0.0
INFO 08-13 12:02:46.692 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 08-13 12:02:46.692 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 08-13 12:02:46.692 ngc_injector.py:106] Valid profile: 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1) on GPUs [0]
INFO 08-13 12:02:46.692 ngc_injector.py:141] Selected profile: 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
INFO 08-13 12:02:47.90 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 08-13 12:02:47.90 ngc_injector.py:146] Profile metadata: tp: 1
INFO 08-13 12:02:47.90 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 08-13 12:02:47.90 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 08-13 12:02:47.90 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 08-13 12:02:51.27 ngc_injector.py:172] Model workspace is now ready. It took 3.937 seconds
WARNING 08-13 12:02:51.39 config.py:1017] Casting torch.bfloat16 to torch.float16.
INFO 08-13 12:02:51.40 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/mistralai--mistral-7b-instruct-v0.3-2l3t1z72', speculative_config=None, tokenizer='/tmp/mistralai--mistral-7b-instruct-v0.3-2l3t1z72', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
...
...
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (20928). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
This will run if we force both the data type and the max model length.
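For reference, here is the full invocation that runs for me; it is just the failing command above with --max-model-len 26000 appended (cache path and API key are mine, adjust as needed):
docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY=MY_API_KEY -e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 -v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half --max-model-len 26000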
Is there any negative to shrinking the model length to fit?
Is this even the right parameter to be changing?
@joe173 appreciate the feedback, we will take a look and see if we can improve the behavior there.
Shrinking the sequence length is a good way to decrease the memory requirements: it caps the size of the KV cache, which can be a very large portion of the memory usage. The downside is that you won't be able to send or generate messages that are quite as long. Otherwise, model accuracy shouldn't be affected.
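To put rough numbers on it (back-of-the-envelope, assuming the published Mistral 7B config of 32 layers, 8 KV heads, and 128 head dim with an fp16 KV cache; these figures are not from the logs above):
per-token KV cache ≈ 2 (K and V) × 32 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 128 KiB
32768 tokens × 128 KiB ≈ 4.0 GiB, while 26000 tokens × 128 KiB ≈ 3.2 GiB, and the 20928-token limit in the error corresponds to roughly 2.6 GiB of KV cache space left over after weights and activations.
So trimming max_model_len mainly trades away headroom for very long prompts and responses.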