Hi @aryason, thanks for your answer, and sorry for the delayed reply. Yes, I did download the Llama3-8b NIM, but while self-hosting it I got the following error:
```
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
return executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 100, in init_device
_check_if_gpu_supports_dtype(self.model_config.dtype)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-DGXS-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
self._run_workers("init_device")
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
return executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 100, in init_device
_check_if_gpu_supports_dtype(self.model_config.dtype)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-DGXS-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] return executor(*args, **kwargs)
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 100, in init_device
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] _check_if_gpu_supports_dtype(self.model_config.dtype)
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] raise ValueError(
(RayWorkerWrapper pid=2678) ERROR 09-25 12:54:42 worker_base.py:157] ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-DGXS-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
```
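For what it's worth, the GPU really does report compute capability 7.0; this is the quick check I used (the `compute_cap` query field needs a fairly recent driver, so treat it as a sketch):

```bash
# Print the name and compute capability of each visible GPU.
# Bfloat16 needs compute capability >= 8.0 (Ampere or newer);
# a Tesla V100 (Volta) reports 7.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```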
Is there a viable way to make the NIM use fp16 rather than bf16? I already tried passing `-e NIM_MODEL_PROFILE="vllm-fp16-tp1"`, but it did not work.
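For completeness, this is roughly how I launched the container (the image tag and cache path are from my setup, and the profile name is the one I guessed at, so treat this as a sketch rather than a known-good command):

```bash
# Launch the Llama3-8b NIM, trying to force an fp16 vLLM profile:
export NGC_API_KEY=<my key>
export LOCAL_NIM_CACHE=~/.cache/nim

docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE="vllm-fp16-tp1" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

If I understand the docs correctly, the container can list the profile IDs it actually supports via its `list-model-profiles` utility; should I be picking one of those IDs instead of the `vllm-fp16-tp1` name?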
Thanks for your support.