How can I pass --dtype=half to the container at runtime? I know my server's GPU has compute capability 7.5, but I would like to use half precision at runtime.

Docker command:

docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY=$NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Error:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting thedtype flag in CLI, for example: --dtype=half.

I also tried this:

docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY=$NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half --max-model-len 26000

Error:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting thedtype flag in CLI, for example: --dtype=half.
/usr/bin/python3.10: Error while finding module specification for 'vllm_nvext.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm_nvext')

Hi @prateek13 – take a look at this similar question: "Model says there is a compatible profile but fails on data type" (reply #2 by neal.vaidya).

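It looks like your command isn't being parsed correctly by the terminal. Because the python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half part is on its own line, the shell most likely runs docker run first with the container's default startup command (which gives the same Bfloat16 error) and then runs python3 on the host, where vllm_nvext is not installed (which gives the ModuleNotFoundError). Make sure the python3 part is on the same line as the rest of the docker run command, right after the image name.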
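
As a rough sketch of what "on the same line" means here, the script below puts the whole invocation in one bash array, so the shell cannot split it into two commands. It uses only the flags and module path from your post. The array name nim_cmd and the RUN switch are made up for illustration, and whether this image accepts --dtype half on a T4 is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: nim_cmd and RUN are hypothetical names, not part of NIM or Docker.
# Everything listed after the image name is handed to the container as its
# command, replacing the image's default startup command.
nim_cmd=(
  docker run -it --rm --gpus all --shm-size=16GB
  -e "NGC_API_KEY=$NGC_API_KEY"
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
  -u "$(id -u)"
  -p 8000:8000
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
  python3 -m vllm_nvext.entrypoints.openai.api_server
  --dtype half --max-model-len 26000
)

# By default, only print the assembled command for review.
# Set RUN=1 in the environment to actually start the container.
if [ "${RUN:-0}" = "1" ]; then
  "${nim_cmd[@]}"
else
  printf '%s ' "${nim_cmd[@]}"
  printf '\n'
fi
```

If you prefer typing the command directly, the equivalent is to keep it on a single line, or to end every line except the last with a backslash so the shell treats it as one command.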