How do I pass --dtype=half to the container at runtime? I know my server GPU's compute capability is 7.5, but I would like to use half at runtime.

Hi @prateek13, take a look at this similar question: "Model says there is a compatible profile but fails on data type" (reply #2 by neal.vaidya).

It looks like your terminal isn't parsing the command correctly. Make sure that the python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half part is on the same line as the rest of the command or, if you split the command across lines, that every line except the last ends with a backslash.
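
If line wrapping keeps getting in the way, one option is to build the command as an argument list so that no shell line-splitting is involved. The sketch below is only an illustration: the image name my-nim-image:latest is a hypothetical placeholder, and the docker options shown (--rm, --gpus all) are assumptions, so substitute whatever your original command uses. Only the python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half part comes from this thread.

```python
import shlex

# Hypothetical placeholder: replace with the image you actually run.
IMAGE = "my-nim-image:latest"


def build_command(image: str, dtype: str = "half") -> list[str]:
    """Return the whole docker invocation as a single argument list."""
    return [
        # Assumed docker options; keep the ones from your own command.
        "docker", "run", "--rm", "--gpus", "all",
        image,
        # Everything after the image name runs inside the container.
        "python3", "-m", "vllm_nvext.entrypoints.openai.api_server",
        "--dtype", dtype,
    ]


command = build_command(IMAGE)

# Print the command on one line so it can be pasted into a terminal as-is.
print(shlex.join(command))

# To launch it without going through a shell at all, you could instead use:
#   import subprocess
#   subprocess.run(command, check=True)
```

Because the arguments live in a list, --dtype half cannot be separated from the rest of the command by an accidental line break.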