I’m trying to run the NVIDIA NIM llama3-8b-instruct container on a T4 GPU. This might be a fool’s errand, but when I run it I get what seems to be a fairly tractable error message:
```
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
```
…but I don’t know how to do this. I tried simply appending it to the end of the docker run command (a fix suggested here):
```
sudo docker run -it --rm --gpus all --shm-size=16GB \
  -e NGC_API_KEY="$NVIDIA_TOKEN" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest \
  --dtype=half
```
But this gave the following error:
```
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: --: invalid option
```
It seems the question linked above was running `python3 -m vllm_nvext.entrypoints.openai.api_server`, whereas my image runs `/opt/nvidia/nvidia_entrypoint.sh`. Quite possibly my image also invokes some Python entry point as an intermediate step, but I don’t know what that command would be, or whether I can skip straight to it.
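For what it’s worth, the `invalid option` error can be reproduced outside the container with plain bash, which makes me think the entrypoint script does something like `exec "$@"` and my extra flag arrives as the first word, so bash’s `exec` builtin tries to parse `--dtype=half` as one of its own options (that the script does `exec "$@"` is my guess, not confirmed):

```shell
# Bash's `exec` builtin only accepts -c, -l, and -a; an argument
# starting with `--` that isn't a command makes it bail out with
# an "exec: --: invalid option" error, matching the container log.
bash -c 'exec --dtype=half'
echo "exit status: $?"
```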
So a few questions:
- Most pressingly, to solve my immediate problem: how can I pass `dtype` to this model? Can I do so at all? Do I need to swap to the more generic NIM image and try from there?
- More generally: can I see the Dockerfile or source code, or find additional documentation, for these images? I would love to know what arguments I can pass when starting them.