How to set dtype for NVIDIA NIM llama3-8b-instruct?

I’m trying to run the NVIDIA NIM llama3-8b-instruct container on a T4 GPU. This might be a fool’s errand, but when I run it I get what seems to be a fairly tractable error message:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.

…but I don’t know how to do this. I tried simply appending it to the docker run command (a fix suggested here):

sudo docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY="$NVIDIA_TOKEN" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest --dtype=half

But this gave the following error:

/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: --: invalid option
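My guess at what’s going on (a sketch of the presumed entrypoint, not the actual NIM script): anything placed after the image name replaces the image’s default CMD and is handed to the entrypoint, which likely ends in `exec "$@"`. A bare flag then trips bash’s `exec` builtin, which you can reproduce locally:

```shell
# Presumed tail of nvidia_entrypoint.sh (an assumption, not the real file):
#   exec "$@"
# Handing a bare flag to `exec` reproduces the same error message:
bash -c 'exec --dtype=half' 2>&1 || true
# prints something like: bash: line 1: exec: --: invalid option
```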

It seems the question above was running python3 -m vllm_nvext.entrypoints.openai.api_server, whereas my image runs /opt/nvidia/nvidia_entrypoint.sh. Quite possibly my image also launches some Python entrypoint as an intermediate step, but I don’t know what that command would be, or whether I can skip straight to it.
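If it helps anyone else, the image’s baked-in entrypoint and default command can be checked with plain docker inspect (standard Docker CLI, nothing NIM-specific):

```shell
# Print the Entrypoint and default CMD of the NIM image; anything you put
# after the image name in `docker run` replaces the CMD shown here and is
# passed to the entrypoint script.
sudo docker inspect \
  --format 'Entrypoint: {{json .Config.Entrypoint}}  Cmd: {{json .Config.Cmd}}' \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```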

So a few questions:

  • Most pressingly / to solve my immediate problem: How can I pass in dtype for this model? Can I do so at all? Do I need to swap to the more generic NIM image & try from there?
  • More generally.. can I see the Dockerfile or source code or find additional documentation for these images? I would love to know generally what arguments I can pass when starting the image.

OK, I read some similar forum posts, such as “How to pass this --dtype=half at the runtime of container? i know my server gpu compatibility is 7.5 but i would like to use half at run time”, and just tried appending the python command. It seems to have mostly worked. Command:

sudo docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY="$NVIDIA_TOKEN" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest \
    python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half

This gave a new error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.56 GiB of which 134.81 MiB is free. Process 12609 has 14.43 GiB memory in use. Of the allocated memory 13.98 GiB is allocated by PyTorch, and 19.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This is probably expected, given that I’m still trying to run an 8B model on a T4 VM.
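For anyone trying to squeeze this onto a T4 anyway: since the entrypoint wraps vLLM’s OpenAI server, the stock vLLM memory flags might help. This is an assumption on my part: --gpu-memory-utilization and --max-model-len are standard vLLM options, but I haven’t verified that the vllm_nvext wrapper forwards them.

```shell
sudo docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY="$NVIDIA_TOKEN" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest \
    python3 -m vllm_nvext.entrypoints.openai.api_server \
        --dtype half \
        --gpu-memory-utilization 0.90 \
        --max-model-len 4096
```

Capping --max-model-len shrinks the KV cache, which is often the difference between fitting and OOM on a 16 GB card.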
