According to the page: vLLM | NVIDIA NGC
I start the HTTP inference server inside the container successfully:
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
Open a new console and run the client:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'
But it reports this error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
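Worth noting: the "Unable to round-trip http request to upstream" wording comes from a proxy, not from curl itself. If `http_proxy`/`HTTP_PROXY` is set in the shell, curl hands the request to the proxy, which then fails to reach port 8000. A sketch of bypassing any proxy for a single request (same server as above; `127.0.0.1` is used because `0.0.0.0` is a bind address, not a normal destination):

```shell
# --noproxy '*' makes curl ignore http_proxy/HTTPS_PROXY for this request.
curl --noproxy '*' -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
  }'
```

If this succeeds while the original command fails, the proxy configuration is the culprit.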
Hi,
Could you share the log when starting the inference server so we can know more about the issue?
Thanks.
root@74ffe1684e65:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85 2>&1 | tee Llama-3.1-8B-Instruct-FP8.txt
Llama-3.1-8B-Instruct-FP8.txt (16.7 KB)
What is your docker run command? I think --net=host helps.
for example
docker run --name vllm --rm -it --network host \
--runtime=nvidia --gpus all --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \
-e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
-v "$HOME/.cache:/root/.cache" \
nvcr.io/nvidia/vllm:25.09-py3
I don’t know if it’s required, but I tell vLLM the host and port to be sure:
python -m vllm.entrypoints.openai.api_server \
--model nvidia/Llama-3.1-8B-Instruct-FP8 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--max-model-len 512 \
--max-num-seqs 2 \
--gpu-memory-utilization 0.25 \
--kv-cache-dtype=auto
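Before sending a full chat request, it can help to confirm the server is reachable at all; the vLLM OpenAI-compatible server also exposes a `/health` endpoint and a `/v1/models` listing (port 8000 is vLLM's default here; adjust if you pass `--port`):

```shell
# Expect HTTP 200 once "Application startup complete." appears in the log.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/health

# Lists the exact model names the server accepts in the "model" field.
curl -s http://127.0.0.1:8000/v1/models
```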
Then this query works for me
curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP4",
"messages": [{"role":"user", "content": "What are Chihuahuas famous for?"}]
}' |jq
First, my docker command is the same as on the page:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
Then, following your instructions exactly, it still reports the error:
Unable to round-trip http request to upstream: dial tcp 127.0.0.1:8000: connect: connection refused
There is a specific proxy configuration for Docker:
nvidia@ThorA:~$ cat .docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://192.168.2.211:39355",
      "httpsProxy": "http://192.168.2.211:39355",
      "noProxy": "127.0.0.0/8"
    }
  }
}
Is that the reason? But I need the proxy; without it, I get a connection error.
Following may work with your proxy. I don’t have a proxy to test.
# replace with your actual proxy and any networks you don't want proxied. you don't have to do the exports but it may help.
export HTTP_PROXY=http://192.168.2.211:39355
export HTTPS_PROXY=http://192.168.2.211:39355
export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
docker run -d --rm \
--name vllm \
--runtime nvidia \
--gpus all \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 6678:6678 \
nvcr.io/nvidia/vllm:25.09-py3
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 6678 \
--model nvidia/Llama-3.1-8B-Instruct-FP8
curl http://localhost:6678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user","content":"Say hello in one sentence."}],
"temperature": 0.2
}' |jq
If localhost doesn’t work in the query, try 127.0.0.1 or 0.0.0.0.
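A quick way to check that the NO_PROXY list above actually takes effect: point curl at a deliberately dead proxy, and a request to a NO_PROXY host should still succeed. Note that older curl versions may not honor CIDR ranges (like 192.168.0.0/16) in no_proxy, so listing exact addresses such as 127.0.0.1 is safest. A sketch, assuming the port-6678 server suggested above:

```shell
# A dead proxy on purpose: only requests that bypass it can succeed.
export http_proxy=http://127.0.0.1:1
export no_proxy=127.0.0.1,localhost

# Succeeds only if curl skipped the proxy for 127.0.0.1.
curl -s http://127.0.0.1:6678/v1/models
```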
Hi,
We just tested the same command on our local device, and it works, so the command itself is fine.
$ docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
root@8d0005b63337:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
...
(APIServer pid=362) INFO: Started server process [362]
(APIServer pid=362) INFO: Waiting for application startup.
(APIServer pid=362) INFO: Application startup complete.
Here is the output of the file you mentioned from our side, which doesn’t have the proxy setting.
$ cat .docker/config.json
{
  "auths": {
    "nvcr.io": {
      "auth": "xxx"
    }
  }
}
Could you try @whitesscott’s suggestion to see if you can access the vLLM server?
Thanks.
Docker and the server inside Docker are OK on my device too.
Please try connecting to the server with a client, as the page describes:
Open a new console and run the client:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'
On my side, it reports the error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
nvidia@ThorA:~$ export HTTP_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export HTTPS_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
nvidia@ThorA:~$ docker run -d --rm \
    --name vllm \
    --runtime nvidia \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 6678:6678 \
    nvcr.io/nvidia/vllm:25.09-py3
1770b6d7d9f72039d8dd45913b1541af225dd9a8dbb39814c0f76be8af8a8640
nvidia@ThorA:~$ python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 6678 \
    --model nvidia/Llama-3.1-8B-Instruct-FP8
/usr/bin/python: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm.entrypoints')
nvidia@ThorA:~$
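The ModuleNotFoundError is the key clue here: the `nvidia@ThorA:~$` prompt shows the `python -m vllm...` command ran on the host, where vLLM is not installed, rather than inside the container. Since the container was started detached with `--name vllm`, one option (a sketch, untested on Thor) is to launch the server inside it with `docker exec`:

```shell
# Run the API server inside the already-running container, not on the host.
docker exec -it vllm \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3.1-8B-Instruct-FP8 \
    --host 0.0.0.0 --port 6678
```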
Docker is running in the background now.
Try the following for now, just to get it working, without running the container as a -d daemon:
docker run -it --rm --net=host --runtime nvidia --privileged \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v $HOME/.cache:/root/.cache \
-v $PWD:/workspace \
--workdir /workspace \
-e HF_TOKEN \
nvcr.io/nvidia/vllm:25.09-py3
Then, inside the vllm container, confirm this path exists:
ls /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/
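As an alternative to listing the directory, you can ask Python directly whether the entrypoint module is importable; `importlib.util.find_spec` locates a module without running it (do this inside the container, since on the host the vllm package is absent):

```shell
# Prints the module's file path if vLLM's OpenAI entrypoint is installed.
python3 -c "import importlib.util; spec = importlib.util.find_spec('vllm.entrypoints.openai.api_server'); print(spec.origin if spec else 'not installed')"
```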
Still in container, try this
python -m vllm.entrypoints.openai.api_server \
--model nvidia/Llama-3.1-8B-Instruct-FP8 \
--host 0.0.0.0 \
--port 6678 \
--tensor-parallel-size 1 \
--max-model-len 512 \
--max-num-seqs 2 \
--gpu-memory-utilization 0.25 \
--kv-cache-dtype=auto
After you see
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
In a second terminal
curl http://0.0.0.0:6678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user","content":"Say hello in one sentence."}],
"temperature": 0.2
}' |jq
This topic was automatically closed by the system on December 2, 2025, 14 days after the last reply. New replies are no longer allowed.