vLLM client connection refused

According to the page: vLLM | NVIDIA NGC

Starting the HTTP inference server inside the container works fine:

(APIServer pid=157) INFO:     Started server process [157]
(APIServer pid=157) INFO:     Waiting for application startup.
(APIServer pid=157) INFO:     Application startup complete.

Open a new console and run the client:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'

However, it reports the error:

Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused

Hi,

Could you share the log from starting the inference server so we can learn more about the issue?
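
It may also help to confirm where the connection is failing before sending the chat request. For example (assuming the server was started on the default port 8000; these are generic checks, not from the page), inside the container:

curl http://127.0.0.1:8000/v1/models

and on the host, to see whether anything is listening on port 8000 there:

ss -tlnp | grep 8000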

Thanks.

root@74ffe1684e65:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85 2>&1 | tee Llama-3.1-8B-Instruct-FP8.txt

Llama-3.1-8B-Instruct-FP8.txt (16.7 KB)

What is your docker run command? I think --net=host helps.

For example:

docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -v "$HOME/.cache:/root/.cache" \
  nvcr.io/nvidia/vllm:25.09-py3
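
With --network host the container shares the host's network stack, so a server listening on 0.0.0.0:8000 inside the container is reachable at 127.0.0.1:8000 on the host without any -p port mapping. Once the server is up, you can confirm from the host, for example:

ss -tlnp | grep 8000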

I don't know if it's required, but I tell vLLM the host and port to be sure:

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto

Then this query works for me

curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP4",
"messages": [{"role":"user", "content": "What are Chihuahuas famous for?"}]
}' |jq

First, my docker command is the same as the one on the page:

docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3

Then, following your instructions exactly, it still reports the error:

Unable to round-trip http request to upstream: dial tcp 127.0.0.1:8000: connect: connection refused

There is a specific configuration for the Docker proxy:

nvidia@ThorA:~$ cat .docker/config.json
{
	"proxies": {
		"default": {
			"httpProxy": "http://192.168.2.211:39355",
			"httpsProxy": "http://192.168.2.211:39355",
			"noProxy": "127.0.0.0/8"
		}
	}
}

Is that the reason? But I do need the proxy; without it, there is a connection error.

The following may work with your proxy. I don't have a proxy to test with.

# Replace with your actual proxy and any networks you don't want proxied.
# The exports aren't strictly required, but they may help.
export HTTP_PROXY=http://192.168.2.211:39355
export HTTPS_PROXY=http://192.168.2.211:39355
export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
docker run -d --rm \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 6678:6678 \
  nvcr.io/nvidia/vllm:25.09-py3
  python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 6678 \
    --model nvidia/Llama-3.1-8B-Instruct-FP8
curl http://localhost:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role":"user","content":"Say hello in one sentence."}],
        "temperature": 0.2
      }' |jq

If localhost doesn't work in the query, try 127.0.0.1 or 0.0.0.0.
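
Also, the "Unable to round-trip http request to upstream" text usually comes from a proxy rather than from curl itself, which suggests the request is still being routed through the proxy. A quick test (a generic check, adjust the port to your setup) is to bypass the proxy for a single request:

curl --noproxy '*' http://127.0.0.1:6678/v1/models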

Hi,

We just tested the same command on our local device, and it works.
So the command itself is fine.

$ docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
root@8d0005b63337:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
...
(APIServer pid=362) INFO:     Started server process [362]
(APIServer pid=362) INFO:     Waiting for application startup.
(APIServer pid=362) INFO:     Application startup complete.

Here is the content of the file you mentioned on our side, which doesn't have the proxy setting.

$ cat .docker/config.json
{
	"auths": {
		"nvcr.io": {
			"auth": "xxx"
		}
	}
}
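
One note on the proxies section of ~/.docker/config.json: it only controls the proxy environment variables that the Docker CLI injects into containers it creates; it does not affect curl commands run directly in the host shell. If the proxy is needed inside the container (for example, to download the model), a sketch like the following, with localhost addresses added to noProxy, should keep requests to the local server off the proxy (untested, adjust to your setup):

{
	"proxies": {
		"default": {
			"httpProxy": "http://192.168.2.211:39355",
			"httpsProxy": "http://192.168.2.211:39355",
			"noProxy": "127.0.0.1,localhost,0.0.0.0,127.0.0.0/8"
		}
	}
}

The host-side curl still needs its own no_proxy setting or the --noproxy flag.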

Could you try @whitesscott's suggestion to see if you can access the vLLM server?

Thanks.

Docker and the server inside Docker are OK on my device too.

Please try connecting to the server with a client, as the page describes:

Open a new console and run the client:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'

On my side, it reports the error:

Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused

nvidia@ThorA:~$ export HTTP_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export HTTPS_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
nvidia@ThorA:~$ docker run -d --rm \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 6678:6678 \
  nvcr.io/nvidia/vllm:25.09-py3
1770b6d7d9f72039d8dd45913b1541af225dd9a8dbb39814c0f76be8af8a8640

nvidia@ThorA:~$
nvidia@ThorA:~$ python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 6678 \
  --model nvidia/Llama-3.1-8B-Instruct-FP8
/usr/bin/python: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm.entrypoints')
nvidia@ThorA:~$

The container is running in the background now.
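
That ModuleNotFoundError is because the python command ran in the host shell, where vLLM isn't installed; the server has to be started inside the container. If the detached container is still up under the name vllm, one option (an assumption, not verified here) is:

docker exec -it vllm python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 6678 \
  --model nvidia/Llama-3.1-8B-Instruct-FP8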

Try the following for now, just to get it working, and don't run the container as a -d daemon.

docker run -it --rm --net=host --runtime nvidia --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME/.cache:/root/.cache \
  -v $PWD:/workspace \
  --workdir /workspace \
  -e HF_TOKEN \
  nvcr.io/nvidia/vllm:25.09-py3

Then, inside the vLLM Docker container, confirm this path exists:
ls /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/

Still inside the container, try this:

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 6678 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto

After you see

(APIServer pid=157) INFO:     Started server process [157]
(APIServer pid=157) INFO:     Waiting for application startup.
(APIServer pid=157) INFO:     Application startup complete.

In a second terminal

curl http://0.0.0.0:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role":"user","content":"Say hello in one sentence."}],
        "temperature": 0.2
      }' |jq
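
If the chat request fails, a simpler reachability check against the same endpoint (bypassing any proxy) is:

curl --noproxy '*' http://0.0.0.0:6678/v1/models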
