Inference requests to the API are taking very long

I have been trying to run inference against the NIM API, but the requests take a very long time:

[2025-12-18 03:30:49,971][httpx][INFO] - HTTP Request: POST https://integrate.api.nvidia.com/v1/chat/completions "HTTP/1.1 200 OK"
[2025-12-18 03:32:10,143][httpx][INFO] - HTTP Request: POST https://integrate.api.nvidia.com/v1/chat/completions "HTTP/1.1 200 OK"
[2025-12-18 03:42:10,673][openai._base_client][INFO] - Retrying request to /chat/completions in 0.497196 seconds
[2025-12-18 03:42:46,472][httpx][INFO] - HTTP Request: POST https://integrate.api.nvidia.com/v1/chat/completions "HTTP/1.1 200 OK"
[2025-12-18 03:42:48,132][httpx][INFO] - HTTP Request: POST https://integrate.api.nvidia.com/v1/embeddings "HTTP/1.1 200 OK"
[2025-12-18 03:42:48,621][shinka.core.novelty_judge][INFO] - Top-5 similarity scores: ['0.97']
[2025-12-18 03:47:18,910][openai._base_client][INFO] - Retrying request to /chat/completions in 0.490957 seconds
[2025-12-18 03:51:49,718][openai._base_client][INFO] - Retrying request to /chat/completions in 0.826494 seconds
[2025-12-18 03:56:20,819][backoff][INFO] - Backing off query_nvidia(...) for 0.8s (openai.APIConnectionError: Connection error.)
[2025-12-18 03:56:20,819][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 1 due to error: Connection error.. Waiting 0.8s...
[2025-12-18 04:00:52,054][openai._base_client][INFO] - Retrying request to /chat/completions in 0.414927 seconds
[2025-12-18 04:10:52,748][openai._base_client][INFO] - Retrying request to /chat/completions in 0.762142 seconds
[2025-12-18 04:15:23,824][backoff][INFO] - Backing off query_nvidia(...) for 1.4s (openai.APIConnectionError: Connection error.)
[2025-12-18 04:15:23,824][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 2 due to error: Connection error.. Waiting 1.4s...
[2025-12-18 04:19:55,584][openai._base_client][INFO] - Retrying request to /chat/completions in 0.491452 seconds
[2025-12-18 04:24:26,365][openai._base_client][INFO] - Retrying request to /chat/completions in 0.913321 seconds
[2025-12-18 04:28:57,720][backoff][INFO] - Backing off query_nvidia(...) for 3.5s (openai.APIConnectionError: Connection error.)
[2025-12-18 04:28:57,720][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 3 due to error: Connection error.. Waiting 3.5s...
[2025-12-18 04:39:01,505][openai._base_client][INFO] - Retrying request to /chat/completions in 0.391560 seconds
[2025-12-18 04:43:32,209][openai._base_client][INFO] - Retrying request to /chat/completions in 0.843957 seconds
[2025-12-18 04:48:03,491][backoff][INFO] - Backing off query_nvidia(...) for 4.9s (openai.APIConnectionError: Connection error.)
[2025-12-18 04:48:03,491][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 4 due to error: Connection error.. Waiting 4.9s...
[2025-12-18 04:52:38,675][openai._base_client][INFO] - Retrying request to /chat/completions in 0.397126 seconds
[2025-12-18 05:02:39,442][openai._base_client][INFO] - Retrying request to /chat/completions in 0.892992 seconds
[2025-12-18 05:12:40,643][backoff][INFO] - Backing off query_nvidia(...) for 3.5s (openai.APITimeoutError: Request timed out.)
[2025-12-18 05:12:40,643][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 5 due to error: Request timed out.. Waiting 3.5s...
[2025-12-18 05:17:14,157][openai._base_client][INFO] - Retrying request to /chat/completions in 0.451072 seconds
[2025-12-18 05:21:44,898][openai._base_client][INFO] - Retrying request to /chat/completions in 0.946891 seconds
[2025-12-18 05:26:16,186][backoff][INFO] - Backing off query_nvidia(...) for 10.2s (openai.APIConnectionError: Connection error.)
[2025-12-18 05:26:16,186][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 6 due to error: Connection error.. Waiting 10.2s...
[2025-12-18 05:36:26,551][openai._base_client][INFO] - Retrying request to /chat/completions in 0.468014 seconds
[2025-12-18 05:40:57,508][openai._base_client][INFO] - Retrying request to /chat/completions in 0.857667 seconds
[2025-12-18 05:45:28,435][backoff][INFO] - Backing off query_nvidia(...) for 14.1s (openai.APIConnectionError: Connection error.)
[2025-12-18 05:45:28,436][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 7 due to error: Connection error.. Waiting 14.1s...
[2025-12-18 05:50:12,899][openai._base_client][INFO] - Retrying request to /chat/completions in 0.400749 seconds
[2025-12-18 06:00:13,503][openai._base_client][INFO] - Retrying request to /chat/completions in 0.803331 seconds
[2025-12-18 06:00:19,107][backoff][INFO] - Backing off query_nvidia(...) for 4.0s (openai.APIConnectionError: Connection error.)
[2025-12-18 06:00:19,107][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 8 due to error: Connection error.. Waiting 4.0s...
[2025-12-18 06:04:53,478][openai._base_client][INFO] - Retrying request to /chat/completions in 0.398750 seconds
[2025-12-18 06:09:26,468][openai._base_client][INFO] - Retrying request to /chat/completions in 0.760055 seconds
[2025-12-18 06:19:32,943][backoff][INFO] - Backing off query_nvidia(...) for 11.8s (openai.APITimeoutError: Request timed out.)
[2025-12-18 06:19:32,943][shinka.llm.models.nvidia][INFO] - NVIDIA - Retry 9 due to error: Request timed out.. Waiting 11.8s...
[2025-12-18 06:29:48,940][openai._base_client][INFO] - Retrying request to /chat/completions in 0.465329 seconds
[2025-12-18 06:34:19,739][openai._base_client][INFO] - Retrying request to /chat/completions in 0.968840 seconds


These are the models I'm using:

- "deepseek-ai/deepseek-v3.2"
- "nvidia/nemotron-3-nano-30b-a3b"
- "moonshotai/kimi-k2-thinking"
- "mistralai/devstral-2-123b-instruct-2512"
- "mistralai/mistral-large-3-675b-instruct-2512"
- "deepseek-ai/deepseek-v3.1-terminus"

The input and output are always under 3,000 tokens. Why is this so slow?

Hi @duytannguyen479, a couple of troubleshooting questions:

  1. What client library are you using to send the requests? Are you setting a timeout value manually?
  2. For the requests that are taking a long time, do they eventually succeed? Or are they just retrying forever?
  3. How frequently are you sending requests? We have some rate limiting built into our APIs, and you could be running up against that.
  4. Does streaming output or sending very short requests have any different behavior?
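
On point 1: the ~10-minute gaps between consecutive "Retrying request" lines in your log suggest each attempt is sitting through a long per-attempt timeout before the client-level retries and the outer `backoff` retries kick in, which compounds quickly. Here is a rough worst-case sketch under assumed defaults (the OpenAI Python client's 600 s timeout and `max_retries=2` are assumptions, not confirmed from your setup, as is the outer retry cap):

```python
# Rough worst-case wall-clock estimate for the retry pattern in the log.
# Assumptions (illustrative only): a 600 s per-attempt read timeout,
# 2 client-level retries (3 attempts per call), and up to 10 outer
# backoff retries with full-jitter exponential waits like the
# 0.8 s - 14.1 s sleeps shown in the log.
import random

CLIENT_TIMEOUT_S = 600   # assumed per-attempt timeout
CLIENT_ATTEMPTS = 3      # assumed: 1 initial try + 2 client retries
OUTER_RETRIES = 10       # assumed cap on backoff-wrapped retries

def outer_backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff, similar to the sub-minute waits in the log."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Each outer retry can burn CLIENT_ATTEMPTS * CLIENT_TIMEOUT_S seconds just
# waiting out timeouts; the jittered sleeps are negligible next to that.
worst_case_s = OUTER_RETRIES * CLIENT_ATTEMPTS * CLIENT_TIMEOUT_S
print(f"worst case is roughly {worst_case_s / 3600:.1f} hours, dominated by timeouts")
```

If this matches your setup, lowering the per-attempt timeout (and letting failures surface sooner) usually helps more than tuning the backoff sleeps, since the timeout term dominates the total.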