NIM HTTP API Inference (Run Anywhere) Taking Extremely Long!

Hello,

I have been using NVIDIA NIM to run inference with meta/llama-3.1-405b-instruct and meta/llama-3.1-70b-instruct for the past 3-4 days, and I have been noticing extremely high inference times (>100 sec on average), even though my input prompts are under 1024 tokens.

Although this issue was not noticeable for the first few days, it is very apparent now. I have collectively used over 600 credits (calls) on these models. I have also double-checked with clients in different languages, and it seems to be a recurring pattern: with Python, LangChain, Node, and Shell, I am seeing much longer wait times for responses from the NIM-hosted backends.
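
For reference, here is roughly the kind of call I am timing, a minimal Python sketch using the OpenAI-compatible client against the hosted endpoint (the base URL, model name, and NVIDIA_API_KEY environment variable reflect my setup and may differ from yours):

```python
# Time a single chat completion against the hosted NIM endpoint.
# Assumes NVIDIA_API_KEY holds an API catalog key.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=256,
    temperature=0.2,
)
elapsed = time.perf_counter() - start

print(f"Elapsed: {elapsed:.1f}s")
print(response.choices[0].message.content)
```

Even simple prompts like this one are regularly taking well over 100 seconds to return.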

Would this be due to high traffic from many customers, or is it something I can troubleshoot on my end? If you need more information, please let me know and I will do my best to provide it.

Thanks in advance!

The NVIDIA API catalog offers a no-cost trial of NVIDIA NIM, so you may see extended wait times during periods of high load. To ensure consistent performance, we recommend the following options:

  1. Self-host the NIM on your cloud provider or on-premises (see the client sketch after this list). Research and testing use is free with NVIDIA Developer Program access. Please note that your organization must have an NVIDIA AI Enterprise license for production use.

  2. Use the serverless NIM API on Hugging Face with pay-per-use pricing. The NVIDIA AI Enterprise license is included with this option, so you don't need a separate license.
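
Because a self-hosted NIM exposes the same OpenAI-compatible API as the hosted catalog endpoint, your existing client code should only need its base URL changed. A minimal sketch, assuming a NIM container serving meta/llama-3.1-70b-instruct is running locally and listening on port 8000 (the commonly documented default):

```python
# Point the same OpenAI-compatible client at a locally hosted NIM
# instead of the shared catalog endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # placeholder; a local NIM does not require the catalog API key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

With dedicated GPUs behind it, this avoids the queuing you may be seeing on the shared trial endpoint.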