NIM HTTP API Inference (Run Anywhere) Taking Extremely Long!

Hello,

I have been using NVIDIA NIM to run inference with meta/llama-3.1-405b-instruct and meta/llama-3.1-70b-instruct for the past 3-4 days, and I have been noticing extremely high inference times (>100 sec on average), even though my input prompts are under 1024 tokens.

Although this issue was not noticeable for the first few days, it is very apparent now. I have collectively used over 600 credits (calls) on these models. I have also double-checked with clients in different languages, and it seems to be a recurring pattern: with Python, LangChain, Node, and Shell, I am seeing much longer wait times for responses from the NIM-hosted backends.
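
For reference, here is roughly the kind of call I am timing, a minimal Python sketch using the OpenAI-compatible client against the hosted endpoint (the base URL, model name, and NVIDIA_API_KEY environment variable reflect my setup and may differ from yours):

```python
# Time a single chat completion against the hosted NIM endpoint.
# Assumes NVIDIA_API_KEY holds an API catalog key.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=256,
    temperature=0.2,
)
elapsed = time.perf_counter() - start

print(f"Elapsed: {elapsed:.1f}s")
print(response.choices[0].message.content)
```

Even simple prompts like this one are regularly taking well over 100 seconds to return.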

Would this be due to high traffic from many customers, or is it something I can troubleshoot on my end? If you need more information, please let me know and I will do my best to provide it.

Thanks in advance!

The NVIDIA API catalog offers a no-cost trial of NVIDIA NIM, so you may see extended wait times during periods of high load. To ensure consistent performance, we recommend the following options:

  1. Self-host the NIM on your cloud provider or on-premises (see the client sketch after this list). Research and testing use is free with NVIDIA Developer Program access. Please note that your organization must have an NVIDIA AI Enterprise license for production use.

  2. Use the serverless NIM API on Hugging Face with pay-per-use pricing. The NVIDIA AI Enterprise license is included with this option, so you don't need a separate license.
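
Because a self-hosted NIM exposes the same OpenAI-compatible API as the hosted catalog endpoint, your existing client code should only need its base URL changed. A minimal sketch, assuming a NIM container serving meta/llama-3.1-70b-instruct is running locally and listening on port 8000 (the commonly documented default):

```python
# Point the same OpenAI-compatible client at a locally hosted NIM
# instead of the shared catalog endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # placeholder; a local NIM does not require the catalog API key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

With dedicated GPUs behind it, this avoids the queuing you may be seeing on the shared trial endpoint.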