Hello,
I have been using NVIDIA NIM for inference with meta/llama-3.1-405b-instruct and meta/llama-3.1-70b-instruct for the past 3-4 days, and I have been noticing exorbitantly high inference times (>100 seconds on average), even though my input prompts are under 1024 tokens.
Although this issue was not noticeable for the first few days, it is pretty apparent now. I have collectively used over 600 credits (calls) on these models. I have also double-checked with different client languages for the NIM API, and it seems to be a recurring pattern: with Python, LangChain, Node, and Shell, I am facing much longer wait times for responses from the NIM-hosted backends.
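For reference, here is a minimal sketch of how I am timing the calls from Python against the OpenAI-compatible NIM endpoint (the API key is a placeholder, and the prompt is just an example):

```python
import time
from openai import OpenAI  # pip install openai

# NIM exposes an OpenAI-compatible endpoint; substitute your own API key.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # placeholder
)

start = time.perf_counter()
completion = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# Total round-trip time for a single non-streaming request.
print(f"Round-trip latency: {elapsed:.1f} s")
```

The Node and Shell (curl) versions do the same thing against the same endpoint, and they all show comparable wait times, so I don't believe it is specific to any one client library.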
Could this be due to high traffic from other customers, or is it something I can troubleshoot on my end? If you need more information, please let me know and I will do my best to provide it.
Thanks in advance!