After setting up a server using the NVIDIA Speculative Decoding sample, I am able to run AnythingLLM on the GB10 using the setup link below.
However, when pointing AnythingLLM's built-in NVIDIA NIM LLM provider at the Speculative Decoding server,
I can only select the main model nvidia/Llama-3.3-70B-Instruct-FP4; the draft model nvidia/Llama-3.1-8B-Instruct-FP4 is not available.
Do you have any suggestions regarding this question?
Thank you.
When using NVIDIA NIM in AnythingLLM, the model selection list only shows nvidia/Llama-3.3-70B-Instruct-FP4 and does not include nvidia/Llama-3.1-8B-Instruct-FP4.
However, in “Step 3: Test the draft–target setup,” the test command requires specifying nvidia/Llama-3.1-8B-Instruct-FP4 as the speculative model.
Although NVIDIA NIM can still chat normally in AnythingLLM, I’m wondering: Is AnythingLLM’s NVIDIA NIM integration actually using Speculative Decoding?
If not, could you please also help with the following questions? Thank you:
Is there any way for AnythingLLM’s NVIDIA NIM integration to directly use NVIDIA Speculative Decoding after setting up the server as described in NVIDIA’s documentation? Or does the UI currently require manual integration to specify the speculative model?
What methods are available to confirm whether NVIDIA Speculative Decoding is being used? Is it possible to enable server logs?
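For context on why only one model shows up in the selection list: clients such as AnythingLLM typically populate their model picker from the server's OpenAI-compatible /v1/models endpoint, which in a draft–target setup may list only the target model. The draft model is a server-side detail, so its absence from the list does not by itself mean speculative decoding is off. A minimal sketch of that check, using a hypothetical /v1/models response (the exact payload depends on your server setup):

```python
import json

# Hypothetical response body from the server's OpenAI-compatible
# /v1/models endpoint (e.g. `curl http://localhost:8000/v1/models`);
# the actual payload depends on your speculative-decoding server setup.
sample_response = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "nvidia/Llama-3.3-70B-Instruct-FP4", "object": "model"}
  ]
}
""")

# Collect the model IDs the server advertises to clients.
listed = [m["id"] for m in sample_response["data"]]
print(listed)

# If only the target model is listed, a client like AnythingLLM can only
# select that model; whether the draft model is actually applied per
# request is then only observable in the server's own logs.
print("nvidia/Llama-3.1-8B-Instruct-FP4" in listed)
```

In other words, if your server's /v1/models output looks like the sample above, the draft model being missing from AnythingLLM's dropdown is expected, and the server logs are the place to confirm whether speculative decoding is actually engaged.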
Hi, AnythingLLM is a third-party platform, and its NIM integration is being deprecated per the notice on their site. As with any third-party tooling, NVIDIA has limited visibility and support. Please contact AnythingLLM directly for further assistance.