[SUPPORT] Workbench Example Project: Local RAG

Hi! This is the support thread for the Local RAG Example Project on GitHub. Any major updates we push to the project will be announced here. Further, feel free to discuss, raise issues, and ask for assistance in this thread.

Please keep discussion in this thread project-related. Any issues with the Workbench application should be raised as a standalone thread. Thanks!

I’ve noticed that the chat window will truncate the output randomly. Is there a setting to increase the output rows?

I’m trying to use this example with a Nemotron model, but I get a json.decode error from the HuggingFaceTextGenInference section of get_llm(). Can anyone guide me on how to use this with Nemotron models?

Hey Brian, the defaults are set to generate roughly a paragraph-length response (e.g., 4-5 sentences), but feel free to play around with the code and set the number of new tokens generated (or any other hyperparameter) to an appropriate value. For example, max_new_tokens is set to 100 by default in chains.py:

def get_llm() -> LangChainLLM:
    """Create the LLM connection."""
    inference_server_url_local = ""

    llm_local = HuggingFaceTextGenInference(
        inference_server_url=inference_server_url_local,
        max_new_tokens=100,  # raise this for longer responses
        # other hyperparameters (temperature, top_k, top_p, ...) can be set here
    )
    return LangChainLLM(llm=llm_local)
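For illustration, here is a hedged sketch of how those defaults could be overridden. The build_gen_kwargs helper and every default value other than max_new_tokens=100 are hypothetical, not taken from chains.py:

```python
# Hypothetical helper, not from chains.py: merge user overrides on top of the
# default generation hyperparameters. Only max_new_tokens=100 is confirmed by
# the project; the other defaults here are placeholders for illustration.
DEFAULT_GEN_KWARGS = {
    "max_new_tokens": 100,  # default in chains.py; raise for longer answers
    "temperature": 0.7,     # placeholder value
    "top_p": 0.9,           # placeholder value
}

def build_gen_kwargs(**overrides):
    """Return the default hyperparameters with any overrides applied."""
    kwargs = dict(DEFAULT_GEN_KWARGS)
    kwargs.update(overrides)
    return kwargs

# Ask for a longer response by raising max_new_tokens:
gen_kwargs = build_gen_kwargs(max_new_tokens=512)
```

Passing the merged dictionary to the LLM constructor keeps the change in one place instead of editing each keyword argument by hand.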

Thanks - I’ll change that today and give it a test.

I’m now getting an error where chat does not run after rebuilding the environment from scratch. I have followed all the steps I did previously, when this worked. Here is the error message that appears on screen. I did notice that there is a new HuggingFace model, meta-llama/Llama-2-7b-hf, suffixed with the word chat; has there been a change? I’ve also attached the error log.

chat error log.txt (5.5 MB)

A clean rebuild of the project this morning now allows chat to run successfully. However, it doesn’t seem to be able to use the provided knowledge base. I will continue to test.

Hi Brian,

We’re investigating and triaging. We will get back to you shortly.


Thanks - appreciate the feedback.

Hi there!
Is it possible to split the LLAMA 7b-chat-hf model across two GPUs? I’m currently using 2x RTX 4090. In the text-generation-webui project, you can split the model equally across two devices. I only landed on this project today.