[SUPPORT] Workbench Example Project: Local RAG

Hi! This is the support thread for the Local RAG Example Project on GitHub. Any major updates we push to the project will be announced here. Further, feel free to discuss, raise issues, and ask for assistance in this thread.

Please keep discussion in this thread project-related. Any issues with the Workbench application should be raised as a standalone thread. Thanks!

I’ve noticed that the chat window will truncate the output randomly. Is there a setting to increase the output rows?

I’m trying to use this example with a Nemotron model, but it gives a json.decode error from get_llm(), in the HuggingFaceTextGenInference section. Can anyone guide me on how to use this with Nemotron models?

Hey Brian, the defaults are set to generate roughly a paragraph-length response (e.g., 4-5 sentences), but feel free to play around with the code and set the number of new tokens generated (or any other hyperparameter) to an appropriate value. For example, max_new_tokens is set to 100 by default in chains.py:

from functools import lru_cache

# Note: these import paths match the LangChain/LlamaIndex versions this
# project pinned at the time; they may differ in newer releases.
from langchain.llms import HuggingFaceTextGenInference
from llama_index.llms import LangChainLLM


@lru_cache
def get_llm() -> LangChainLLM:
    """Create the LLM connection to the locally running inference server."""
    inference_server_url_local = "http://127.0.0.1:9090/"

    llm_local = HuggingFaceTextGenInference(
        inference_server_url=inference_server_url_local,
        max_new_tokens=100,       # caps the length of each generated response
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.7,
        repetition_penalty=1.03,
        streaming=True,
    )

    return LangChainLLM(llm=llm_local)
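
For instance, here is the same constructor with a larger cap. The value 512 below is purely illustrative, not a recommended setting:

llm_local = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:9090/",
    max_new_tokens=512,        # raised from the default of 100 for longer responses
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.7,
    repetition_penalty=1.03,
    streaming=True,
)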

Thanks - I’ll change that today and give it a test.

I’m now getting an error where chat does not run after rebuilding the environment from scratch. I followed all the same steps I did before, when this worked. Here is the error message that appears on screen. I did notice that there is a new Hugging Face meta-llama/Llama-2-7b-hf variant suffixed with the word chat - has there been a change? I’ve also attached the error log.


chat error log.txt (5.5 MB)

A clean rebuild of the project this morning now allows chat to run successfully. However, it doesn’t seem to be able to use the provided knowledge base. I will continue to test.

Hi Brian,

We’re investigating and triaging. We will get back to you shortly.

Tyler

Thanks - appreciate the feedback.

Hi there!
Is it possible to split the Llama 7b-chat-hf model across two GPUs? I’m currently using 2x RTX 4090. In the text-generation-webui project, you can split the model equally across two devices. I landed on this project just today.
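
(Note: this project talks to the model through a local inference server rather than loading it in-process, so text-generation-webui’s per-device split doesn’t carry over directly. As a general Hugging Face pattern, not this project’s code, splitting a model across two GPUs in-process looks roughly like the sketch below; the max_memory caps are illustrative.)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate spread the layers across both GPUs;
# max_memory sets an optional per-device cap (values here are illustrative).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},
)

TGI itself supports tensor-parallel sharding via its --num-shard launcher flag, though whether this project’s server exposes that setting would need to be confirmed by the maintainers.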

When I try to upload a document, I get an error. It is a text document, but it just says “error”. Furthermore, before I try the upload, my chatbot works fine; after I attempt the upload, submitting any text to the chatbot gives an error and I need to restart the environment.

Any ideas why uploading text documents is causing it to go into an error state?

Do you mind providing logs and screenshots of the issue? The vector database takes a while to spin up, so you may be uploading documents before the database is ready to receive them. But I wouldn’t know for sure without logs/screenshots.

We have tried to mitigate this by building progress bars and warnings into the updated Hybrid RAG project here: GitHub - NVIDIA/workbench-example-hybrid-rag: An NVIDIA AI Workbench example project for Retrieval Augmented Generation (RAG)
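
In the meantime, a generic workaround is to wait for the database to answer a health check before uploading anything. A minimal sketch, assuming a hypothetical health endpoint (check the project’s configuration for the real service address):

import time

import requests

def wait_until_ready(url: str, timeout: float = 120.0) -> bool:
    """Poll a health endpoint until the service responds, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).ok:
                return True
        except requests.RequestException:
            pass  # service not up yet; keep polling
        time.sleep(2)
    return False

# Hypothetical address; replace with the real health endpoint of the vector database.
if wait_until_ready("http://localhost:9091/healthz"):
    print("Vector DB ready; safe to upload documents.")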

All,

This project has been refreshed to become the Hybrid RAG example project. This Local RAG project will likely be discontinued.

GitHub repo here

New DevZone thread here

This refreshed project includes, but is not limited to:

  • Local RAG as a subset, alongside new features for Cloud and Microservice/NIM-based RAG
  • An updated UI that allows for expanded user settings
  • Updates and bug fixes