[SUPPORT] Workbench Example Project: Hybrid RAG

Hi! This is the support thread for the Hybrid RAG Example Project on GitHub. Any major updates we push to the project will be announced here. Further, feel free to discuss, raise issues, and ask for assistance in this thread.

Please keep discussion in this thread project-related. Any issues with the Workbench application should be raised as a standalone thread. Thanks!

What is the Workbench admin password? I would like to add a debugging tool using JupyterLab, but it requires the sudo password.

Thanks for posting.

Workbench containers are configured to run as a non-root user without sudo privileges.

To add dependencies and change the JupyterLab setup, you should:

  • add the necessary dependencies through the package manager in the UI, or add them to the requirements.txt file
  • include any needed configuration steps in the postBuild.bash script
  • rebuild the Project container.

Let me know if this makes sense.

We can also post an example for you.
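For example, a rough sketch of a postBuild.bash might look like the following; the extension and environment variable names below are just placeholders, not something the project itself requires:

```bash
#!/bin/bash
# postBuild.bash runs during the Project container build, after the packages
# from requirements.txt (and the Packages UI) have been installed.
set -e

# Example: enable a JupyterLab server extension installed via requirements.txt
# (placeholder name -- substitute whatever you actually installed).
jupyter server extension enable jupyterlab_git

# Example: persist an environment variable for future sessions.
echo 'export MY_DEBUG_FLAG=1' >> ~/.bashrc
```

Add any pip packages to requirements.txt (one per line), put configuration steps like these in postBuild.bash, and then rebuild the Project container from the UI.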

Thank you. I used the “Packages” feature in the UI. It supports apt and pip, so it handled just about everything I needed.

Awesome!!

We are going to build out the documentation on that so things are clearer.

Release Notes (05/13/2024):

  1. General Chat Application Updates and Improvements.

    • Dedicated startup landing page.
      • Instead of the backend initializing on whatever action the user happens to take first, there is now a dedicated button to press on page load; it runs the initial setup, unlocks the tabs, and auto-redirects when the initial setup is ready.
      • Upon any page refresh, this button action will recognize both API server and vector database readiness and provide information to the user if needed.
      • This should improve transparency and ease of use, and reduce user error during chatbot startup.
    • A latency info box has been added to the chat box; it works across all inference modes and displays the inference latency of the last generated output.
    • Replaced hard-coded timeouts with periodic polling via curl requests. This should reduce failures and speed up load time. The processes below will now all await a 200 status code via periodic curl requests instead of waiting out a fixed timeout (see the polling sketch after these release notes):
      • Initial RAG backend startup
      • Local inference server startup
      • Local NIM startup
      • Vector DB warmup
    • Fixed an issue where timeouts during the initial setup would prevent navigation between inference mode tabs and/or cause lag when doing so.
  2. Updates to Cloud Endpoints: New models added to better reflect the selection on the build.nvidia.com API Catalog

    • Cloud Endpoint model selection has tripled from 4 to 12 models, now sorted into model families for organization.
    • Mistral Model Family: Added Mistral Large and Mixtral 8x22b
    • Llama Model Family: Added Llama 3 8B and Llama 3 70B, removed Llama 2 13B to reflect NVIDIA’s API Catalog
    • Added support for Google’s model(s): Gemma 2B, Gemma 7B, Code Gemma 7B
    • Added support for Microsoft’s model(s): Phi-3 Mini
    • Added support for Snowflake’s model(s): Arctic
    • Submit button is now disabled until a model is selected from the dropdown.
  3. Updates to local TGI inference: New models added to run locally on the TGI inference server

    • Added support for ungated model NVIDIA Llama3-ChatQA-1.5-8B
    • Added support for gated models Mistral 7B Instruct v0.2 and Llama 3 8B Instruct.
    • Added a validation check when selecting/loading models: a warning is shown if the selected model is gated on Hugging Face and the user does not have a Hugging Face token configured.
  4. Updates to microservice Inference: Remote microservice refreshed and Local NIM flow better streamlined for ease of use.

    • Added support for remotely running non-NIM third-party services like Ollama, as long as they support the OpenAI API spec (see the example request after these release notes).
    • Added an optional Port field on the Remote tab to allow for greater user flexibility. Port defaults to the 9999 port used by NIMs.
    • Removed the IP address and model name lock on the Remote tab when a Local NIM is running. Now, you can run inference against a local NIM and a remote microservice concurrently.
    • For Local NIM, removed the previously manual prerequisite step that required users to generate the engine file for their GPU and organize the model repo before working in this inference mode.
      • Now a two-click, automated process, provided the right configs are set. (1) Generate a model repository, and then (2) start running the NIM.
      • The flow is currently supported for mistral-7b-instruct-v0.1 to keep it streamlined. For other models, users will need to edit the code base.
    • README has been updated to better show the proper configs that need to be set in AI Workbench for this inference mode. Note that these configs do not come with the project by default (for conciseness) and require user setup.
    • A supplemental README has been added to local-nim-configs to broadly introduce the steps needed to swap away from the default model and flow.
  5. Improved document upload UI

    • Fixed an issue where PDFs would not appear by default in the file browser
    • Fixed an issue where the file component on the page was too small to display multiple uploaded files.
    • File component is now interactive and supports drag-and-drop in addition to click-to-upload, and the “Upload Documents” button has been removed for clarity.
  6. Improved logic and robustness surrounding local NIM Gradio UI components

    • To parallel the existing setup for the local inference mode, users in local NIM mode cannot submit a query when the local microservice option is selected but the local NIM has not been started, and this state persists.
    • Additionally, checks have been added to the local NIM spin-up script to ensure the README prerequisites for local NIM are in place, as these configs do not come with the project by default.
  7. Improved system introspection

    • The application will introspect any GPUs connected to the project and auto-populate the recommended level of model quantization for local inference based on detected VRAM. Recommended levels are as specified in the README.
    • The application will introspect for any connected model repositories (“model-store”) in the specified LOCAL_NIM_HOME location and will enable the “Start Microservice” button only when a model store is detected. If none is detected, the user should generate a model repo by selecting the appropriate button.
  8. Improved warning and info popups.

    • Submitting a query for Remote NIM with an empty settings field will result in a warning before any response is generated.
    • Upon successful startup of the RAG backend, an Info popup will appear notifying the user that the vector database is still taking a few moments to spin up.
    • Any subsequent page refresh will show the same Info popup until the vector DB is ready for use, at which point no popup will be displayed to the user.
    • Various other similar informational and warning popups have been added.
    • README instructions, imagery, and the information for each inference mode tab have been refreshed to be more up to date.
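As referenced in item 1 above, here is a rough sketch of the kind of readiness polling that replaces the fixed timeouts; the URL, retry count, and interval are illustrative placeholders rather than the project's actual values:

```bash
#!/bin/bash
# Poll a service until it returns HTTP 200 instead of sleeping for a fixed time.
URL="http://localhost:8000/health"   # placeholder endpoint
MAX_RETRIES=60
SLEEP_SECONDS=2

for ((i = 1; i <= MAX_RETRIES; i++)); do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$URL" || true)
    if [ "$status" = "200" ]; then
        echo "Service is ready."
        exit 0
    fi
    sleep "$SLEEP_SECONDS"
done

echo "Service did not become ready in time." >&2
exit 1
```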
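Similarly, for the remote microservice support mentioned in item 4, a third-party service qualifies as long as it exposes the OpenAI-style chat completions route. A minimal request against such a service might look like this; the host and model name are placeholders (Ollama, for instance, serves this API on port 11434 by default):

```bash
# Minimal OpenAI-spec chat completion request against a remote microservice.
curl http://your-remote-host:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```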

(5/15/2024) Pushed a hotfix to improve support for OCR in PDF files.