[SUPPORT] Workbench Example Project: Hybrid RAG

Hi! This is the support thread for the Hybrid RAG Example Project on GitHub. Any major updates we push to the project will be announced here. Further, feel free to discuss, raise issues, and ask for assistance in this thread.

Please keep discussion in this thread project-related. Any issues with the Workbench application should be raised as a standalone thread. Thanks!

What is the Workbench admin password? I would like to add a debugging tool via JupyterLab, but it requires the sudo password.

Thanks for posting.

Workbench containers are configured to run as a non-root user without sudo privileges.

To add dependencies and change the JupyterLab setup, you should:

  • Add the necessary dependencies through the package manager in the UI, or add them to the requirements.txt file.
  • Then, include any needed configuration steps in the postBuild.bash script.
  • Finally, rebuild the Project container.

Let me know if this makes sense.

We can also post an example for you.

Thank you. I used the “Packages” feature in the UI. It supports apt and pip, so it handled just about everything I needed.

Awesome!!

We are going to build out the documentation on that so things are clearer.

Release Notes (05/13/2024):

  1. General Chat Application Updates and Improvements.

    • Dedicated startup landing page.
      • Instead of the backend initializing on the user’s first action, there is now a dedicated setup button to press on page load; it unlocks the tabs and automatically redirects once the initial setup is ready.
      • On any page refresh, this button recognizes both API server and vector database readiness and informs the user as needed.
      • This should improve transparency and ease of use and reduce user error during chatbot startup.
    • A latency info box has been added to the chat window; it works across all inference modes and displays the inference latency of the last generated output.
    • Replaced hard-coded timeouts with periodic polling via curl requests. This should reduce failures and speed up load times. The processes below now all await an HTTP 200 response via periodic polling instead of waiting for a fixed timeout (see the sketch after these notes):
      • Initial RAG backend startup
      • Local inference server startup
      • Local NIM startup
      • Vector DB warmup
    • Fixed an issue where timeouts during initial setup would prevent or slow navigation between inference mode tabs.
  2. Updates to Cloud Endpoints: New models added to better reflect the selection on the build.nvidia.com API Catalog

    • Cloud Endpoint model selection has tripled from 4 to 12 models, now sorted into model families for organization.
    • Mistral Model Family: Added Mistral Large and Mixtral 8x22b
    • Llama Model Family: Added Llama 3 8B and Llama 3 70B, removed Llama 2 13B to reflect NVIDIA’s API Catalog
    • Added support for Google’s model(s): Gemma 2B, Gemma 7B, Code Gemma 7B
    • Added support for Microsoft’s model(s): Phi-3 Mini
    • Added support for Snowflake’s model(s): Arctic
    • Submit button is now disabled until a model is selected from the dropdown.
  3. Updates to local TGI inference: New models added to run locally on the TGI inference server

    • Added support for ungated model NVIDIA Llama3-ChatQA-1.5-8B
    • Added support for gated models Mistral 7B Instruct v0.2 and Llama 3 8B Instruct.
    • Added a validation check when selecting/loading models: a warning is shown if the selected model is gated on Hugging Face and the user has not configured a Hugging Face token.
  4. Updates to microservice Inference: Remote microservice refreshed and Local NIM flow better streamlined for ease of use.

    • Added support for remotely-running non-NIM third-party services such as Ollama, provided they support the OpenAI API spec.
    • Added an optional Port field on the Remote tab for greater flexibility. The port defaults to 9999, the port used by NIMs.
    • Removed the IP address and model name lock on the Remote tab when a local NIM is running. You can now run inference against a local NIM and a remote microservice concurrently.
    • For Local NIM, removed the previously manual prerequisite step that required users to generate the engine file for their GPU and organize the model repo before working in this inference mode.
      • This is now a two-click, automated process, provided the right configs are set: (1) generate a model repository, then (2) start running the NIM.
      • The flow is currently supported for mistral-7b-instruct-v0.1; for other models, users will need to edit the code base.
    • The README has been updated to better show the configs that need to be set in AI Workbench for this inference mode. Note that these configs do not ship with the project by default and require user setup.
    • A supplemental README has been added to local-nim-configs to broadly introduce the steps needed to swap away from the default model and flow.
  5. Improved document upload UI

    • Fixed an issue where PDFs would not appear by default in the file browser
    • Fixed an issue where the file component on the page was too small to display multiple uploaded files.
    • File component is now interactive and supports drag-and-drop in addition to click-to-upload, and the “Upload Documents” button has been removed for clarity.
  6. Improved logic and robustness surrounding local NIM Gradio UI components

    • To parallel the existing setup for local inference mode, users in local NIM mode cannot submit a query when the local microservice option is selected but the local NIM has not been started; this check persists across refreshes.
    • Additionally, checks have been added to the local NIM spin-up script to ensure the README prerequisites for local NIM are in place, as these configs do not ship with the project by default.
  7. Improved system introspection

    • The application introspects any GPUs connected to the project and auto-populates the recommended level of model quantization for local inference based on detected VRAM. Recommended levels are as specified in the README.
    • The application introspects for any connected model repositories (“model-store”) in the specified LOCAL_NIM_HOME location and enables the “Start Microservice” button only when a model store is detected. If none is detected, the user should generate a model repo by selecting the appropriate button.
  8. Improved warning and info popups.

    • Submitting a query for Remote NIM with an empty settings field results in a warning before any response is generated.
    • Upon successful startup of the RAG backend, an info popup notifies the user that the vector database is still taking a few moments to spin up.
    • Any subsequent page refresh shows the same info popup until the vector DB is ready for use, after which no popup is displayed.
    • Other similar info and warning popups have been added.
    • The README instructions and imagery, as well as the information for each inference mode tab, have been refreshed to be more up to date.
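
For anyone curious about the polling behavior mentioned in item 1, below is a minimal sketch of the idea in Python. The health-check URL, timeout, and interval here are illustrative assumptions rather than the project's actual values; the project itself polls with curl from its startup scripts.

import time
import requests

def wait_for_ready(url: str, timeout_s: int = 300, interval_s: int = 5) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False

# Example: wait for a locally running inference server (URL is an assumption).
if wait_for_ready("http://localhost:8000/health"):
    print("Inference server is ready.")
else:
    print("Timed out waiting for the inference server.")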

(5/15/2024) Pushed hotfix to improve support for OCR in PDF files.

(05/28/2024) Pushed updates for spec v2.1 compatibility (AI Workbench version 0.50.16)

Release Notes (06/03/2024):

  1. Improved support for NIMs, now in GA! Sign up here for access.

    • Replaced the model name field with a full container image/tag field for ease of use (e.g., copy-paste).
      • Improved local NIM switchability by replacing the model field with the full NIM container path. Users can copy and paste their NIM container directly in the chat UI.
      • Improved Local NIM flow to replace model-repo-generate step with a NIM sidecar container pull step to better align with new NIM release.
    • Fixed an issue with Remote NIM support returning null token for vLLM-backend NIMs
    • Set defaults for the project settings to better align with the quickstart contents in the NIM documentation (now uses vLLM backend)
  2. Improved Metrics Tracking

    • Removed the “clear query” button to accommodate the Show Metrics panel functionality.
    • Added support for new metrics:
      • retrieval time (ms)
      • TTFT (ms)
      • generation time (ms)
      • E2E (ms)
      • approx. tokens in response
      • approx. tokens generated per second
      • approx. inter-token latency (ITL)
  3. Expanded Cloud supported models (12 → 18)

    • Added support for IBM’s Granite Code models to better align with NVIDIA’s API Catalog
      • Granite 8B Code Instruct
      • Granite 34B Code Instruct
    • Widened support for Microsoft’s Phi-3 models to better align with NVIDIA’s API Catalog
      • Phi-3 Mini (4k)
      • Phi-3 Small (8k)
      • Phi-3 Small (128k)
      • Phi-3 Medium (4k)
    • Implemented a temporary workaround for an issue with Microsoft’s Phi-3 models not supporting penalty parameters.
  4. Expanded local model selection for locally-running RAG

    • Added ungated model for local HF TGI: microsoft/Phi-3-mini-128k-instruct
    • Added an option to filter the local models dropdown by gated vs. ungated models.
  5. Additional Output Customization

    • Added support for new Output Settings parameters:
      • top_p
      • frequency penalty
      • presence penalty
    • Increased the maximum number of new tokens to generate from 512 to 2048.
    • The max new tokens limit is now set dynamically based on automatic system introspection (see the sketch after these notes).
  6. General Usability

    • Reduced UI clutter by making some major UI components collapsible.
      • The right-hand inference settings panel can collapse and expand to reduce clutter.
      • Output parameter sliders are now hidden by default to reduce clutter, but can be expanded.
    • Improved error messaging and forwarding of issues to the frontend UI.
    • Increased timeouts to accommodate a broader range of user setups.
    • Ongoing improvements to code documentation.
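
To make the new output settings and metrics concrete, here is a rough sketch of how the parameters above map onto a request to an OpenAI-compatible endpoint, along with a back-of-the-envelope way to approximate tokens per second and inter-token latency. The URL, port, model name, and the 4-characters-per-token heuristic are placeholder assumptions for illustration; the project's UI handles all of this for you, and TTFT in particular requires streaming, which this sketch skips.

import time
import requests

URL = "http://localhost:9999/v1/chat/completions"  # placeholder endpoint
payload = {
    "model": "mistral-7b-instruct-v0.1",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
    "max_tokens": 2048,          # new 2048 cap (previously 512)
    "top_p": 0.9,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=120).json()
e2e_s = time.time() - start

text = resp["choices"][0]["message"]["content"]
approx_tokens = max(1, len(text) // 4)  # rough heuristic: ~4 characters per token
print(f"E2E: {e2e_s * 1000:.0f} ms")
print(f"approx. tokens in response: {approx_tokens}")
print(f"approx. tokens/sec: {approx_tokens / e2e_s:.1f}")
print(f"approx. inter-token latency: {e2e_s * 1000 / approx_tokens:.1f} ms")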

Release Notes (6/27/2024):

  1. Expanded selection of cloud endpoint models.
    • NVIDIA:
      • Llama3-ChatQA-1.5-8b
      • Llama3-ChatQA-1.5-70b
      • Nemotron-4-340b-instruct
    • Mistral:
      • Mistral-7b-Instruct-v0.3
      • Codestral-22b-instruct-v0.1
    • Upstage:
      • Solar-10.7B-Instruct
  2. Improved document introspection
    • See which documents you have uploaded or deleted, as well as their current status, via the new Show Documents button under the file upload widget.
  3. Miscellaneous
    • Updated README and documentation.
    • Refactoring of code for organizational clarity.

Pulled down this repo and ran it on my local AI workstation today with an RTX 3080 Ti (12 GB) GPU and an ungated model. I was surprised at how easy it was to get this running with nvidia/Llama3-ChatQA-1.5-8B.

Minor nits

  1. It asked me for the cloud endpoint key before it allowed me to select Local System instead of CloudEndpoint
  2. I thought microsoft/Phi-3-mini-128k-instruct would run on my 12 GB card, but it did not. There was a server error, “ERR: You may have timed out or are facing memory issues. In AI Workbench, check Output > Chat for details.”, that told me to go to “output” in a red window that disappeared. I have no idea where those logs are. nvidia-smi never showed memory increasing, so is it some prep stage that dies?
  3. Is there ever a situation where you wouldn’t use the Vector Database?
  4. nvidia/Llama3-ChatQA-1.5-8B seems about 3 years out of date, given some of the date answers it gave. Is there a model with a later training cutoff?
  5. Is there any way to preserve the vector store across restarts?

Here is a video walking through the short process of cloning the Hybrid RAG project through to running a first query, all with a local GPU and vector database.

Hello Guys

I have been experimenting with RAG and I seem to have a challenge uploading a PDF demo document that is less than 1 MB in size. My questions are:

  • Are there any document size limits when uploading to RAG in AI Workbench?
  • Is there a way I can get around this issue? At about 75%, the upload returns an error.

Looking forward to your help

  1. It asked me for the cloud endpoint key before it allowed me to select Local System instead of CloudEndpoint

    • This actually brings up an interesting point. The original rationale for making the cloud endpoint option the “default” is that (1) many users arrive at the project from the build.nvidia.com API catalog and will want to test those endpoints out in development, and (2) cloud endpoints can be used regardless of hardware (e.g., CPU-only systems). The drawback is the added step of configuring the API key.
    • We could consider setting local inference as the default in a future update, which would eliminate the configuration step for the cloud API token and make it easier to “just get up and running” with minimal barriers, but it introduces a new bottleneck: users would need a capable enough GPU to run local models. We think the former option is more accessible to users, but if that calculation has changed, we are happy to adapt.
  2. I thought microsoft/Phi-3-mini-128k-instruct would run on my 12 GB card, but it did not. There was a server error, “ERR: You may have timed out or are facing memory issues. In AI Workbench, check Output > Chat for details.”, that told me to go to “output” in a red window that disappeared. I have no idea where those logs are. nvidia-smi never showed memory increasing, so is it some prep stage that dies?

    • Got it, it should be able to run on that card. Ensure you select “Load Model” before “Start Server” to prefetch the model weights. The UI error message is somewhat generic with the intention of pointing users to the chat logs for more granular diagnosis. On the Workbench UI, you can navigate to Output > Chat (from dropdown) to see the granular, real-time logs of the chat application.
    • If you had no other process running on the GPU, my guess is the download stage may have timed out. Typically you can restart the environment and try the download again. Those logs should give you a better picture, and feel free to post them in this thread if you need further help.
  3. Is there ever a situation where you wouldn’t use the Vector Database?

    • Retrieval Augmented Generation (RAG) by definition requires some form of datastore to hold the documents you want to retrieve and use. If you were just doing plain inferencing with an out-of-the-box model as a chatbot, you could forgo the vector DB (see the sketch after these answers).
  4. nvidia/Llama3-ChatQA-1.5-8B seems about 3 years out of date, given some of the date answers it gave. Is there a model with a later training cutoff?

    • You can read more about that model and its pretraining datasets here. As a recent model, it should have fairly recent recall, though I don’t know the exact cutoff date.
    • You can also try the more recent models, such as Llama 3 8B/70B Instruct. These newer models may have a later pretraining cutoff than some older models.
  5. Is there any way to preserve the vector store across restarts?

    • This project is meant more as a single end-to-end workflow to demonstrate the power of RAG and the flexibility of a hybrid setup.
    • For a developer kit where each component of RAG is broken out as its own project application, I recommend checking out the NIM-anywhere kit. It is a Workbench project that is more heavily focused on NIM usage, but it puts together a simple RAG application with the relevant components broken out as separate applications; the database can be persistent as long as it is running.
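
To illustrate the point in answer 3 above, here is a highly simplified sketch of plain inference versus RAG. The retrieve and generate functions are hypothetical placeholders standing in for the vector database lookup and for whichever inference mode you have selected; they are not the project's actual functions.

def retrieve(query: str, k: int = 4) -> list[str]:
    # Placeholder for a vector DB similarity search (e.g., Milvus).
    return ["<document chunk 1>", "<document chunk 2>"]

def generate(prompt: str) -> str:
    # Placeholder for the selected inference backend (cloud, local, or NIM).
    return f"<model output for: {prompt[:40]}...>"

def plain_chat(query: str) -> str:
    # No vector DB involved: the model answers purely from its own weights.
    return generate(query)

def rag_chat(query: str) -> str:
    # RAG: ground the answer in retrieved document chunks.
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)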

Hope this helps!

PS. That video is awesome! Thanks for sharing it with us :)

Hi! Thanks for reaching out. Happy to help!

  • Are there any document size limits when uploading to RAG in AI Workbench?
  • Is there a way I can get around this issue? At about 75%, the upload returns an error.

So a 1MB file should be fine. Do you mind doing the following? Go to your AI Workbench UI window and click Output. Then, select Chat from the dropdown. You will be able to find an error message there. Do you mind forwarding the error here so we can take a look? Thanks!

In the meantime, I can speak to the vector database more generally. The vector store included in the project by default is Milvus Lite, which is not a production-grade vector store and is meant for prototyping and development. It may be set to CPU by default instead of GPU acceleration, so you may find uploading documents to be a bit slow, depending on your hardware, network, and setup. If you are facing a ReadError/Timeout issue at 120 s, you can extend the window by adjusting the timeout parameters here; a rough sketch of that change follows below.
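
For reference, here is roughly what extending that timeout looks like. The endpoint, payload shape, and function signature below are assumptions for illustration; in the project itself you would edit the timeout value passed to requests.post in code/chatui/chat_client.py, per the linked instructions.

import requests

UPLOAD_URL = "http://localhost:8000/uploadDocument"  # placeholder endpoint

def upload_documents(file_paths: list[str]) -> None:
    for path in file_paths:
        with open(path, "rb") as f:
            # Raise the read timeout (in seconds) if large uploads keep timing out.
            requests.post(UPLOAD_URL, files={"file": f}, timeout=300)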

I think I have found the error regarding (2). Is this what you are seeing?

NotImplementedError: rope scaling type longrope is not implemented or invalid rank=0 Error: ShardCannotStart

A cursory search reveals this may be due to the extended context length (128k) causing issues.

Assuming this is the issue, we may consider validating and replacing that model with a lower context-length version, e.g. 4k, in a future release. In the meantime, you can also type your own model names into the dropdown field if you would like to play around with other non-default models from Hugging Face :)

microsoft/Phi-3-mini-128k-instruct not loading:

Might be a library version mismatch with a missing positional parameter, or a problem with the latest text-generation-inference. They can tell me if I’m misreading their codebase and the error message.

Opened Regression: get_weights_col_packed_qkv() quantize parameter not included in calls so fails with error missing positional parfameter - merge error? · Issue #2236 · huggingface/text-generation-inference · GitHub

This looks relevant; something about a missing parameter:

2024-07-15T02:50:11.941076Z  INFO text_generation_launcher: Detected system cuda
Polling inference server. Awaiting status 200; trying again in 5s. 
2024-07-15T02:50:14.461678Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 723, in main
    return _main(
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 647, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 897, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 479, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 400, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 401, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 333, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 150, in __init__
    self.query_key_value = load_attention(config, prefix, weights, index)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 66, in load_attention
    base_layer = TensorParallelColumnLinear.load_qkv(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 151, in load_qkv
    weight = weights.get_weights_col_packed_qkv(
TypeError: Weights.get_weights_col_packed_qkv() missing 1 required positional argument: 'quantize'
2024-07-15T02:50:15.716159Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

and

TypeError: Weights.get_weights_col_packed_qkv() missing 1 required positional 
argument: 'quantize' rank=0
Error: ShardCannotStart
2024-07-15T21:52:01.495921Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-15T21:52:01.495947Z  INFO text_generation_launcher: Shutting down shards

Good Day @edwli

I have changed the timeout from 120 to 200 seconds in my GitHub copy, according to the link shared.

The error I am getting is as follows:

  File "/project/code/chatui/pages/converse.py", line 978, in _document_upload
    file_paths = utils.upload_file(files, client)
  File "/project/code/chatui/pages/utils.py", line 47, in upload_file
    client.upload_documents(file_paths)
  File "/project/code/chatui/chat_client.py", line 118, in upload_documents
    _ = requests.post(
  File "/home/workbench/.conda/envs/ui-env/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/home/workbench/.conda/envs/ui-env/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/workbench/.conda/envs/ui-env/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/workbench/.conda/envs/ui-env/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/workbench/.conda/envs/ui-env/lib/python3.10/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=200)

Looking forward to your assistance and thanking you for the prompt responses.

Regards

A fix was applied in the Hugging Face text-generation-inference project, and it looks like container rebuilds pull in the fix: Remove stray `quantize` argument in `get_weights_col_packed_qkv` by danieldk · Pull Request #2237 · huggingface/text-generation-inference · GitHub

microsoft/Phi-3-mini-128k-instruct not loading:

I cleared the cache and built a new container image to pick up the latest Hugging Face update.

  1. Searches run fine if the vector database is disabled.
  2. Searches always throw an error if the vector database is enabled.

Problem report moved to GitHub: Searches all fail if vector database enbabled · Issue #13 · NVIDIA/workbench-example-hybrid-rag · GitHub

Updated the PR with how I was able to get a new container image working by changing a library version number.

For others, on my AMD Ryzen system:

  1. Uploading a single 1.5GB document took 21 seconds to index
  2. Uploading a single 5.3GB document timed out at 360 seconds, running with an extended timeout

Indexing a document seems to be a POST to http:///uploadDocument. Where are the logs for that service?

Yes, I know it is a demo, but I wish the vector DB ran in a separate process, as it looks like it did in a previous container.