Access RAG project endpoint via Python code

Hi, I cloned the Agentic RAG repo and built the project successfully. Using the Chat UI, I am able to upload documents, query them, and get relevant results. Now I want to access that endpoint from Python code, but I cannot figure out exactly which endpoint to call for inference and what parameters to pass. Any suggestions would be very helpful. (The Chat UI is running on localhost port 10000.)

Python code:

import requests

query = "What are the symptoms of dementia?"
response = requests.post(
    "http://localhost:10000/projects/agentic-rag/applications/chat/",  # or /chat or whatever endpoint your RAG server is exposing
    json={"query": query},
)
print(response.json())

Agentic RAG repo: GitHub - ananaygupta70/workbench-example-agentic-rag - An NVIDIA AI Workbench example project for an Agentic Retrieval Augmented Generation (RAG)


Hi ananaygupta70, thanks for reaching out.

That project uses endpoints from build.nvidia.com for inference. It retrieves context from a locally running vector database and sends the query to the build.nvidia.com-hosted endpoints to generate the answer, which means there is no locally running inference server component to ping on localhost.

What you can do is hit those same endpoints the project uses from your own Python code. The model cards on Build provide boilerplate code for this, e.g. for Llama 3.1 70B Instruct:

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$API_KEY_REQUIRED"
)

completion = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": ""}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
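
A small practical note: rather than hard-coding the key, you can read it from an environment variable. A minimal sketch, assuming you export the key you generated on build.nvidia.com under a variable named NVIDIA_API_KEY (the variable name is just an assumption, not something the project requires):

import os
from openai import OpenAI

# NVIDIA_API_KEY is an assumed variable name; export your build.nvidia.com key under it first
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)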

You can also hit these endpoints with a LangChain wrapper, e.g.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

client = ChatNVIDIA(
    model="meta/llama-3.1-70b-instruct",
    api_key="$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC",
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
)

for chunk in client.stream([{"role": "user", "content": ""}]):
    print(chunk.content, end="")
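
If you only need a single, non-streaming answer, the same client can be called with invoke() (the standard LangChain Runnable interface). A minimal sketch reusing the query from your post:

# One-shot call instead of streaming; returns an AIMessage whose .content holds the answer
response = client.invoke([{"role": "user", "content": "What are the symptoms of dementia?"}])
print(response.content)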

Hope this helps clarify things!