Jetson AI Lab - Agent Controller LLM

JETSON AI LAB RESEARCH GROUP

  • Project - Agent Controller LLM
  • Team Leads - @dusty_nv, Akash James, REBOTNIX

This project integrates a higher-level conversational LLM for interfacing with the user (either via text input or ASR from a microphone) and for dynamically tasking/reconfiguring the agent pipeline based on user commands and queries.

For example, the user should be able to say something like “if you see the door open, send me an alert”, and the LLM will output code to prompt a multimodal vision model, followed by the hooks for event detection and actions/alerts. Or they can say “hey robot, follow me”, and the robot’s perception & navigation system will begin tracking the person.
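As a rough illustration of the kind of code the controller LLM might emit for the first case, here is a minimal hypothetical sketch - the vlm.query() and send_alert() helpers are placeholders, not an existing API:

```python
import time

def monitor_door(vlm, send_alert, interval=2.0):
    """Hypothetical code the controller could generate for
    'if you see the door open, send me an alert'."""
    alerted = False
    while True:
        reply = vlm.query("Is the door open? Answer yes or no.")  # prompt the vision model
        if "yes" in reply.lower() and not alerted:
            send_alert("The door is open.")   # fire the user's alert hook once
            alerted = True
        elif "no" in reply.lower():
            alerted = False                   # re-arm after the door closes again
        time.sleep(interval)
```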

Current vision/language models (VLMs) like LLaVA are not as conversational in nature as text-based LLMs like Llama, and may represent just one possible domain-expert ‘worker model’ that the controller agent can invoke (alongside ViTs like OWL-ViT for open-vocabulary object detection, etc.).

Further, having such a higher-level controller agent in place can lead to more adaptive, intelligent system behaviors. Akash James and Gary Hilgemann (REBOTNIX) have independently had encouraging experiences with these multi-model dynamic agent architectures that merit further investigation.

There are lower-level features needed in the LLM generation API to accomplish this, including function-calling (or ‘tools’, as they are referred to in the OpenAI ecosystem). Descriptions of the functions available for the bot to invoke are embedded in the system prompt (normally in JSON format for parameter consistency) - for example, IMAGE_QUERY(), DETECT_OBJECT(), SEARCH_VECTORDB(), GET_TIME(), PERFORM_ACTION(), etc. Then, when the LLM determines it necessary, it outputs JSON or Python code invoking one or more of these functions/plugins.
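A sketch of how those function descriptions could be embedded in the system prompt and dispatched - the function names follow the examples above, but the JSON schema and dispatch format shown here are illustrative assumptions, not a fixed spec:

```python
import json
import datetime

# Illustrative tool registry; the "func" entries are what actually runs locally.
TOOLS = {
    "GET_TIME": {
        "description": "Return the current local time.",
        "parameters": {},
        "func": lambda: datetime.datetime.now().strftime("%H:%M:%S"),
    },
    "IMAGE_QUERY": {
        "description": "Ask the vision model a question about the latest camera frame.",
        "parameters": {"prompt": "string"},
        "func": lambda prompt: "(VLM reply would go here)",  # placeholder for the vision model call
    },
}

def build_system_prompt():
    """Embed the tool descriptions (as JSON) into the system prompt."""
    specs = {name: {k: v for k, v in tool.items() if k != "func"}
             for name, tool in TOOLS.items()}
    return ("You are an agent controller. Call a function by outputting JSON "
            'like {"function": "NAME", "args": {...}}. Available functions:\n'
            + json.dumps(specs, indent=2))

def dispatch(call_text):
    """Run a call the LLM emitted, e.g. {"function": "GET_TIME", "args": {}}."""
    call = json.loads(call_text)
    return TOOLS[call["function"]]["func"](**call.get("args", {}))
```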

Initial experiments assessing Llama’s ability to situationally call these are also encouraging - the part that remains is integration at the generation level, so that when the LLM actually outputs the code snippets, they are detected mid-generation, run, and the results injected into the bot output (for example, the current time or the result of a search query). At that point, LLM output generation continues, with the bot now having knowledge of the result since it is included in the prior context.
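A simplified sketch of that generation-level loop, assuming a streaming interface - model.stream() here is a hypothetical stand-in, not NanoLLM’s actual API, and the dispatch() function is the one from the sketch above:

```python
import re

CALL_PATTERN = re.compile(r'\{"function":.*?"args":.*?\}\}')  # crude detector for emitted JSON calls

def generate_with_tools(model, prompt, dispatch):
    """Stream tokens; when a function call appears mid-generation, run it,
    inject the result into the bot output, and resume generation with the
    result now part of the prior context."""
    output = ""
    scanned = 0   # offset past which we look for new calls (avoid re-running old ones)
    while True:
        interrupted = False
        for token in model.stream(prompt + output):
            output += token
            match = CALL_PATTERN.search(output, scanned)
            if match:
                result = dispatch(match.group(0))      # run the tool call
                output += f"\nRESULT: {result}\n"      # inject the result into the output
                scanned = len(output)                  # don't match this call again
                interrupted = True
                break                                  # restart generation with updated context
        if not interrupted:
            return output
```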

There are other prompt engineering techniques to experiment with in this realm as well for building more complex agents, such as auto-prompting, chain-of-thought (CoT), and guidance/grammars for constrained output. Many projects have explored these - LangChain, Microsoft JARVIS, BabyAGI, etc. - that we can borrow techniques from.
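As one lightweight stand-in for grammar-constrained output (real grammar support would hook into the decoder itself), here is a sketch that validates the model’s reply against an expected JSON shape and re-prompts on failure - the generate() callable and the required keys are assumptions for illustration:

```python
import json

def constrained_json(generate, prompt, required_keys=("function", "args"), retries=3):
    """Keep asking until the model returns JSON with the expected keys.
    `generate(prompt) -> str` is a hypothetical text-completion callable."""
    for _ in range(retries):
        text = generate(prompt)
        try:
            obj = json.loads(text)
            if isinstance(obj, dict) and all(k in obj for k in required_keys):
                return obj
        except json.JSONDecodeError:
            pass
        prompt += ("\nYour last reply was not valid JSON with keys "
                   f"{list(required_keys)} - please answer again with only that JSON.")
    raise ValueError("model failed to produce constrained JSON output")
```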

Our remit is the optimized integration of such techniques for building an adaptive assistant that provides low-latency, responsive user experiences through vision and verbal conversation, and that can intuitively learn and be customized for each user.


Self-Learning Llama-3 Voice Agent with Function Calling and Automatic RAG

Enable the LLM (Meta-Llama-3-8B) to invoke Python functions you give it access to, including the ability to save/retrieve info that it learns about you over time. Run locally on Jetson Orin, using Llama-3-8B-Instruct, Riva ASR, and Piper TTS through NanoLLM.

See the docs for function calling here: Chat — NanoLLM 24.4.2 documentation
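For the save/retrieve part specifically, here is a minimal sketch of the idea in plain Python - it does not use NanoLLM’s actual function-calling API (see the linked docs for that); it just shows two tools the LLM could be given to persist and recall facts it learns about the user:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("user_memory.json")   # simple on-disk store; a vector DB would also work

def _load():
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}

def save_info(key: str, value: str) -> str:
    """Tool the LLM can call to remember something about the user."""
    memory = _load()
    memory[key] = value
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))
    return f"Saved '{key}'."

def retrieve_info(key: str) -> str:
    """Tool the LLM can call to recall a previously saved fact."""
    return _load().get(key, f"No info saved under '{key}'.")
```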
