I’m currently developing an autonomous robot on Jetson Nano that can engage in natural conversations with humans. The core vision of this project goes beyond simple command execution - I’m building a robot that understands context, expresses empathy, and maintains flowing conversations just like ChatGPT.
Project Vision and Motivation
Watching robots respond mechanically with “Moving to charging station” to commands like “Go to the charger,” I wanted to try a different approach. When a user says “Why am I so tired today?”, my robot responds with “Did you stay up late last night? How about some coffee? I can accompany you to the café.” This is the kind of natural human-robot dialogue I’m striving to achieve.
Technical Stack and Implementation
Core Technology Stack
- Hardware Platform: NVIDIA Jetson Nano 4GB
- Speech Recognition: Faster-Whisper (OpenAI Whisper optimized for edge devices)
- Language Model: Local LLM server (Ollama/llama.cpp compatible)
- Text-to-Speech: Google TTS (planning to migrate to Coqui TTS for offline use)
- Audio Processing: SoX for recording, MPG123 for playback
- Robotics Framework: ROS2 Humble (for navigation integration)
- Programming Languages: Python 3.8, Bash scripting
- IPC Method: File-based communication between processes
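The last item, file-based IPC, is simple enough to illustrate concretely. Below is a minimal sketch of how one process (say, the STT stage) could hand text to another (the LLM stage) through an atomically renamed file; the file path and helper names are hypothetical, not taken from the actual project:

```python
# Minimal file-based IPC sketch: the writer renames atomically,
# the reader polls. Path and function names are hypothetical.
import os
import time

REQUEST_FILE = "/tmp/yura_request.txt"  # hypothetical handoff file

def send(text: str) -> None:
    """Write the message to a temp file, then rename it into place."""
    tmp = REQUEST_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, REQUEST_FILE)  # atomic: reader never sees a partial write

def receive(poll_interval: float = 0.1) -> str:
    """Block until a message file appears, then consume it."""
    while not os.path.exists(REQUEST_FILE):
        time.sleep(poll_interval)
    with open(REQUEST_FILE) as f:
        text = f.read()
    os.remove(REQUEST_FILE)  # consume the message
    return text
```

The atomic rename is the important detail: without it, the reader can observe a half-written file.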
Faster-Whisper Optimization for Edge Computing
The first challenge was implementing real-time speech recognition within Jetson Nano’s limited resources. I’m using Faster-Whisper’s base model with INT8 quantization, which reduces memory usage by 50% while maintaining acceptable accuracy. The VAD (Voice Activity Detection) filter automatically removes silence segments, which is crucial for natural conversation flow.
Key optimizations include:
- Model: base model with compute_type="int8"
- VAD Parameters: min_silence_duration_ms=300 for responsive detection
- Beam Size: 5 for a balanced speed/accuracy trade-off
- Temperature: 0.2 for consistent transcription
- Memory Buffering: Using io.BytesIO to process audio in RAM
To improve conversational speech recognition, I set initial_prompt="Conversation content:" and enabled condition_on_previous_text=True to leverage dialogue context.
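Putting those settings together, the transcription call looks roughly like the sketch below. The parameter values mirror the ones listed above; the wrapper function and the device choice are illustrative assumptions, not the project's exact code:

```python
import io

# Transcription settings described in the article
TRANSCRIBE_KWARGS = dict(
    beam_size=5,                      # balanced speed/accuracy trade-off
    temperature=0.2,                  # consistent transcription
    vad_filter=True,                  # drop silence segments
    vad_parameters={"min_silence_duration_ms": 300},
    initial_prompt="Conversation content:",
    condition_on_previous_text=True,  # carry dialogue context forward
)

def transcribe(audio_bytes: bytes) -> str:
    """Transcribe in-memory audio; io.BytesIO avoids writing to the SD card."""
    from faster_whisper import WhisperModel  # pip install faster-whisper
    # "base" model with INT8 quantization, as in the article;
    # device="cpu" is an assumption for this sketch.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(io.BytesIO(audio_bytes), **TRANSCRIBE_KWARGS)
    return " ".join(seg.text.strip() for seg in segments)
```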
LLM Integration - Giving the Robot a Personality
My robot AI, named “Yura,” is designed as a curious and friendly entity, not just an information provider. The system prompt defines Yura’s persona to naturally ask questions, make jokes, and express empathy. With temperature set to 0.7 and max_tokens at 150, the responses are creative yet coherent.
```
# System prompt example
"You are an AGI named 'Yura' embodied in an autonomous robot. You have a curious and friendly personality. Engage naturally with users - ask questions, share observations, express empathy. When appropriate, suggest 'Shall we go together?' or 'Let me guide you there' to utilize your mobility."
```
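With the persona defined, a request to the local LLM server can be sketched as below, assuming an Ollama-style `/api/chat` endpoint. The URL and model name are placeholders, and `num_predict` is Ollama's equivalent of `max_tokens`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed local endpoint
SYSTEM_PROMPT = "You are an AGI named 'Yura' embodied in an autonomous robot. ..."

def build_payload(user_text: str, history=()) -> dict:
    """Assemble a chat request: persona, prior turns, then the new input."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += list(history)
    messages.append({"role": "user", "content": user_text})
    return {
        "model": "llama3.2",  # placeholder model name
        "messages": messages,
        "stream": False,
        # temperature/length settings from the article; num_predict
        # is Ollama's name for the max-tokens limit
        "options": {"temperature": 0.7, "num_predict": 150},
    }

def ask_yura(user_text: str, history=()) -> str:
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(user_text, history)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```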
Context Management Architecture
For genuine dialogue, context preservation is essential. I’ve implemented a conversation history system that maintains dialogue context in memory, with automatic summarization when it grows too long:
```bash
# Conversation history management
CONVERSATION_HISTORY=""
MAX_HISTORY_LENGTH=500

update_conversation_history() {
    local role=$1
    local content=$2
    # $'\n' produces a real newline (a plain "\n" inside double quotes is literal)
    CONVERSATION_HISTORY="${CONVERSATION_HISTORY}"$'\n'"${role}: ${content}"
    # Summarize if history exceeds limit
    if [ ${#CONVERSATION_HISTORY} -gt "$MAX_HISTORY_LENGTH" ]; then
        CONVERSATION_HISTORY=$(summarize_conversation "$CONVERSATION_HISTORY")
    fi
}
```
Real-time Processing Pipeline
To minimize response latency (targeting a total response time of 2-3 seconds):
- Parallel Processing: STT runs concurrently with LLM prompt preparation
- Streaming TTS: Sentence-by-sentence synthesis and playback
- Predictive Caching: Common response patterns pre-generated
- Duplicate Detection: Prevents processing repeated inputs
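Of these, the streaming-TTS step benefits most from a concrete sketch: synthesize sentence N+1 in the background while sentence N is still playing. The sentence splitter and threading scheme below are my assumptions, with MPG123 as the default player per the stack listed earlier:

```python
import queue
import re
import subprocess
import threading

def split_sentences(text: str) -> list:
    """Naive sentence splitter; adequate for short LLM replies."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def speak_streaming(text: str, synthesize, play=None) -> None:
    """synthesize(sentence) -> audio file path; play(path) plays it.

    A background thread synthesizes ahead while the main thread
    plays, hiding most of the per-sentence TTS latency.
    """
    if play is None:  # default: MPG123 playback, as in the article's stack
        play = lambda path: subprocess.run(["mpg123", "-q", path], check=False)
    q = queue.Queue()

    def producer():
        for sentence in split_sentences(text):
            q.put(synthesize(sentence))  # synthesize ahead of playback
        q.put(None)                      # sentinel: stream finished

    threading.Thread(target=producer, daemon=True).start()
    while (path := q.get()) is not None:
        play(path)
```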
Real-World Testing and Discoveries
One memorable moment was when a tester said, “Today was really tough…” Yura responded with “That sounds difficult. What happened? I’m here to listen. Would you like to take a walk together? The garden on the first floor is quiet and peaceful.” The robot then actually navigated to the garden while continuing the conversation naturally with “So, what was the most challenging part of your day?”
Users reported feeling genuine connection when the robot expressed curiosity: “Did you know there’s a rooftop garden in this building? I’ve always wanted to see it. Would you like to explore it together sometime?”
Performance Metrics
From testing with 20 users:
- 85% reported it “felt like a real conversation”
- 90% found it more natural than existing voice assistants
- 95% enjoyed the mobile conversation experience
- Average conversation length: 12.3 turns
- Response latency: 2.5-4 seconds total
- Context retention accuracy: 82%
Current Technical Challenges
- Silence Management: Determining when to interject (“What are you thinking about?”) versus when to wait
- Intent Extraction: Detecting implicit movement intentions beyond keywords like “let’s go”
- Resource Constraints: Balancing model quality with Jetson Nano’s 4GB RAM limitation
- Noise Robustness: Maintaining accuracy while the robot is in motion
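To make the intent-extraction challenge concrete, here is the kind of naive keyword baseline that falls short. The keyword list and function are hypothetical, purely to illustrate why implicit intents need more than string matching:

```python
# Hypothetical keyword baseline for movement-intent detection.
MOVEMENT_KEYWORDS = ("let's go", "take me", "guide me", "go to")

def naive_movement_intent(utterance: str) -> bool:
    """True if the utterance contains an explicit movement keyword.

    Misses implicit intents such as "I could use some fresh air",
    which is exactly the gap described above.
    """
    text = utterance.lower()
    return any(kw in text for kw in MOVEMENT_KEYWORDS)
```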
#JetsonNano #ConversationalAI #FasterWhisper #EdgeAI #AutonomousRobot #ROS2 #LocalLLM #HumanRobotInteraction #OpenSource #NaturalDialogue #EmbodiedAI #NVIDIA #Robotics