@siyu_ok I would use Agent Studio to set up the pipeline, visually inspect what is happening, and independently test the ASR, LLM, and TTS. You can also manually run some of the tests under nano_llm/test to confirm the ASR and TTS functionality first.
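For example, inside the NanoLLM container (the studio module path follows the NanoLLM docs; the test filenames under nano_llm/test vary between versions, so list the directory first and substitute a real name):

# launch Agent Studio to build and inspect the pipeline in the browser
python3 -m nano_llm.studio

# list the standalone tests, then run one as a module, e.g.
# python3 -m nano_llm.test.<name>
ls /opt/NanoLLM/nano_llm/test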
@dusty_nv Thank you for the information! I modified the voice_chat pipeline, and now it works. But an exception is raised after a few conversations:
Exception in thread Thread-2 (_run):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 529, in _run
    self._generate(stream)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 507, in _generate
    prefill(self.embed_tokens([self.tokenizer.eos_token_id], return_tensors='tvm'), stream.kv_cache)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 283, in embed_tokens
    raise RuntimeError(f"{self.config.name} does not have embed() in {self.module_path}")
RuntimeError: phi-2 does not have embed() in /data/models/mlc/dist/phi-2-ctx2048/phi-2-q4f16_ft/phi-2-q4f16_ft-cuda.so
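For reference, whether that compiled .so actually exports an embed() function can be checked directly with TVM. A hedged sketch (tvm.runtime.load_module is the standard TVM module loader; the path is copied from the traceback above):

import tvm

# load the compiled MLC model library named in the error message
lib = tvm.runtime.load_module(
    "/data/models/mlc/dist/phi-2-ctx2048/phi-2-q4f16_ft/phi-2-q4f16_ft-cuda.so")

# get_function() raises AttributeError if the symbol is absent;
# query_imports=True also searches the module's imported submodules
try:
    lib.get_function("embed", query_imports=True)
    print("embed() is present")
except AttributeError:
    print("embed() is missing from this compiled model")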
Hi @siyu_ok, does this only occur after the chat history fills up, or does it happen with a fresh chat too? Can you try changing --max-context-len to see if that alters the behavior?
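For example, something like this (a hedged invocation; substitute your model and check --help for the exact flags your build supports):

python3 -m nano_llm.agents.voice_chat \
    --api mlc \
    --model microsoft/phi-2 \
    --max-context-len 512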
@dusty_nv the same issue occurred with --max-context-len=512.
I then tested Llama-3-8B-Instruct, and it worked well with no errors, so I think it's probably a model-related problem.
OK gotcha @siyu_ok, thanks for letting me know. In that case, you might want to try it with a different LLM backend (like --api=hf). I have been meaning to upgrade the version of MLC/TVM this uses to pick up the latest fixes.
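For example, the same voice_chat launch as before, just pointed at the HF backend (again hedged; check your build's --help):

python3 -m nano_llm.agents.voice_chat \
    --api hf \
    --model microsoft/phi-2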