Streaming audio inference issue for audio2gesture

Hi there,
I am streaming audio to machinima using audio2gesture (similar to the streaming example for audio2face), and I found that if the audio is long enough (in my case, longer than 20 seconds), there can be a pause in the action during playback, as shown in the following video.