I am at stage 0 (research) of figuring out how many audio streams a Jetson TX2 could transcribe to text simultaneously. I'm assuming compute performance will be what caps the number, but I cannot find any benchmarks for common speech recognition algorithms like the ones given for image recognition algorithms on page 18 here: https://images.nvidia.com/content/pdf/inference-technical-overview.pdf
Has this been done, and if so, can somebody point me to the work?
Is there any reason why simultaneous streams could not be transcribed in parallel?
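To make the question concrete, here is the rough back-of-envelope estimate I have in mind. It is only a sketch: it assumes throughput scales linearly with the number of streams (which batching, memory limits, or I/O may well break), and the function name, the `headroom` parameter, and the numbers in the example are placeholders I made up, not measurements from a TX2.

```python
import math


def max_realtime_streams(processing_seconds: float,
                         audio_seconds: float,
                         headroom: float = 0.8) -> int:
    """Estimate how many streams one device could transcribe in real time.

    The real-time factor (RTF) is processing time divided by audio duration.
    A single stream keeps up only if RTF <= 1, so under a linear-scaling
    assumption roughly 1 / RTF streams fit. `headroom` (hypothetical) is the
    fraction of the device I would be willing to load.
    """
    if processing_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("durations must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    # headroom / RTF, rearranged to headroom * audio / processing
    return math.floor(headroom * audio_seconds / processing_seconds)


if __name__ == "__main__":
    # Placeholder numbers only: 2.5 s of compute per 10 s of audio.
    print(max_realtime_streams(processing_seconds=2.5, audio_seconds=10.0))
```

What I am missing is the measured processing time per second of audio for a typical speech model on this hardware, which is exactly what a benchmark would give me.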
Thanks!