During the tests, the cluster was running either Riva or Triton (not both at once).
We measured the TTS processing time for the following phrase:
Hello Mr. Yellow, Let’s go for a boat ride to Holland’s Cove
Service Name | Avg. Time | Model Used
Riva         | 1.64s     | Riva deployed with default models from NGC
Triton       | 1.84s     | Triton deployed with custom models
The tests were performed from inside the same AWS data center, so the network latency was minimal (around 0.01s).
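For reference, this is roughly how we timed each request; the `synthesize` function below is a hypothetical stand-in for our actual gRPC client call, shown only to illustrate the measurement method:

```python
import statistics
import time

def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for the actual Riva gRPC TTS call."""
    time.sleep(0.01)  # placeholder for network + server processing time
    return b""        # would be the returned audio bytes

def measure_latency(text: str, runs: int = 20) -> dict:
    """Time repeated synthesis requests and report summary statistics."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append(time.perf_counter() - start)
    return {
        "avg": statistics.mean(samples),
        "p95": sorted(samples)[int(0.95 * (len(samples) - 1))],
        "max": max(samples),
    }

stats = measure_latency("Hello Mr. Yellow, Let's go for a boat ride to Holland's Cove")
print(stats)
```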
It would be helpful if you could let us know:
if there is any standard tool to perform the performance tests
if it’s recommended to use industry-standard tools like JMeter for performance testing; from https://docs.nvidia.com/deeplearning/riva/user-guide/docs/performance.html it looks like NVIDIA is using a custom tool to measure performance, and we could not find the corresponding tool for Riva and Triton
the recommended node type, including GPU model, for performance testing
whether there is any way to debug the step-wise time taken by the TTS engine
It’s not clear how you are set up; Riva runs a Triton instance in the same container, so you should not need to run Triton separately. If you are running Triton directly with a custom model that isn’t from Riva, you should ask in a Triton-specific forum for the best guidance.
if there is any standard tool to perform the performance tests
The client Docker image (nvcr.io/nvidia/riva/riva-speech-client:1.9.0-beta) includes riva_tts_perf_client; this is how we generated our performance numbers.
if it’s recommended to use industry-standard tools like JMeter for performance testing; from Performance — NVIDIA Riva Speech Skills v1.9.0-beta documentation it looks like NVIDIA is using a custom tool to measure performance, and we could not find the corresponding tool for Riva and Triton
We use the tts perf client. What are you trying to do?
the recommended node type, including GPU model, for performance testing
This depends on workload and goals; without knowing more, I’d suggest using an A100-based instance if you are purely interested in performance.
is there any way to debug the step-wise time taken by the TTS engine
Triton has some tooling that may help here, since it looks like you have a Triton model and a Riva model and they are different. For Riva specifically, it’s an end-to-end deployment based on how the pipeline is deployed. Can you help me understand which path you are looking at, and why / what you would do with this data?
@sjunkin Our use case is a mobile application requesting audio from a TTS service (like AWS Polly) for the phrases the application needs to speak. We want to understand how to size our servers for the concurrency we expect to achieve.
So far we have tested with the following 2 setups:
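As a back-of-envelope illustration of the sizing question we are asking, a Little's-law sketch (the 50 req/s target and the 8-streams-per-GPU figure below are hypothetical assumptions, not measurements; only the 1.64 s latency is from our test):

```python
import math

def gpus_needed(target_rps: float, avg_latency_s: float,
                max_concurrent_per_gpu: int) -> int:
    """Little's law: in-flight requests = arrival rate x average latency.
    Divide by what one GPU can sustain to estimate the node count."""
    in_flight = target_rps * avg_latency_s
    return math.ceil(in_flight / max_concurrent_per_gpu)

# Hypothetical example: 50 requests/s at the observed 1.64 s average latency,
# assuming one T4 can sustain 8 concurrent streams (to be verified by benchmark).
print(gpus_needed(50, 1.64, 8))  # -> 11
```

This is exactly the kind of estimate we would like to validate with a recommended benchmarking approach.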
RIVA on AWS EKS - Riva deployed with default models from NGC
Triton on AWS EKS - Triton deployed with custom models
They are independent and not related. We can focus on the RIVA queries in this forum question.
We deployed the RIVA service on AWS EKS and tried to access the service using a GRPC client (python script and also jMeter).
We understand you used riva_tts_perf_client for your benchmarks from within the container; noted. Is there documentation for this client?
Our benchmark was to see the response time of the RIVA service. We used the phrase “Hello Mr. Yellow, Let’s go for a boat ride to Holland’s Cove” against the RIVA service deployed on a 1xT4 Tensor Core GPU (other specs above), which took 1.64 seconds to get a response, as shared above. We want to understand our observations and get your feedback on the expected performance. Taking 1.64 seconds for our simple test case looks like extremely sub-optimal performance. What can be done to understand the problem and identify a solution?
While we understand the A100 performs better, we are trying to benchmark the simplest scenario and size based on that. So, for our test case above, what is the expected performance from a 1xT4-based machine?
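One way we have been framing whether 1.64 s is reasonable is the real-time factor (processing time divided by the duration of the synthesized audio); the 4.0 s audio duration below is a hypothetical placeholder, since we have not measured the clip length:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1 means synthesis completes faster than playback of the audio."""
    return processing_s / audio_s

# Measured end-to-end time was 1.64 s; the 4.0 s audio duration is a
# hypothetical placeholder for the actual length of the synthesized clip.
print(round(real_time_factor(1.64, 4.0), 2))  # -> 0.41
```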
RIVA is our preferred option to deploy; we only turned to Triton to understand the GPU capacity for a TTS service.