Riva and Triton performance testing recommended process

Hardware - GPU (T4)
Hardware - CPU
Operating System - Amazon Linux 2
Riva Version - 1.9.0-beta
TLT Version (if relevant)

Currently we are deploying Riva and the Triton Inference Server separately on AWS EKS, on g4dn.xlarge nodes.


During the tests, the cluster was running either Riva or Triton.
We observed the TTS processing time for the following phrase:

Hello Mr. Yellow, Let’s go for a boat ride to Holland’s Cove

Service Name   Avg. Time   Model Used
Riva           1.64s       Riva deployed with default models from NGC
Triton         1.84s       Triton deployed with custom models

The tests were performed from inside the same AWS data center, so network latency was minimal (around 0.01s).

It would be helpful if you could let us know:

  • if there are any standard tools to perform these performance tests
  • if it is recommended to use industry-standard tools such as JMeter for performance testing. From https://docs.nvidia.com/deeplearning/riva/user-guide/docs/performance.html it appears NVIDIA uses a custom tool to measure performance, but we could not find the corresponding tool for Riva and Triton.
  • the recommended node type, including GPU model, for performance testing
  • if there is any way to debug the step-wise time taken by the TTS engine

It is not clear how you are set up; Riva runs a Triton instance in the same container, so you should not need to run Triton separately. If you are running Triton directly with a custom model that isn't Riva, you should ask in a Triton-specific forum for the best guidance.

We use the TTS perf client. What are you trying to do?

  • the recommended node type, including GPU model, for performance testing

This depends on your workload and goals; without knowing more, I'd suggest using an A100-based instance if you are purely interested in performance.

  • if there is any way to debug the step-wise time taken by the TTS engine

Triton has some tooling that may help here, since it looks like you have a Triton model and a Riva model and they are different. For Riva specifically, it's an end-to-end deployment based on how the pipeline is configured. Can you help me understand which path you are looking at, and why / what you would do with this data?
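
If you do want a per-model breakdown, Triton's statistics endpoint reports queue and compute time for each loaded model. A minimal sketch with the tritonclient Python package, assuming the Triton gRPC port (8001) inside the Riva container is reachable from where you run it (whether it is exposed depends on your EKS service configuration):

```python
# Sketch only: query Triton's per-model statistics to see where time is spent
# across the models that make up the TTS pipeline.
# Assumes the Triton gRPC endpoint is reachable on localhost:8001.
import tritonclient.grpc as grpcclient   # pip install tritonclient[grpc]

client = grpcclient.InferenceServerClient(url="localhost:8001")

# An empty model name returns statistics for every loaded model.
stats = client.get_inference_statistics(model_name="")

for model in stats.model_stats:
    s = model.inference_stats
    print(f"{model.name}: "
          f"count={model.inference_count} "
          f"queue_ns={s.queue.ns} "
          f"compute_infer_ns={s.compute_infer.ns}")
```

The durations are cumulative nanoseconds across requests, so divide by the request count to get per-request averages.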

@sjunkin Our use case is a mobile application that requests audio from a TTS service (like AWS Polly) for the phrases the application needs to speak. We want to understand how to size and estimate our servers for the level of concurrency we want to achieve.

So far we have used the 2 setups below for testing.

  1. RIVA on AWS EKS - Riva deployed with default models from NGC
  2. Triton on AWS EKS - Triton deployed with custom models

They are independent and not related; we can focus on the Riva queries in this forum question.
We deployed the Riva service on AWS EKS and accessed it using a gRPC client (a Python script and also JMeter).
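
For reference, a simplified sketch of the request/timing flow we have in mind is below (shown with the nvidia-riva-client Python wrapper for brevity; the endpoint, voice name and sample rate are placeholders, not our actual values):

```python
# Minimal sketch: time a single Riva TTS request over gRPC.
# Uses the nvidia-riva-client package (pip install nvidia-riva-client);
# the endpoint, voice name and sample rate below are placeholders.
import time
import riva.client

auth = riva.client.Auth(uri="riva.example.internal:50051")  # placeholder endpoint
tts = riva.client.SpeechSynthesisService(auth)

text = "Hello Mr. Yellow, Let's go for a boat ride to Holland's Cove"

start = time.perf_counter()
resp = tts.synthesize(
    text,
    voice_name="English-US.Female-1",   # placeholder voice
    language_code="en-US",
    sample_rate_hz=44100,
)
elapsed = time.perf_counter() - start

print(f"synthesis time: {elapsed:.2f}s, audio bytes: {len(resp.audio)}")
```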

  1. We understand you have used riva_tts_perf_client to do your benchmarks from within the container; noted. Is there documentation for this client?

  2. Our benchmark was to measure the response time of the Riva service. We used the phrase “Hello Mr. Yellow, Let’s go for a boat ride to Holland’s Cove” against the Riva service deployed on 1x T4 Tensor Core GPU (and the other specs above), which took 1.64 seconds to return a response, as shared above. We want to understand this observation and get your feedback on the expected performance; 1.64 seconds for such a simple test case looks extremely sub-optimal. What can be done to understand the problem and identify a solution?

  3. While we understand the A100 performs better, we are trying to benchmark the simplest scenario and size based on that (see the concurrency sketch after this list). So, on our test case above, what performance is expected from a 1x T4-based machine?

  4. Riva is our preferred option to deploy; we only turned to Triton to understand the GPU capacity for a TTS service.
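
As mentioned in point 3, here is a sketch of the kind of concurrency measurement we intend to run for sizing (again, the endpoint, voice name and concurrency values are placeholders, and a real test would sweep the worker count):

```python
# Sketch: fire N concurrent TTS requests and report latency percentiles.
# Endpoint, voice name and concurrency level are placeholders.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import riva.client   # pip install nvidia-riva-client

URI = "riva.example.internal:50051"   # placeholder endpoint
CONCURRENCY = 8                        # sweep this in a real test
REQUESTS = 64
TEXT = "Hello Mr. Yellow, Let's go for a boat ride to Holland's Cove"

def one_request(_):
    # A fresh channel per request keeps the sketch simple; a real harness
    # would reuse one channel per worker.
    tts = riva.client.SpeechSynthesisService(riva.client.Auth(uri=URI))
    start = time.perf_counter()
    tts.synthesize(TEXT, voice_name="English-US.Female-1",
                   language_code="en-US", sample_rate_hz=44100)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(REQUESTS)))

print(f"p50 {statistics.median(latencies):.2f}s  "
      f"p95 {latencies[int(0.95 * len(latencies))]:.2f}s")
```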