"Unable to create TensorRT engine" when loading models in riva-speech:1.7.0-beta-server

We are running a few fine-tuned Citrinet models with riva-speech:1.7.0-beta-server in k8s, and we've been unable to load the models in the Riva-speech server after modifying the riva-build and riva-deploy pipeline - the speech server fails with this error: "Unable to create TensorRT engine".

Some context on riva-build and riva-deploy pipeline:

Originally, we ran the riva-build and riva-deploy commands from riva-speech:1.7.0-beta-servicemaker on a separate VM and then copied the exported models over to where the Riva-speech server running in k8s could access them.
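For context, the VM workflow was essentially the standard ServiceMaker flow, roughly like this (a sketch only; the paths, model names, and the encryption key are placeholders for our actual values):

    # Start the ServiceMaker container on the VM, mounting the fine-tuned .riva model
    # and a directory for the generated Triton model repository
    docker run --gpus all --rm -it \
        -v /path/to/artifacts:/servicemaker-dev \
        -v /path/to/model_repository:/data \
        nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-servicemaker

    # Inside the container: build the intermediate .rmir, then deploy it,
    # which is the step that generates the TensorRT engines
    riva-build speech_recognition \
        /servicemaker-dev/custom-citrinet.rmir:<key> \
        /servicemaker-dev/custom-citrinet.riva:<key>

    riva-deploy /servicemaker-dev/custom-citrinet.rmir:<key> /data/models

The resulting /data/models repository is what we copied over for the Riva-speech server in k8s to load.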

Recently we re-created the riva-build and riva-deploy parts of this as a Kubeflow pipeline, and since then the Riva-speech server has been unable to load the models, failing with this error:

E1125 00:08:13.321963 22 model_repository_manager.cc:1215] failed to load ‘riva-trt-custom-model-am-streaming-offline’ version 1: Internal: unable to create TensorRT engine

The only warnings in the riva-deploy logs in Kubeflow that are not present in the riva-deploy logs on the VM are these:

[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: Convolution + generic activation fusion is disable due to incompatible driver or nvrtc

GPU, OS and Riva version are the same everywhere:

Hardware - GPU: T4
Operating System - Ubuntu 20.04
Riva Version: 1.7.0

There's some difference in driver versions, but it doesn't exactly explain why the output of setup 2 loads in setup 1 while the output of setup 3 does not (a quick way to confirm these versions is sketched after the list):

  1. Riva-speech server in k8s:
     NVIDIA-SMI 450.119.04    Driver Version: 450.119.04    CUDA Version: 11.4

  2. VM with riva-build & riva-deploy (output works in the Riva-speech server):
     NVIDIA-SMI 495.29.05     Driver Version: 495.29.05     CUDA Version: 11.5

  3. Kubeflow pipeline with riva-build and riva-deploy (output doesn't work in the Riva-speech server):
     NVIDIA-SMI 450.119.04    Driver Version: 450.119.04    CUDA Version: 11.4
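In case it helps anyone reproduce this: the versions above were taken straight from nvidia-smi in each environment. A quick way to grab just the relevant fields (run inside the ServiceMaker container, the Kubeflow step, or the server pod):

    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    nvcc --version    # only if the CUDA toolkit is present in that image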

It has been suggested that we try to upgrade the CUDA version in the k8s cluster where Kubeflow is running. Unfortunately, according to the GKE documentation ("Run GPUs in GKE Standard node pools", Google Kubernetes Engine), CUDA 11.0 is the latest supported version.

What I did instead was to set up another VM with an earlier NVIDIA driver version (460.91.03) and run riva-build and riva-deploy there, and here are the results:

  1. The riva-deploy pipeline still shows the incompatible-driver warnings, same as when we run it in Kubeflow:

[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: Convolution + generic activation fusion is disable due to incompatible driver or nvrtc

  2. The model exported on this VM works in the Riva-speech server, despite the differences in CUDA versions.

I've also noticed that the model that doesn't work in the Riva-speech server is twice as big (543 MB) as the one that works (274 MB).

To sum it up:

  1. riva-build/deploy with CUDA 11.5 and 495 drivers on a VM → works in the Riva speech server w/ CUDA 11.4 and 450 drivers

  2. riva-build/deploy with CUDA 11.4 and 460 drivers on a VM → works in the Riva speech server w/ CUDA 11.4 and 450 drivers

  3. riva-build/deploy in Kubeflow, with CUDA 11.4 and 450 drivers (running on k8s nodes that only support CUDA 11.0) → does NOT work in the Riva speech server w/ CUDA 11.4 and 450 drivers

  4. The latest supported CUDA version on our GKE nodes is 11.0 - however, the Riva-speech server running in k8s has no problem loading models exported on VMs with newer driver versions (460 and 495).

Is there anything you could recommend we do to make it work in k8s & Kubeflow with CUDA 11.0?

Hi @darya.trofimova,

I think the issue above is due to an older NVIDIA driver version. GKE node images currently use NVIDIA driver version 450.119.04, which is compatible with CUDA 11.0. (The latest supported CUDA version is 11.0 on both COS (1.18.6-gke.3504+) and Ubuntu (1.19.8-gke.1200+).)

Please refer to the application considerations section in the link below:
CUDA Compatibility :: NVIDIA Data Center GPU Driver Documentation

Maybe you can try deploying the solution to an AWS EKS cluster. You can refer to the Riva doc below for more details:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/rivaasreks.html#

AWS also has the option of using the latest NVIDIA driver versions.

While executing step 4 in the section below:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/rivaasreks.html#defining-and-launching-the-eks-cluster
please use the following helm install command instead. It's a known documentation issue, which we are trying to correct as soon as possible:

helm install --namespace riva riva . \
    --set ngcCredentials.password=`echo -n $NGC_API_KEY | base64 -w0` \
    --set modelRepoGenerator.modelDeployKey=`echo -n tlt_encode | base64 -w0` \
    --set riva.speechServices.asr=true \
    --set riva.speechServices.tts=true \
    --set riva.speechServices.nlp=true
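After the install, a quick way to confirm that the deployment comes up (the namespace matches the command above; the pod name is a placeholder):

    kubectl get pods -n riva
    kubectl logs -n riva <riva-api-pod>    # check that the models load without the TensorRT error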

I hope this helps you proceed with the deployment of Riva on a Kubernetes cluster.

Thanks

Hi @darya.trofimova,

Reconsidering the error mentioned above, it seems the error might be due to a mismatch between the TRT version used to generate the model after fine-tuning and the TRT version used in the Triton server on the backend.

Could you please check the TRT version used in the server backend, and update the TRT version used in the TAO conversion so that the regenerated model engine is compatible?
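One quick way to check the version inside the running server container (a sketch; the namespace and pod name are placeholders for your deployment):

    # the server image is Ubuntu-based, so dpkg lists the installed TensorRT packages
    kubectl exec -n <namespace> <riva-server-pod> -- bash -c "dpkg -l | grep -i libnvinfer"
    # or, if the TensorRT Python bindings are present in the image:
    kubectl exec -n <namespace> <riva-server-pod> -- python3 -c "import tensorrt; print(tensorrt.__version__)"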

As per the Riva software compatibility matrix, TensorRT 8.0.1.6 is used in the backend service:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html#id15

Maybe the TAO Toolkit 3.0-21.11 for x86 + GPU package (CUDA 11.3 / cuDNN 8.1 / TensorRT 8.0) can be used here.

Based on the version deployed at your end, please choose the matching TAO Toolkit version accordingly and let us know in case the issue persists.

Regards,
Sunil Kumar

Hi @SunilJB,

I've tried matching TensorRT versions everywhere as you suggested.

Original TensorRT versions are as follows:

  1. TAO fine-tuning container: 7.2.3-1
  2. Riva 1.7 ServiceMaker (riva-build/deploy pipeline): 8.0.1-1
  3. Riva 1.7 server (running in k8s): 8.0.1-1

The solution that I’ve tried looked like this:

  1. Upgrade TRT in the TAO container from 7.x to 8.x (a rough sketch of such an upgrade is below)
  2. Run Riva ServiceMaker (the build/deploy pipeline) with TRT 8.x
  3. Load the models in the Riva server with TRT 8.x
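For reference, an upgrade like step 1 amounts to swapping the TensorRT debs inside the container, roughly like this (a sketch only, not necessarily exactly what we ran; the package names follow the public TensorRT 8.x Ubuntu packages and assume the NVIDIA apt repository that carries them is configured in the image):

    # remove the TensorRT 7 packages that ship with the TAO container
    apt-get update
    apt-get remove -y libnvinfer7 libnvinfer-plugin7 python3-libnvinfer

    # install the TensorRT 8 equivalents (exact versions depend on the configured repo)
    apt-get install -y libnvinfer8 libnvinfer-plugin8 libnvonnxparsers8 libnvparsers8 python3-libnvinfer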

Even with all three stages on TRT 8.x, this still results in the same error as before: "Unable to create TensorRT engine".

I’ve also tried:

  • downgrading the TRT version in Riva ServiceMaker / the Riva server, but that didn't work out: these docker images are based on Ubuntu 20.04, and I was not able to find a TRT 7.x build compatible with Ubuntu 20.04 (a quick way to check what's available is sketched after this list).
  • running everything with Riva 1.8 (that also didn't help).
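One quick way to see which TensorRT debs are actually available for the image's Ubuntu release (assuming the NVIDIA apt repository is configured inside the ServiceMaker container; libnvinfer is the public TensorRT package name):

    apt-get update
    apt-cache madison libnvinfer7 libnvinfer8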

riva-deploy uses TensorRT to create a plan file optimized for the GPU present on the system it is executed against. If it was working before outside of the Kubeflow pipeline, as you said, and it isn't working in your pipeline now, AND you are getting GPU and driver compatibility warnings, this could be a configuration/version issue with Kubeflow. The TAO fine-tuning step shouldn't impact this part of the process.

Which version of TRT did you use to produce the model? The version of TRT in Triton needs to match the TRT version used to produce the model, which is usually the source of this error:
E1125 00:08:13.321963 22 model_repository_manager.cc:1215] failed to load ‘riva-trt-custom-model-am-streaming-offline’ version 1: Internal: unable to create TensorRT engine
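A quick way to confirm a version or GPU mismatch outside of Triton is to try deserializing the generated plan with trtexec from the serving image (the path below is a placeholder that follows the Triton repository layout riva-deploy writes):

    # deserialization fails here too if the plan was serialized with a different
    # TensorRT version or built for a different GPU
    trtexec --loadEngine=/data/models/riva-trt-custom-model-am-streaming-offline/1/model.plan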