We are running a few fine-tuned Citrinet models with riva-speech:1.7.0-beta-server in k8s, and we have been unable to load the models in the Riva-speech server since modifying our riva-build and riva-deploy pipeline. The speech server fails with the error: "Unable to create TensorRT engine".
Some context on the riva-build and riva-deploy pipeline:
Originally, we ran the riva-build and riva-deploy commands of riva-speech:1.7.0-beta-servicemaker on a separate VM and then copied the exported models to a location that the Riva-speech server running in k8s could access.
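For reference, the build/deploy flow in both environments looks roughly like the sketch below. The paths, model name, and encryption key are placeholders, and the exact riva-build flags differ for our streaming/offline configs; the commands are shown as a dry run here:

```shell
# Hypothetical paths and encryption key, for illustration only.
RIVA_FILE=/servicemaker-dev/custom-model-am.riva
RMIR_FILE=/servicemaker-dev/custom-model-am.rmir
KEY=tlt_encode

# Step 1: riva-build packages the fine-tuned checkpoint into an .rmir archive.
BUILD_CMD="riva-build speech_recognition ${RMIR_FILE}:${KEY} ${RIVA_FILE}:${KEY} --offline"

# Step 2: riva-deploy compiles TensorRT engines for the *local* GPU into the
# Triton model repository that the Riva-speech server loads.
DEPLOY_CMD="riva-deploy -f ${RMIR_FILE}:${KEY} /data/models"

# Dry run: print the commands instead of executing them here.
echo "$BUILD_CMD"
echo "$DEPLOY_CMD"
```

The relevant point is that riva-deploy builds the TensorRT engine with whatever CUDA/TensorRT stack is available at deploy time, so the emitted engine is tied to the environment it was built in.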
Recently we re-created the riva-build and riva-deploy parts of this as a Kubeflow pipeline, and since then the Riva-speech server has been unable to load the models, failing with this error:
E1125 00:08:13.321963 22 model_repository_manager.cc:1215] failed to load 'riva-trt-custom-model-am-streaming-offline' version 1: Internal: unable to create TensorRT engine
The only warnings that appear in the riva-deploy logs in Kubeflow but not in the riva-deploy logs on the VM are these:
[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: Convolution + generic activation fusion is disable due to incompatible driver or nvrtc
GPU, OS and Riva version are the same everywhere:
Hardware - GPU: T4
Operating System - Ubuntu 20.04
Riva Version: 1.7.0
There are some differences in driver versions, but they don't by themselves explain why two of the setups below produce working models and the third does not:
- Riva-speech server in k8s:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+

- VM with riva-build & riva-deploy (output works in Riva-speech server):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+

- Kubeflow pipeline with riva-build and riva-deploy (output doesn't work in Riva-speech server):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
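To compare the three environments without eyeballing the full nvidia-smi tables, the driver and CUDA versions can be pulled out of the banner line with sed. The banner line below is hard-coded from the k8s nodes for illustration:

```shell
# Sample nvidia-smi banner line (from the k8s nodes).
# In a live environment this would be: nvidia-smi | sed -n 3p
banner='| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.4 |'

# Extract the driver and CUDA versions with sed capture groups.
driver=$(printf '%s\n' "$banner" | sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p')
cuda=$(printf '%s\n' "$banner" | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p')

echo "driver=$driver cuda=$cuda"
```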
It has been suggested that we downgrade the CUDA version in the k8s cluster where Kubeflow is running. Unfortunately, according to the documentation, CUDA 11.0 is the latest supported version (Run GPUs in GKE Standard node pools | Google Kubernetes Engine (GKE) | Google Cloud).
What I did instead was set up another VM with an earlier driver version (460.91.03) and run riva-build and riva-deploy there. Here are the results:
- The riva-deploy pipeline still shows the incompatible-driver warnings, same as when we run it in Kubeflow:
[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: Convolution + generic activation fusion is disable due to incompatible driver or nvrtc
- The model exported on this VM works in the Riva-speech server, despite the differences in CUDA versions.
I've also noticed that the model that doesn't work in the Riva-speech server is twice as big (543 MB) as the one that works (274 MB).
To sum it up:

- riva-build/deploy with CUDA 11.5 and 495 drivers on a VM → works in the Riva-speech server with CUDA 11.4 and 450 drivers
- riva-build/deploy with CUDA 11.4 and 460 drivers on a VM → works in the Riva-speech server with CUDA 11.4 and 450 drivers
- riva-build/deploy in Kubeflow with CUDA 11.4 and 450 drivers (running on k8s nodes that only support CUDA 11.0) → does NOT work in the Riva-speech server with CUDA 11.4 and 450 drivers
- The latest supported CUDA version in Kubernetes (GKE) is 11.0; however, the Riva-speech server running in k8s has no problems loading models exported on VMs with newer driver versions (460 and 495)
Is there anything you could recommend we do to make this work in k8s and Kubeflow with CUDA 11.0?