Speech recognition in TensorRT 5

Hi, I want to accelerate DeepSpeech2 (CNN+RNN+FC) with TRT 5, and I have some questions:

  1. I found that RNN layers in TRT only support TensorFlow models. Is that right? If so, is it necessary to retrain the network in TensorFlow?
  2. I read the CharRNN sample, and it needs to dump the TF weights, create the RNN layers with parameters, convert the TF weights to TRT weights, and so on, which is completely different from converting a CNN model directly by parsing it and building a TRT engine with a few API calls. So I wonder whether it is really necessary to convert an RNN network this way, which seems complex.
  3. Since the input shapes of the CNN and the RNN are different, can I build the network (CNN+RNN) as one TRT engine, or do I need two separate engines? And how can I implement it?
    Thanks a lot.

Hello,

  1. No. TRT can accept trained models in .pb, ONNX, and several other framework formats, but you'll have to extract the weights from the model and convert them to TRT weights as well.

  2. This is because the weights for each gate and layer need to be set separately on the RNN layer, whereas TensorFlow exports the weights with each layer concatenated into a single WTS file. That example starts with a model trained in TensorFlow, but a similar workflow should work to bring in weights from any framework of your choice.

  3. I'm not familiar with DeepSpeech2. Assuming your convolutional network extracts features and then feeds them to an LSTM/RNN cell, I think you have to reshape the CNN output into a time-series sequence. Basically, to connect the CNN to the LSTM, the CNN output needs to be distributed across time. So I think you can and should build one TRT engine; see the sketch after this list.
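
Not DeepSpeech2-specific, but here is a rough sketch of that wiring with the TRT 5 Python API: a conv layer, a shuffle layer that moves the time axis to the front, and an RNNv2 layer whose weights are set per gate and per layer (point 2 above in practice). Every dimension and weight value below is a made-up placeholder, not DeepSpeech2's real configuration.

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
HIDDEN, LAYERS, SEQ = 256, 2, 20  # made-up sizes for illustration

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# CNN front end over a (channels, freq, time) spectrogram; shapes are placeholders.
data = network.add_input("data", trt.float32, (1, 161, 50))
conv = network.add_convolution(data, 32, (11, 11),
                               np.zeros((32, 1, 11, 11), np.float32),
                               np.zeros(32, np.float32))
conv.stride = (2, 2)  # output here is (32, 76, 20)

# Distribute the CNN output across time: move the time axis to the front,
# then flatten channels x freq into one feature vector per time step.
shuffle = network.add_shuffle(conv.get_output(0))
shuffle.first_transpose = trt.Permutation([2, 0, 1])   # -> (20, 32, 76)
shuffle.reshape_dims = (SEQ, 32 * 76)                  # -> (seq, features)

rnn = network.add_rnn_v2(shuffle.get_output(0), LAYERS, HIDDEN, SEQ,
                         trt.RNNOperation.LSTM)

# Weights must be set separately per layer and per gate; this is where
# TensorFlow's concatenated weights get split apart.
gates = (trt.RNNGateType.INPUT, trt.RNNGateType.FORGET,
         trt.RNNGateType.CELL, trt.RNNGateType.OUTPUT)
for layer in range(LAYERS):
    in_size = 32 * 76 if layer == 0 else HIDDEN
    for gate in gates:
        # Placeholder zeros; the real values come from the TF checkpoint.
        rnn.set_weights_for_gate(layer, gate, True,
                                 np.zeros((HIDDEN, in_size), np.float32))
        rnn.set_weights_for_gate(layer, gate, False,
                                 np.zeros((HIDDEN, HIDDEN), np.float32))
        rnn.set_bias_for_gate(layer, gate, True, np.zeros(HIDDEN, np.float32))
        rnn.set_bias_for_gate(layer, gate, False, np.zeros(HIDDEN, np.float32))

network.mark_output(rnn.get_output(0))
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 20
engine = builder.build_cuda_engine(network)  # one engine for CNN + RNN
```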

+1

Thanks for your patient reply, but I am still confused about how to implement it.
Suppose the network is 2 CNN layers + reshape layers + 2 RNN layers, where all ops are supported by TRT and the network is trained in TensorFlow. When I want to convert the .uff file to a TRT engine, which of these do I need to do?
(1) Create the network definition from scratch using the TRT API (network->addInput, network->addConvolution, network->addPooling, network->addRNNv2, as in the CharRNN sample from 2 above), and load and convert the weights from the TF model into the TRT layers.
(2) Just use the UffParser and converter API directly:
engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser, 1, 1 << 20)

I think you can just go with #2.
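
Concretely, #2 looks something like this with the TRT 5 Python API (the trt.utils.uff_to_trt_engine helper in your snippet is from the older, pre-5.0 docs). The file name, node names, and input shape are placeholders for your model:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Parse the UFF file directly into the network definition.
parser = trt.UffParser()
parser.register_input("input_node", (1, 161, 50))  # placeholder name/shape
parser.register_output("logits")                   # placeholder output node
parser.parse("model.uff", network)

builder.max_batch_size = 1
builder.max_workspace_size = 1 << 20
engine = builder.build_cuda_engine(network)
```

The catch is that the parser will fail on any op it does not support.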

OK, I see, thanks a lot.
But DeepSpeech2 contains several operations that are not supported by TRT, so I am going to go with #1 instead. Now I have some questions:

  1. How do I generate the WTS file? The TRT sample just loads the weights from an 'xxx.wts' file but doesn't show how to generate it.
  2. I found that both the CNN and RNN samples that create a network definition and load weights are implemented with the C++ API. Can I create the RNN definition and load and convert weights from a TF model with the TRT Python API? (See the sketch after this list for what I mean.)
  3. I only found a Python sample of creating a CNN definition, network_api_pytorch_mnist, and it looks very easy: just load the weights with self.network.state_dict() and define the CNN network. But the C++ sample sampleMNISTAPI needs a loadWeights function and a .wts file, which is more complex than the Python sample. So I wonder whether the Python API is simply easier for creating a network definition from scratch, or whether it is only easier for PyTorch models, not TF.
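
For reference, here is the kind of direct Python-side weight loading I have in mind instead of a .wts file (a rough sketch; the checkpoint path and variable names are made up):

```python
import tensorflow as tf

# Read every variable of a TF checkpoint into a dict of numpy arrays.
# "model.ckpt" and the variable names below are placeholders.
reader = tf.train.NewCheckpointReader("model.ckpt")
weights = {name: reader.get_tensor(name)
           for name in reader.get_variable_to_shape_map()}

# Each entry is a plain numpy array, so it can be handed to the TRT builder
# API directly, e.g. network.add_convolution(..., weights["conv1/kernel"],
# weights["conv1/bias"]), after transposing TF's HWIO kernel layout to the
# KCRS (out, in, h, w) layout that TRT expects.
```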

Any suggestions?