LSTM Embedded Accelerator

We want to implement an LSTM on an embedded device with the following Keras structure:
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(1, 800, 2)]             0
 conv1d (Conv1D)             (1, 289, 32)              32800
 conv1d_1 (Conv1D)           (1, 162, 32)              131104
 conv1d_2 (Conv1D)           (1, 99, 32)               65568
 lstm (LSTM)                 (1, 16)                   3136
 dense (Dense)               (1, 128)                  2176
 dense_1 (Dense)             (1, 128)                  16512
 dense_2 (Dense)             (1, 104)                  13416
=================================================================
Total params: 264,712
Trainable params: 264,712
Non-trainable params: 0
_________________________________________________________________
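For reference, the hyperparameters can be reconstructed from the summary: the kernel sizes follow from the parameter counts (e.g. 32800 = 512·2·32 + 32) and the strides from the output lengths. A minimal sketch in Keras; the ReLU activations are placeholders, since the summary does not show activation functions:

```python
import tensorflow as tf

# Reconstruction of the model from the summary above. Kernel sizes are
# derived from the parameter counts, strides from the output lengths;
# the ReLU activations are placeholders.
inputs = tf.keras.Input(shape=(800, 2), batch_size=1)           # (1, 800, 2)
x = tf.keras.layers.Conv1D(32, 512, activation="relu")(inputs)  # (1, 289, 32)
x = tf.keras.layers.Conv1D(32, 128, activation="relu")(x)       # (1, 162, 32)
x = tf.keras.layers.Conv1D(32, 64, activation="relu")(x)        # (1, 99, 32)
x = tf.keras.layers.LSTM(16)(x)                                 # (1, 16)
x = tf.keras.layers.Dense(128, activation="relu")(x)            # (1, 128)
x = tf.keras.layers.Dense(128, activation="relu")(x)            # (1, 128)
outputs = tf.keras.layers.Dense(104)(x)                         # (1, 104)
model = tf.keras.Model(inputs, outputs)
model.summary()  # reproduces the 264,712-parameter table above
```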

This Keras model was converted to TensorFlow Lite (.tflite) and quantized with full-integer post-training quantization. As a result, all model weights and activation outputs are 8-bit integer data.
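The conversion follows the standard TensorFlow Lite recipe, roughly like this (a sketch; `representative_data_gen` stands in for our calibration-data generator):

```python
import tensorflow as tf

# Full-integer post-training quantization (sketch).
# representative_data_gen is a placeholder for the calibration-data
# generator, which must yield input samples shaped like (1, 800, 2).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```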

We implemented the net on a Google Coral TPU. In a desktop application the runtime of the net was 7 ms; on our embedded system it is nearly 60 ms! This is way too long. The issue seems to be the LSTM layer.

With LSTM layer:

mean: 63.2425 ms
min: 53.3066 ms
max: 79.416 ms

Without LSTM layer:

mean: 6.85505 ms
min: 3.44741 ms
max: 17.8879 ms
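Timings like these can be collected with a simple per-invoke loop on the device, e.g. (model path and delegate name are the usual Coral values):

```python
import time

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the Edge TPU-compiled model through the Edge TPU delegate.
# Path and delegate name are the typical Coral defaults.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

times = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    t0 = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - t0) * 1e3)  # milliseconds

print(f"mean: {np.mean(times):g} ms")
print(f"min:  {np.min(times):g} ms")
print(f"max:  {np.max(times):g} ms")
```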

We want an AI accelerator connected to our embedded platform; I2S, USB, or Ethernet connections all work fine for us. We just want the runtime to be predictably under 10 ms, with easy access from an embedded Linux system (Yocto setup). What options does NVIDIA offer? Do you face similar challenges with LSTM layers?

One of the Jetson devices may be a possibility. You can certainly connect to them over Ethernet. To get the best inference performance, you would want to experiment with TensorRT (which is certainly possible on Jetson). There are separate forums for Jetson discussions as well as TensorRT usage.
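The usual route from a Keras model into TensorRT is via an ONNX export, which the TensorRT ONNX parser then consumes. A rough sketch, assuming the tf2onnx package (names, paths, and opset are illustrative; `model` is the Keras model from the summary above):

```python
import tensorflow as tf
import tf2onnx

# Export the Keras model to ONNX so TensorRT's ONNX parser can consume it.
# Opset 13 carries the LSTM op; input name and output path are illustrative.
spec = (tf.TensorSpec((1, 800, 2), tf.float32, name="input_1"),)
tf2onnx.convert.from_keras(model, input_signature=spec, opset=13,
                           output_path="model.onnx")
```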

NVIDIA submits various Jetson devices to a variety of MLPerf benchmarks; those results can be found with a quick Google search (here is an example) or directly on the MLCommons website.

Dear @Robert_Crovella,

Thanks for the feedback. Can you share benchmark results for a model that uses LSTM or GRU layers? Is the compiler capable of accelerating LSTMs / GRUs?

Best regards

Marc

Yes, TensorRT can be used with LSTMs (see here).
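As a rough sketch, building an engine from an ONNX export with the TensorRT Python API looks like this; note that the INT8 flag additionally requires a calibrator or per-tensor quantization scales, which is omitted here:

```python
import tensorrt as trt

# Parse the ONNX export and build a serialized engine (sketch).
# Paths are illustrative; INT8 mode additionally needs calibration data
# or per-tensor scales, omitted for brevity.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
```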

I don’t have a comprehensive LSTM benchmarking document. Google is your friend. Here is an example.

Detailed questions about TensorRT usage should be directed to the relevant forum. You may get better guidance there about LSTM usage and performance.