LSTM Embedded Accelerator

We want to implement an LSTM on an embedded device with the following Keras structure:
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(1, 800, 2)]             0
 conv1d (Conv1D)             (1, 289, 32)              32800
 conv1d_1 (Conv1D)           (1, 162, 32)              131104
 conv1d_2 (Conv1D)           (1, 99, 32)               65568
 lstm (LSTM)                 (1, 16)                   3136
 dense (Dense)               (1, 128)                  2176
 dense_1 (Dense)             (1, 128)                  16512
 dense_2 (Dense)             (1, 104)                  13416
=================================================================
Total params: 264,712
Trainable params: 264,712
Non-trainable params: 0
_________________________________________________________________
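For reference, the hyperparameters can be reconstructed from the summary: the kernel sizes follow from the parameter counts (e.g. 32800 = 512·2·32 + 32) and the strides from the output lengths. A minimal sketch in Keras; the ReLU activations are placeholders, since the summary does not show activation functions:

```python
import tensorflow as tf

# Reconstruction of the model from the summary above. Kernel sizes are
# derived from the parameter counts, strides from the output lengths;
# the ReLU activations are placeholders.
inputs = tf.keras.Input(shape=(800, 2), batch_size=1)           # (1, 800, 2)
x = tf.keras.layers.Conv1D(32, 512, activation="relu")(inputs)  # (1, 289, 32)
x = tf.keras.layers.Conv1D(32, 128, activation="relu")(x)       # (1, 162, 32)
x = tf.keras.layers.Conv1D(32, 64, activation="relu")(x)        # (1, 99, 32)
x = tf.keras.layers.LSTM(16)(x)                                 # (1, 16)
x = tf.keras.layers.Dense(128, activation="relu")(x)            # (1, 128)
x = tf.keras.layers.Dense(128, activation="relu")(x)            # (1, 128)
outputs = tf.keras.layers.Dense(104)(x)                         # (1, 104)
model = tf.keras.Model(inputs, outputs)
model.summary()  # reproduces the 264,712-parameter table above
```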

This Keras model was converted to TensorFlow Lite (.tflite) and quantized with full-integer post-training quantization. As a result, all model weights and activation outputs are 8-bit integer data.
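The conversion follows the standard TensorFlow Lite recipe, roughly like this (a sketch; `representative_data_gen` stands in for our calibration-data generator):

```python
import tensorflow as tf

# Full-integer post-training quantization (sketch).
# representative_data_gen is a placeholder for the calibration-data
# generator, which must yield input samples shaped like (1, 800, 2).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```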

We implemented the net on a Google Coral TPU. In a desktop application the runtime of the net was 7 ms; on our embedded system it is nearly 60 ms! This is way too long. The issue seems to be the LSTM layer.

With LSTM layer:

mean: 63.2425 ms
min: 53.3066 ms
max: 79.416 ms

Without LSTM layer:

mean: 6.85505 ms
min: 3.44741 ms
max: 17.8879 ms
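Timings like these can be collected with a simple per-invoke loop on the device, e.g. (model path and delegate name are the usual Coral values):

```python
import time

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the Edge TPU-compiled model through the Edge TPU delegate.
# Path and delegate name are the typical Coral defaults.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

times = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    t0 = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - t0) * 1e3)  # milliseconds

print(f"mean: {np.mean(times):g} ms")
print(f"min:  {np.min(times):g} ms")
print(f"max:  {np.max(times):g} ms")
```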

We want an AI accelerator connected to our embedded platform; I2S, USB, or Ethernet connections all work fine for us. We just want the runtime to be predictably under 10 ms, with easy access from an embedded Linux system (Yocto setup). What options does NVIDIA offer? Do you face similar challenges with LSTM layers?

One of the Jetson devices may be a possibility. You can certainly connect to them over Ethernet. To get the best inference performance, you would want to experiment with TensorRT (which is certainly possible on Jetson). There are separate forums for Jetson discussions as well as TensorRT usage.
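The usual route from a Keras model into TensorRT is via an ONNX export, which the TensorRT ONNX parser then consumes. A rough sketch, assuming the tf2onnx package (names, paths, and opset are illustrative; `model` is the Keras model from the summary above):

```python
import tensorflow as tf
import tf2onnx

# Export the Keras model to ONNX so TensorRT's ONNX parser can consume it.
# Opset 13 carries the LSTM op; input name and output path are illustrative.
spec = (tf.TensorSpec((1, 800, 2), tf.float32, name="input_1"),)
tf2onnx.convert.from_keras(model, input_signature=spec, opset=13,
                           output_path="model.onnx")
```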

NVIDIA submits various Jetson devices to a variety of MLPerf benchmarks; those results can be found with a quick Google search (here is an example) or directly on the MLCommons website.

Dear @Robert_Crovella,

Thanks for the feedback. Can you share benchmark results for a model that uses LSTM or GRU layers? Is the compiler capable of accelerating LSTMs / GRUs?

Best regards

Marc

Yes, TensorRT can be used with LSTMs (see here).
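As a rough sketch, building an engine from an ONNX export with the TensorRT Python API looks like this; note that the INT8 flag additionally requires a calibrator or per-tensor quantization scales, which is omitted here:

```python
import tensorrt as trt

# Parse the ONNX export and build a serialized engine (sketch).
# Paths are illustrative; INT8 mode additionally needs calibration data
# or per-tensor scales, omitted for brevity.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
```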

I don’t have a comprehensive LSTM benchmarking document. Google is your friend. Here is an example.

Detailed questions about TensorRT usage should be directed to the relevant forum. You may get better guidance there about LSTM usage and performance.