We want to implement an LSTM-based network on an embedded device with the following Keras structure:
Model: "model"
_________________________________________________________________
Layer (type)             Output Shape          Param #
=================================================================
input_1 (InputLayer)     [(1, 800, 2)]         0
conv1d (Conv1D)          (1, 289, 32)          32800
conv1d_1 (Conv1D)        (1, 162, 32)          131104
conv1d_2 (Conv1D)        (1, 99, 32)           65568
lstm (LSTM)              (1, 16)               3136
dense (Dense)            (1, 128)              2176
dense_1 (Dense)          (1, 128)              16512
dense_2 (Dense)          (1, 104)              13416
=================================================================
Total params: 264,712
Trainable params: 264,712
Non-trainable params: 0
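For reference, here is a minimal Keras sketch that reproduces the shapes and parameter counts above. The kernel sizes (512 / 128 / 64), stride 1, valid padding and ReLU activations are inferred from the summary, not taken from our original code:

```python
# Sketch of the architecture, reconstructed from the summary above.
# Kernel sizes 512/128/64 with stride 1 and valid padding are inferred
# from the parameter counts and output lengths; activations are assumed.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(800, 2), batch_size=1)    # [(1, 800, 2)]
x = layers.Conv1D(32, 512, activation="relu")(inputs)    # (1, 289, 32), 32800 params
x = layers.Conv1D(32, 128, activation="relu")(x)         # (1, 162, 32), 131104 params
x = layers.Conv1D(32, 64, activation="relu")(x)          # (1, 99, 32),  65568 params
x = layers.LSTM(16)(x)                                   # (1, 16),      3136 params
x = layers.Dense(128, activation="relu")(x)              # (1, 128),     2176 params
x = layers.Dense(128, activation="relu")(x)              # (1, 128),     16512 params
outputs = layers.Dense(104)(x)                           # (1, 104),     13416 params

model = tf.keras.Model(inputs, outputs)
model.summary()  # should reproduce the table above
```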
This Keras model was converted to TensorFlow Lite (.tflite) and quantized with full-integer post-training quantization, so all model weights and activation outputs are 8-bit integers.
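The conversion was the standard TFLiteConverter full-integer workflow, roughly like the sketch below (the representative dataset generator and the file name are placeholders):

```python
# Sketch of the full-integer post-training quantization.
# representative_data() is a placeholder; in practice it should yield
# a few hundred real (1, 800, 2) float32 input windows.
import numpy as np
import tensorflow as tf

def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 800, 2).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # int8 inputs/outputs
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```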
We deployed the network on a Google Coral Edge TPU. In a desktop application the inference time was about 7 ms, but on our embedded system it is nearly 60 ms, which is far too long. The bottleneck appears to be the LSTM layer; per-inference statistics are below (a measurement sketch follows the numbers):
with LSTM layer:
  mean: 63.2425 ms
  min:  53.3066 ms
  max:  79.416 ms

without LSTM layer:
  mean: 6.85505 ms
  min:  3.44741 ms
  max:  17.8879 ms
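The numbers above come from timing interpreter.invoke() on the device. A minimal measurement sketch, where the model file name, delegate path and iteration count are assumptions:

```python
# Sketch of per-inference latency measurement on the Coral Edge TPU,
# assuming an Edge-TPU-compiled model and the standard libedgetpu delegate.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path="model_int8_edgetpu.tflite",
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

times = []
for _ in range(1000):
    data = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)
    interpreter.set_tensor(inp["index"], data)
    t0 = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - t0) * 1000.0)  # ms

print(f"mean: {np.mean(times):.4f} ms  min: {np.min(times):.4f} ms  max: {np.max(times):.4f} ms")
```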
We are looking for an AI accelerator to attach to our embedded platform; I2S, USB, or Ethernet connections would all be fine. We need the runtime to be predictably under 10 ms, with easy access from an embedded Linux system (Yocto setup). What options does NVIDIA offer? Do you face similar challenges with LSTM layers?