.numpy() is very slow on Jetson Nano

I’m running some experiments with semantic segmentation models on a Nano. One of the models I am using is DeepLabv3 with a MNv2 backbone, trained on the ADE20K dataset.

I noticed that the performance I was getting was very poor (~720 ms per inference), on par with what I get on my five-year-old laptop running TensorFlow on CPU.

I debugged the issue and traced it to the memory copy, that is, EagerTensor.numpy(). Inference itself is very fast, as expected, but converting the results to a format I can use in the rest of my workflow takes MUCH longer (up to 20x the inference time with heavier networks like PSPNet).

I tried using TF1.x semantics with tf.Session(), but no dice: tf.Session.run() becomes the bottleneck, with similar timings. I also tried converting the model to TensorRT… which makes it even slower (~920 ms per inference)!

Now, normally I would just say “oh well, memory copy is slow, duh” and try to find another way around it. But isn’t the Nano supposed to share memory between its CPU and GPU? Is there any way to exploit this peculiarity?

I am running TF2.1 on JP4.3.


Could you share a simple script so we can check?
Also, could you try enabling TensorRT to see if it helps?


As I mentioned in the OP, I already tried converting to TRT, and that made the model even slower.
To test, download DeepLabv3, convert to SavedModel using this script, then run

import tensorflow as tf
import numpy as np
import time

imported = tf.saved_model.load('deeplabv3_mnv2_ade20k_train_2018_12_03_saved')
infer = imported.signatures['serving_default']
inp = tf.cast(np.random.uniform(0, 255, [1, 513, 513, 3]), tf.uint8)

t0 = time.time()
infer(inp)
t1 = time.time()
print((t1 - t0) * 1000)  # ms for one inference
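One caveat with timing a single call: the first invocation of a loaded signature also pays tf.function tracing and CUDA initialization costs. A fairer measurement warms up first and averages several runs. Below is a minimal, hedged harness: it uses a NumPy stand-in workload so it runs anywhere, and the comments show where the real model calls would go (the output key name 'semantic' there is only a placeholder, not a confirmed key of this model):

import time
import numpy as np

def bench_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds, skipping
    warm-up calls so one-time tracing/init overhead is excluded."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1000.0

# Stand-in workload. On the Nano you would time the two stages
# separately to isolate the host copy, e.g. (hypothetical key name):
#   bench_ms(lambda: infer(inp))                      # graph execution only
#   bench_ms(lambda: infer(inp)['semantic'].numpy())  # execution + host copy
x = np.random.rand(513, 513, 3).astype(np.float32)
print(f"{bench_ms(lambda: np.square(x)):.2f} ms")

Comparing the two numbers directly shows how much of the per-frame cost is the .numpy() copy rather than inference.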


Sorry for the late reply.

Would you mind checking the memory status while running the script?
If memory usage reaches the limit, it may impact performance.
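For reference, on Jetson the usual way to watch memory (and CPU/GPU load) live is to run `sudo tegrastats` in another terminal. As a rough, Linux-generic alternative, a small script can parse /proc/meminfo to report total vs. available memory while the benchmark runs (a sketch, not a Jetson-specific tool):

def meminfo_mb():
    """Return (total, available) system memory in MB from /proc/meminfo."""
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            fields[key] = int(value.split()[0]) // 1024  # kB -> MB
    return fields['MemTotal'], fields['MemAvailable']

total, avail = meminfo_mb()
print(f"total: {total} MB, available: {avail} MB")

If "available" drops close to zero during inference, the system is likely swapping, which would explain a large slowdown.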