.numpy() is very slow on Jetson Nano

I’m running some experiments with semantic segmentation models on a Nano. One of the models I am using is DeepLabv3 with a MNv2 backbone, trained on the ADE20K dataset.

I noticed that the performance I was getting was very poor (~720 ms per inference), on par with what I get on my five-year-old laptop running TensorFlow on CPU.

I debugged the issue and traced it to the memory copy, that is, EagerTensor.numpy(). Inference itself is very fast, as expected, but converting the results to a format I can use in the rest of my workflow takes MUCH longer (up to 20x the inference time with heavier networks like PSPNet).

I tried using TF1.x semantics with tf.Session(), but no dice: tf.Session.run() becomes the bottleneck, with similar timings. I also tried converting the model to TensorRT… which makes it even slower (~920 ms per inference)!

Now, normally I would just say “oh well, memory copy is slow, duh” and try to find another way around it. But isn’t the Nano supposed to share memory between its CPU and GPU? Is there any way to exploit this peculiarity?

I am running TF2.1 on JP4.3.


Could you share a simple script so we can check?
Also, could you try enabling TensorRT to see if it helps?


As I mentioned in the OP, I already tried converting to TRT, and that made the model even slower.
To test, download DeepLabv3, convert to SavedModel using this script, then run

import tensorflow as tf
import numpy as np
import time

imported = tf.saved_model.load('deeplabv3_mnv2_ade20k_train_2018_12_03_saved')
infer = imported.signatures['serving_default']
inp = tf.cast(np.random.uniform(0, 255, [1, 513, 513, 3]), tf.uint8)

t0 = time.time()
infer(inp)
t1 = time.time()
print((t1 - t0) * 1000)  # ms for one inference
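One caveat with timing a single call: the first invocation of a loaded signature also pays tf.function tracing and CUDA initialization costs. A fairer measurement warms up first and averages several runs. Below is a minimal, hedged harness: it uses a NumPy stand-in workload so it runs anywhere, and the comments show where the real model calls would go (the output key name 'semantic' there is only a placeholder, not a confirmed key of this model):

import time
import numpy as np

def bench_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds, skipping
    warm-up calls so one-time tracing/init overhead is excluded."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1000.0

# Stand-in workload. On the Nano you would time the two stages
# separately to isolate the host copy, e.g. (hypothetical key name):
#   bench_ms(lambda: infer(inp))                      # graph execution only
#   bench_ms(lambda: infer(inp)['semantic'].numpy())  # execution + host copy
x = np.random.rand(513, 513, 3).astype(np.float32)
print(f"{bench_ms(lambda: np.square(x)):.2f} ms")

Comparing the two numbers directly shows how much of the per-frame cost is the .numpy() copy rather than inference.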


Sorry for the late reply.

Would you mind checking the memory status while running the script?
If memory usage reaches the limit, it may impact performance.
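For reference, on Jetson the usual way to watch memory (and CPU/GPU load) live is to run `sudo tegrastats` in another terminal. As a rough, Linux-generic alternative, a small script can parse /proc/meminfo to report total vs. available memory while the benchmark runs (a sketch, not a Jetson-specific tool):

def meminfo_mb():
    """Return (total, available) system memory in MB from /proc/meminfo."""
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            fields[key] = int(value.split()[0]) // 1024  # kB -> MB
    return fields['MemTotal'], fields['MemAvailable']

total, avail = meminfo_mb()
print(f"total: {total} MB, available: {avail} MB")

If "available" drops close to zero during inference, the system is likely swapping, which would explain a large slowdown.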