I am using AWS IoT Greengrass to deploy a total of five models to a Jetson NX for predicting facial attributes (age, gender, etc.). The general idea is to extract faces from each image and then run four models on every face crop.
All models are compiled with AWS SageMaker Neo (four PyTorch models, one TFLite model). I am using Python and loading images with OpenCV.
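For context, image loading and preprocessing look roughly like this; a minimal sketch, where the input size and the float32 NCHW layout are assumptions about my attribute models:

```python
import cv2
import numpy as np

def preprocess(path: str, size: int = 224) -> np.ndarray:
    """Load an image with OpenCV and convert it to a float32 NCHW tensor."""
    img = cv2.imread(path)                      # BGR, HWC, uint8
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (size, size))
    img = img.astype(np.float32) / 255.0
    return np.expand_dims(img.transpose(2, 0, 1), axis=0)  # 1x3xHxW
```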
Inference works by first loading all models through the SageMaker Edge Manager agent's gRPC interface (using the generated pb2 stubs) and then sending predict requests, following this example: Build machine learning at the edge applications using Amazon SageMaker Edge Manager and AWS IoT Greengrass V2 | AWS Machine Learning Blog.
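The load/predict calls look roughly like this; a minimal sketch where the socket path, model name, artifact path, and input tensor name are placeholders, and the message and enum names follow the agent.proto from the Edge Manager examples (verify them against your generated stubs):

```python
import grpc
import numpy as np

# Stubs generated from the Edge Manager agent.proto, as in the AWS example.
import agent_pb2
import agent_pb2_grpc

# Unix socket exposed by the aws.greengrass.SageMakerEdgeManager component
# (adjust to your deployment).
channel = grpc.insecure_channel("unix:///tmp/aws.greengrass.SageMakerEdgeManager.sock")
agent = agent_pb2_grpc.AgentStub(channel)

def load_model(name: str, url: str) -> None:
    """Load one Neo-compiled model from its local artifact directory."""
    agent.LoadModel(agent_pb2.LoadModelRequest(name=name, url=url))

def predict(model_name: str, batch: np.ndarray):
    """Send one float32 NCHW tensor to the agent and return the raw PredictResponse."""
    tensor = agent_pb2.Tensor(
        tensor_metadata=agent_pb2.TensorMetadata(
            name=b"input",                    # must match the compiled model's input name
            data_type=agent_pb2.FLOAT32,      # DataType enum from agent.proto
            shape=list(batch.shape),
        ),
        byte_data=batch.astype(np.float32).tobytes(),
    )
    return agent.Predict(agent_pb2.PredictRequest(name=model_name, tensors=[tensor]))

# e.g. load_model("age_model", "/greengrass/v2/work/component/age_model")
```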
However, at runtime GPU utilization sits at around 30%, with long idle gaps between inference calls.
I tried to overcome this with multithreading using a concurrent.futures thread pool: in one attempt I assigned a thread to every model, and in another I assigned a thread to the inference workflow for each detected face (both sketched below). Neither improved GPU utilization.
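Both attempts looked roughly like this, reusing the predict helper above (the model names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

ATTRIBUTE_MODELS = ["age_model", "gender_model", "emotion_model", "glasses_model"]  # placeholder names

# Attempt 1: one thread per model, so the four Predict calls for a single
# face crop are issued concurrently.
def predict_per_model(face_crop):
    with ThreadPoolExecutor(max_workers=len(ATTRIBUTE_MODELS)) as pool:
        futures = {m: pool.submit(predict, m, face_crop) for m in ATTRIBUTE_MODELS}
        return {m: f.result() for m, f in futures.items()}

# Attempt 2: one thread per detected face, each thread running the full
# four-model workflow sequentially on its crop.
def predict_per_face(face_crops):
    def run_workflow(crop):
        return {m: predict(m, crop) for m in ATTRIBUTE_MODELS}
    with ThreadPoolExecutor(max_workers=max(len(face_crops), 1)) as pool:
        return list(pool.map(run_workflow, face_crops))
```

In both variants the Predict calls do overlap on the client side (the blocking gRPC calls release the GIL while waiting), but GPU utilization stays at roughly the same level.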