Please close all other applications and Press Enter to continue...
Setting Jetson xavier-nx in max performance mode
gpu frequency is set from 114750000 Hz --> to 1109250000 Hz
dla frequency is set from 1100800000 Hz --> to 1100800000 Hz
------------Executing ResNet50_224x224------------
--------------------------
Model Name: ResNet50_224x224
FPS:837.79
--------------------------
Wall Time for running model (secs): 405.63667702674866
In my experience with PyTorch, the very first inference or training run takes longer; I believe it is loading a lot of code pages for the kernels.
Can you try doing a warm-up of, say, 100 iterations before measuring the speed? You will also want to time more than 20 iterations. It may also help to run sudo jetson_clocks beforehand, if you haven't already.
We can get roughly 29.76 FPS with the following test code:
def infer():
    # Warm-up pass: the first iterations include CUDA initialization cost.
    tic = time.time()
    for i in range(100):
        model(img)
    print("warmup: ", time.time() - tic)
    # Timed pass.
    tic = time.time()
    for i in range(100):
        model(img)
    print("test: ", time.time() - tic)
Based on the tegrastats results, PyTorch seems to have reached its limit.
... GR3D_FREQ 99%@1109
This is also similar to the benchmark results we shared on this page:
To get performance similar to jetson_benchmark, you will need to convert the model into a TensorRT engine first.
Based on the results in torch2trt, the FPS can increase from 55.5 (PyTorch) to 312 (TensorRT) on Xavier.
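For reference, a minimal torch2trt conversion sketch along the lines of its README (using a torchvision ResNet-50 as a stand-in; fp16_mode asks TensorRT to build a half-precision engine):

import torch
import torchvision.models as models
from torch2trt import torch2trt

# Build the PyTorch model and a sample input on the GPU.
model = models.resnet50(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Convert to a TensorRT engine; fp16_mode=True requests FP16 kernels.
model_trt = torch2trt(model, [x], fp16_mode=True)

# The converted module is called like the original one.
y_trt = model_trt(x)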
torch2trt lists 55.5 FPS (PyTorch) and 312 FPS (TensorRT) when the data type is float16, as listed here. But I get:
1. 46 FPS when I use PyTorch with float16.
2. 67 FPS when I use TensorRT with float16.
What am I missing here? I am using the following code:
import torch
import timeit
import torchvision.models as models
import numpy as np
from time import time
# from torch2trt import torch2trt

def inference_test():
    device = torch.device('cuda:0')
    # Create model and input.
    model = models.resnet50(pretrained=True).half()
    tmp = (np.random.standard_normal([1, 3, 224, 224]) * 255).astype(np.uint8)
    # tmp = (np.random.standard_normal([1, 3, 416, 416]) * 255).astype(np.uint8)  # mobilenet_v2
    # move them to the device
    model.eval()
    model.to(device)
    img = torch.from_numpy(tmp.astype(np.float16)).to(device)
    # convert to TensorRT feeding sample data as input
    # model_trt = torch2trt(model, [img])

    def infer():
        with torch.no_grad():
            before = time()
            outs = model(img)
            # outs = model_trt(img)
            infer_time = time() - before
        return infer_time

    print("Running warming up iterations..")
    for i in range(0, 100):
        infer()
    total_times = timeit.repeat(stmt=infer, repeat=1, number=500)
    print("Timeit.repeat: ", total_times)
    print("FPS: ", 500 / np.array(total_times).mean())

inference_test()
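As an aside that is not confirmed anywhere in this thread, one setting that can affect convolution throughput in PyTorch benchmarks with a fixed input size is cuDNN autotuning; a minimal sketch:

import torch

# Let cuDNN benchmark its convolution algorithms for the (static) input
# shape and cache the fastest one before the timed runs.
torch.backends.cudnn.benchmark = True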
I find it's not straightforward to deploy torch/detectron2 Faster R-CNN/Mask R-CNN models on the Xavier NX, but I will be waiting for your findings, as you mentioned here.
In the meantime, I am also checking other options, and I am a bit confused after finding so many of them.
I am looking for a pipeline to deploy object detectors/keypoint detectors on my Jetson Xavier NX. The model should be able to process in real time, and this project will run for a long time. Later on, I also have to deploy an instance segmentation model on an HPC system with NVIDIA cards for another project.
Can you please suggest which route I should take?
Sorry, we still need some more time for the Detectron2 model issue.
We will update the Detectron2 on Jetson NX topic once the experiment is done.
For Xavier NX, the fastest mode is INT8, but it requires a calibration cache file.
If performance is critical for your usage, it’s recommended to try INT8 with TensorRT.
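As a rough sketch of what the INT8 path could look like with torch2trt (the int8_calib_dataset argument and the per-item format are assumptions to verify against the torch2trt documentation, and the random tensors below only stand in for real calibration images):

import torch
import torchvision.models as models
from torch2trt import torch2trt

model = models.resnet50(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Hypothetical calibration set: each item is assumed to be a list with one
# tensor per model input, without the batch dimension. Real calibration
# should use representative images, not random data.
calib_dataset = [[torch.rand(3, 224, 224).cuda()] for _ in range(64)]

# Ask TensorRT to build an INT8 engine, calibrating over the dataset.
model_trt = torch2trt(model, [x], int8_mode=True, int8_calib_dataset=calib_dataset)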
Here is some status to share with you.
There are some unsupported layers used in the Detectron2 model, e.g. generateProposals, CollectRpnProposals, …, etc.
We are working on adding these layers to our plugin library or the ONNX parser.
Once that is done, we will post an update to this topic.