Hey Nvidia Forum community,
I’m facing a performance discrepancy on the Jetson AGX Orin 32GB Developer Kit and would love to get your insights. Specifically, I’ve noticed a significant difference in latency between the TensorRT Python API and trtexec. Surprisingly, this wasn’t the case when I was working with a T4 GPU.
I am using JetPack 5.1.1 on the AGX Orin board.
trtexec command:
/usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best.engine
trtexec output:
[07/05/2023-00:55:57] [I] === Performance summary ===
[07/05/2023-00:55:57] [I] Throughput: 119.089 qps
[07/05/2023-00:55:57] [I] Latency: min = 8.48901 ms, max = 13.4727 ms, mean = 9.13869 ms, median = 8.75708 ms, percentile(90%) = 10.3955 ms, percentile(95%) = 11.6753 ms, percentile(99%) = 12.9832 ms
[07/05/2023-00:55:57] [I] Enqueue Time: min = 1.14331 ms, max = 3.45764 ms, mean = 2.09376 ms, median = 2.09064 ms, percentile(90%) = 2.96436 ms, percentile(95%) = 3.04156 ms, percentile(99%) = 3.36731 ms
[07/05/2023-00:55:57] [I] H2D Latency: min = 0.42981 ms, max = 0.866577 ms, mean = 0.737847 ms, median = 0.738739 ms, percentile(90%) = 0.774658 ms, percentile(95%) = 0.786621 ms, percentile(99%) = 0.83374 ms
[07/05/2023-00:55:57] [I] GPU Compute Time: min = 7.90393 ms, max = 12.6794 ms, mean = 8.37038 ms, median = 7.9917 ms, percentile(90%) = 9.60596 ms, percentile(95%) = 10.8478 ms, percentile(99%) = 12.171 ms
[07/05/2023-00:55:57] [I] D2H Latency: min = 0.0187988 ms, max = 0.0332031 ms, mean = 0.0304636 ms, median = 0.0306396 ms, percentile(90%) = 0.0317383 ms, percentile(95%) = 0.0319824 ms, percentile(99%) = 0.0328369 ms
[07/05/2023-00:55:57] [I] Total Host Walltime: 3.02294 s
[07/05/2023-00:55:57] [I] Total GPU Compute Time: 3.01334 s
[07/05/2023-00:55:57] [W] * GPU compute time is unstable, with coefficient of variance = 11.1399%.
[07/05/2023-00:55:57] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/05/2023-00:55:57] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/05/2023-00:55:57] [I]
Python code:
import time

import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class ModelTRT():
    def __init__(self, sEnginePath, sPrecision, bEnd2end=False):
        self.n_classes = 2
        self.class_names = ['nonCoded', 'coded']
        self.sPrecision = sPrecision
        self.bEnd2end = bEnd2end
        logger = trt.Logger(trt.Logger.WARNING)
        logger.min_severity = trt.Logger.Severity.ERROR
        runtime = trt.Runtime(logger)
        trt.init_libnvinfer_plugins(logger, '')  # initialize TensorRT plugins
        with open(sEnginePath, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.imgsz = engine.get_binding_shape(0)[2:]  # read the real input shape from the engine, in case the user passes the wrong one
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, aImage, bProfile=False):
        self.inputs[0]['host'] = np.ravel(aImage)
        if bProfile:
            fCPU2GPUStart = time.time()
        # transfer data to the gpu
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        self.stream.synchronize()
        if bProfile:
            fInferenceStart = time.time()
        # run inference
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle
        )
        self.stream.synchronize()
        if bProfile:
            fGPU2CPUStart = time.time()
        # fetch outputs from gpu
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # synchronize stream
        self.stream.synchronize()
        if bProfile:
            fPostProStart = time.time()
        data = [out['host'] for out in self.outputs]
        if self.bEnd2end:
            num, final_boxes, final_scores, final_cls_inds = data
            final_boxes = np.reshape(final_boxes, (-1, 4))
            dets = np.concatenate([final_boxes[:num[0]],
                                   np.array(final_scores)[:num[0]].reshape(-1, 1),
                                   np.array(final_cls_inds)[:num[0]].reshape(-1, 1)], axis=-1)
        else:
            dets = np.reshape(data, (1, 6, -1))
        if bProfile:
            return dets, fCPU2GPUStart, fInferenceStart, fGPU2CPUStart, fPostProStart
        else:
            return dets
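The timing loop looks roughly like this (the engine path, input shape, and warm-up count below are placeholders, not the exact values from my script):

import time
import numpy as np

model = ModelTRT("best.engine", sPrecision="fp16")
aImage = np.random.rand(1, 3, 640, 640).astype(np.float32)  # placeholder input shape

# warm-up runs so one-time initialization cost is not measured
for _ in range(20):
    model.infer(aImage)

# median over 200 timed runs
lTimes = []
for _ in range(200):
    fStart = time.perf_counter()
    model.infer(aImage)
    lTimes.append(time.perf_counter() - fStart)
print("Inference time (s):", np.median(lTimes))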
Python output (median time of 200 executions):
Inference time (s): 0.015452146530151367
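For reference, here is a minimal sketch (reusing the same model object as above) that times only the execute_async_v2 call with CUDA events, which should be closer to the GPU Compute Time that trtexec reports:

import pycuda.driver as cuda

evStart, evEnd = cuda.Event(), cuda.Event()

# push the input to the device once, outside the timed region
for inp in model.inputs:
    cuda.memcpy_htod_async(inp['device'], inp['host'], model.stream)

evStart.record(model.stream)
model.context.execute_async_v2(bindings=model.bindings, stream_handle=model.stream.handle)
evEnd.record(model.stream)
evEnd.synchronize()
print("GPU compute time (ms):", evStart.time_till(evEnd))

A few warm-up runs before recording the events would make this number more stable.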
Why is there such a difference on the Jetson AGX Orin board?
I am also attaching the ONNX model.
best.onnx (11.8 MB)
Thanks in advance,