TensorRT shows less throughput improvement from batching than PyTorch on Jetson Nano

Description

I am currently measuring the throughput improvement from batching on a Jetson Nano. I run MobileNet V2 with both PyTorch and TensorRT, and I expected TensorRT to show a larger improvement from batching than PyTorch. However, when I run the experiment I get the following results (Improvement is the batch-8 throughput divided by the batch-1 throughput):

PyTorch:

Batch size    Inference time (ms)    Throughput (fps)    Improvement
1             38.82                  25.76               1.00
8             161.93                 49.40               1.92

TensorRT:

Batch size    Inference time (ms)    Throughput (fps)    Improvement
1             13.02                  76.80               1.00
8             90.37                  88.52               1.15

I have attached the scripts used to produce these results. TensorRT is clearly faster in absolute terms, but it gains very little throughput from batching (1.15x vs. 1.92x for PyTorch). Did I do something wrong?

Environment

TensorRT Version: 7.1.3
GPU Type: Maxwell
Nvidia Driver Version: L4T 32.4.3
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6
Baremetal or Container (if container which image + tag):

Relevant Files

Script used to measure the throughput in PyTorch:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import torch
import time

from torchvision import models

ITER = 1000


def run_inference(model, img):
    """Run one forward pass on the GPU and copy the result back to the host."""
    inputs = img.cuda()
    res = model(inputs)
    # .cpu() blocks until the GPU has finished, so timing around this call
    # covers the full forward pass.
    return res.cpu()


def main():
    # Get model
    mobilenet = models.mobilenet_v2(pretrained=True).float().eval().cuda()

    batch_size = 8
    input_shape = [batch_size, 3, 224, 224]

    # Warm up
    for i in range(10):
        img = torch.rand(input_shape).float()
        run_inference(mobilenet, img)

    # Record inference time
    inference_time = []
    for i in range(ITER):
        img = torch.rand(input_shape).float()

        start_t = time.time()
        res = run_inference(mobilenet, img)
        delta_t = (time.time() - start_t) * 1000

        inference_time.append(delta_t)

    avg_inference_time = np.mean(inference_time)  # ms per batch
    throughput = 1000 * batch_size / avg_inference_time  # images per second

    pattern = "{:20}: {:.2f}"
    print(pattern.format("Avg. inference time", avg_inference_time))
    print(pattern.format("Throughput", throughput))


if __name__ == '__main__':
    main()
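
Note on timing: PyTorch launches CUDA work asynchronously, so the measurement above only covers the full forward pass because the .cpu() copy at the end of run_inference blocks until the GPU has finished. A variant of the timing with explicit synchronization (just a sketch for comparison, not what produced the numbers above) would be:

import time
import torch

def timed_inference(model, img):
    """Time one forward pass with explicit GPU synchronization (sketch only)."""
    inputs = img.cuda()
    torch.cuda.synchronize()              # make sure the host-to-device copy is done
    start_t = time.time()
    with torch.no_grad():                 # no autograd bookkeeping during inference
        res = model(inputs)
    torch.cuda.synchronize()              # wait for the forward pass to finish
    elapsed_ms = (time.time() - start_t) * 1000
    return res.cpu(), elapsed_ms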

Script used to export Mobilenet V2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import torch
from torchvision import models


# Load model
model = models.mobilenet_v2(pretrained=True)

# Export to ONNX with a fixed batch size
batch_size = 8
inputs = torch.rand([batch_size, 3, 224, 224])

torch.onnx.export(model,                  # model to be exported
                  inputs,                 # example input used to trace the model
                  "mobilenet_v2_bs%d.onnx" % batch_size,  # output path
                  export_params=True,     # store the trained weights in the file
                  input_names=["input"],  # name of the input tensor
                  output_names=["output"])  # name of the output tensor
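
Before benchmarking, it can be worth confirming that the exported file is well formed. A minimal check, assuming the onnx Python package is installed, might look like:

import onnx

onnx_model = onnx.load("mobilenet_v2_bs8.onnx")
onnx.checker.check_model(onnx_model)   # raises an exception if the graph is invalid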

Command used to measure the throughput in TensorRT:

trtexec --onnx=mobilenet_v2_bs8.onnx --iterations=1000
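
The command above uses the trtexec defaults for warm-up and averaging. In case it matters for the comparison, trtexec also accepts flags such as --warmUp, --avgRuns and --workspace; the values below are only illustrative:

trtexec --onnx=mobilenet_v2_bs8.onnx --iterations=1000 --warmUp=1000 --avgRuns=100 --workspace=256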

Steps To Reproduce

To get the PyTorch measurement, just run the first script directly.

To get the TensorRT measurement, first export the model to .onnx format from PyTorch with the export script, then run trtexec on the resulting file (the full sequence is sketched below).
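
For reference, the full sequence is (the .py filenames are hypothetical; use whatever names the two scripts above are saved under):

python3 export_mobilenet_v2.py       # hypothetical filename; writes mobilenet_v2_bs8.onnx
python3 pytorch_throughput.py        # hypothetical filename; prints the PyTorch numbers
trtexec --onnx=mobilenet_v2_bs8.onnx --iterations=1000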

Hi @Kevin3297,
Could you please share your ONNX model?

Thanks!

Hi @AakankshaS,

Thanks for your response. The ONNX model is exported from torchvision, and the export script is provided under Relevant Files above, so the issue can be reproduced with the scripts as posted.

Thanks