Inference is so slow with torch1.6

cogbot · September 18, 2020, 12:30pm

I am trying the following code on my Xavier NX.

import torch
import torchvision.models as models
import numpy as np
import timeit

def inference_test():
    device = torch.device('cuda:0')

    # Create model and input.
    model = models.resnet50(pretrained=True)
    tmp = (np.random.standard_normal([1, 3, 224, 224]) * 255).astype(np.uint8)

    # move them to the device 
    model.to(device)   
    img = torch.from_numpy(tmp.astype(np.float32)).to(device)

    def infer():
        outs = model(img)

    print(timeit.timeit(stmt=infer, number=20))

output:

5.264096206999966

$ jetson_release

NVIDIA Jetson Xavier NX (Developer Kit Version)
- Jetpack 4.4 [L4T 32.4.3]
- NV Power Mode: MODE_15W_6CORE - Type: 2
- jetson_stats.service: active
Libraries:
- CUDA: 10.2.89
- cuDNN: 8.0.0.180
- TensorRT: 7.1.3.0
- Visionworks: 1.6.0.501
- OpenCV: 4.1.1 compiled CUDA: NO
- VPI: 0.3.7
- Vulkan: 1.2.70

$ sudo jetson_clocks --show

SOC family:tegra194  Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
cpu4: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
cpu5: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=1 c6=1 
GPU MinFreq=114750000 MaxFreq=1109250000 CurrentFreq=114750000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=204000000 FreqOverride=0
Fan: speed=0
NV Power Mode: MODE_15W_6CORE

Python:

Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

torch and torchvision:

import torch
torch.__version__
'1.6.0'
import torchvision
torchvision.__version__
'0.7.0a0+78ed10c'

Can you please help me debug this issue? Thanks.

By the way, when I benchmark resnet50 using jetson_benchmark, I get the expected results.

sudo python3 benchmark.py --model_name resnet --csv_file_path <path-to>/benchmark_csv/nx-benchmarks.csv --model_dir <absolute-path-to-downloaded-models>

Output:

Please close all other applications and Press Enter to continue...
Setting Jetson xavier-nx in max performance mode
gpu frequency is set from 114750000 Hz --> to 1109250000 Hz
dla frequency is set from 1100800000 Hz --> to 1100800000 Hz
------------Executing ResNet50_224x224------------

--------------------------

Model Name: ResNet50_224x224 
FPS:837.79 

--------------------------

Wall Time for running model (secs): 405.63667702674866

dusty_nv · September 18, 2020, 3:29pm

In my experience with PyTorch, the very first inference or training run takes longer. I think it is loading lots of code pages for kernels, or something similar.

Can you try doing a warm-up of, say, 100 iterations before measuring the speed? You will also want to time more than 20 iterations. It may also help to run sudo jetson_clocks beforehand, if you haven’t already.
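
A minimal sketch of that measurement pattern, assuming a hypothetical helper named `benchmark` (it is not part of PyTorch or timeit). It runs untimed warm-up calls first, then times a larger batch. CUDA kernels are launched asynchronously, so an optional `sync` callable (for example `torch.cuda.synchronize`) is invoked before each clock read.

```python
import time


def benchmark(fn, warmup=100, iters=500, sync=None):
    """Return (elapsed_seconds, fps) for `iters` timed calls of `fn`.

    `benchmark` and its parameters are hypothetical names for illustration.
    `sync`, if given, is called before each clock read so that queued
    asynchronous work (e.g. CUDA kernels) is included in the measurement.
    """
    for _ in range(warmup):      # untimed warm-up calls
        fn()
    if sync is not None:
        sync()                   # drain warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()                   # wait for the timed work to finish
    elapsed = time.perf_counter() - start
    return elapsed, iters / elapsed


if __name__ == "__main__":
    # Stand-in workload; on the Jetson this would be something like
    #   benchmark(lambda: model(img), sync=torch.cuda.synchronize)
    elapsed, fps = benchmark(lambda: sum(range(1000)), warmup=10, iters=200)
    print(f"{elapsed:.4f} s, {fps:.1f} calls/s")
```

Passing the workload as a callable keeps the harness independent of the model, so the same function can time a PyTorch module or a converted engine.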

cogbot · September 18, 2020, 10:26pm

@dusty_nv, yes, I also noticed that the first run takes a lot longer. I tested as you suggested, but there is still no improvement.
I did as follows:

Changed my code to run 100 iterations each for warm-up and test.

print("warmup: ", timeit.timeit(stmt=infer, number=100))
print("test: ", timeit.timeit(stmt=infer, number=100))

Run jetson_clocks

sudo jetson_clocks

Run the test
Output:

warmup:  7.369351925000956
test:  4.321095194005466
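
For reference, these totals can be converted to throughput and per-image latency. A small sketch using the numbers reported in this thread (the helper name `report` is made up for illustration):

```python
def report(label, total_seconds, iterations):
    """Convert a timeit total into frames per second and ms per frame."""
    fps = iterations / total_seconds
    ms_per_frame = 1000.0 * total_seconds / iterations
    print(f"{label}: {fps:.2f} FPS, {ms_per_frame:.1f} ms/frame")
    return fps, ms_per_frame


report("first post (20 iterations)", 5.264096206999966, 20)
report("warmup (100 iterations)", 7.369351925000956, 100)
report("test (100 iterations)", 4.321095194005466, 100)
```

The timed run works out to roughly 23 FPS (about 43 ms per image), compared with under 4 FPS in the first measurement, which included the slow initial iterations.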

AastaLLL · September 21, 2020, 3:50am

Hi,

We can get roughly 29.76 fps with the following test code.

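import time

# `model` and `img` are the ones created in the original post.
def infer():
    # First 100 iterations: warm-up.
    tic = time.time()
    for i in range(100):
        model(img)
    print("warmup: ", time.time()-tic)

    # Next 100 iterations: the actual measurement.
    tic = time.time()
    for i in range(100):
        model(img)
    print("test: ", time.time()-tic)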

warmup:  5.711843490600586
test:  3.3672728538513184

Based on the tegrastats results, PyTorch seems to have reached its limit.

... GR3D_FREQ 99%@1109

This is also similar to our benchmark result shared on this page:

To get performance similar to jetson_benchmark, you will need to convert the model into a TensorRT engine first.
Based on the results in torch2trt, the fps can increase from 55.5 (PyTorch) to 312 (TensorRT) on Xavier.

Thanks.

cogbot · September 21, 2020, 8:49am

torch2trt lists 55.5 fps (PyTorch) and 312 fps (TensorRT) when the data type is float16, as listed here. But I get:
1. 46 fps when I use PyTorch with float16.
2. 67 fps when I use TensorRT with float16.
What am I missing here? I am using the following code:

import torch
import timeit
import torchvision.models as models
import numpy as np
from time import time
# from torch2trt import torch2trt

def inference_test():
    device = torch.device('cuda:0')

    # Create model and input.
    model = models.resnet50(pretrained=True).half()
    tmp = (np.random.standard_normal([1, 3, 224, 224]) * 255).astype(np.uint8)  
    # tmp = (np.random.standard_normal([1, 3, 416, 416]) * 255).astype(np.uint8)  #mobilenet_v2

    # move them to the device 
    model.eval()
    model.to(device)   
    img = torch.from_numpy(tmp.astype(np.float16)).to(device)

    # convert to TensorRT feeding sample data as input
    # model_trt = torch2trt(model, [img])

    def infer():
        with torch.no_grad():
            before = time()
            outs = model(img)
            # outs = model_trt(img)
            infer_time = time() - before
        return infer_time

    print("Running warming up iterations..")
    for i in range(0, 100):
        infer()

    total_times = timeit.repeat(stmt=infer, repeat=1, number=500)    
    print("Timeit.repeat: ", total_times)
    print("FPS: ", 500 / np.array(total_times).mean())

inference_test()

AastaLLL · September 22, 2020, 5:38am

Hi,

The difference between 55.5 and 46 may be caused by the overhead in the loop.

However, the 67 fps for TensorRT looks odd to us.
May I know how you converted the model into a TensorRT engine?

Thanks.

cogbot · September 22, 2020, 7:21am

The difference between 55.5 and 46 may be caused by the overhead in the loop.

What does that mean?
I am timing only the forward pass, as below:

before = time()
outs = model(img)
infer_time = time() - before

Please check my code above.

May I know how you converted the model into a TensorRT engine?

Yes. I am using the torch2trt library from here.

AastaLLL · September 23, 2020, 7:02am

Hi,

Would you mind sharing the log from converting the model into TensorRT with us?
It seems that you are still using PyTorch rather than TensorRT.

Thanks.

cogbot · September 23, 2020, 10:31am

Hi @AastaLLL, thanks for your reply. I converted it as follows:

model_trt = torch2trt(model, [img])

I couldn’t find how to print the log, but I found that fp16_mode=False is the default. After setting it to True as below, I get 201 FPS.

model_trt = torch2trt(model, [img], fp16_mode=True)

I find it’s not straightforward to deploy torch/detectron2 Faster RCNN/Mask RCNN models on Xavier NX. But I will be waiting for your findings as you said here.

Now I am also checking for other options. I am a bit confused after finding so many options.

I am looking for a pipeline to deploy object detectors/keypoint detectors on my Jetson Xavier NX. My model should be able to process in real-time. I have to run this project for a long time. Later on, I have to deploy an instance segmentation model on HPC with Nvidia cards for another project.

Can you please suggest which route I should take?

AastaLLL · October 13, 2020, 9:12am

Hi,

Sorry, we still need some time for the Detectron2 model issue.
We will update the Detectron2 Jetson NX topic once the experiment is done.

For Xavier NX, the fastest mode is INT8, but it requires a calibration cache file.
If performance is critical for your use case, we recommend trying INT8 with TensorRT.
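
To illustrate why INT8 needs calibration data: float activations have to be mapped onto 8-bit integers, and the scale for that mapping is derived from representative inputs. The sketch below uses simple symmetric max-abs calibration in plain Python. This is a simplified stand-in chosen for illustration, not TensorRT's actual calibrator, and all function names are hypothetical.

```python
def calibrate_scale(samples):
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x, scale):
    """Map a float to a signed 8-bit integer, clamping out-of-range values."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map the 8-bit integer back to an approximate float."""
    return q * scale


# Representative activations from a (hypothetical) calibration run.
calibration_data = [-2.0, -0.5, 0.0, 0.75, 1.5, 2.54]
scale = calibrate_scale(calibration_data)   # about 0.02 (2.54 / 127)

for x in [0.5, 1.0, 3.0]:                   # 3.0 lies outside the calibrated range
    q = quantize(x, scale)
    print(x, "->", q, "->", round(dequantize(q, scale), 4))
```

Values outside the range seen during calibration saturate at the largest representable magnitude, which is why the calibration inputs should resemble real data.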

Thanks.

cogbot · October 14, 2020, 7:37am

Hi @AastaLLL, thanks. Good to know that you are working on it. Looking forward to the update.

AastaLLL · October 23, 2020, 6:28am

Hi,

Here is a status update.
There are some unsupported layers used in the Detectron2 model, e.g. generateProposals, CollectRpnProposals, etc.

We are working on adding these layers to our plugin library or ONNX parser.
Once that is done, we will post an update to this topic.

Thanks.

cogbot · October 23, 2020, 9:45am

@AastaLLL thanks again for your update. Looking forward to…