Strange jumping results on FPS and inference time

We have a custom network based on the yolov3-tiny model, implemented in PyTorch. It runs on the Jetson Nano without any problems, except that the FPS keeps jumping between roughly 7 and 30 in a kind of cycle. This looked suspicious, so I tested the inference speed in IPython using %timeit and got rather strange results that match what we see during video processing:

In [54]: %timeit -n1 -r1 pred, _ = model(img2)
23.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [55]: %timeit -n1 -r2 pred, _ = model(img2)
18.5 ms ± 4.15 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

In [56]: %timeit -n1 -r5 pred, _ = model(img2)
15.9 ms ± 3.53 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [57]: %timeit -n1 -r10 pred, _ = model(img2)
15.6 ms ± 2.45 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

In [58]: %timeit -n1 -r20 pred, _ = model(img2)
The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
28.6 ms ± 17.1 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

In [59]: %timeit -n1 -r30 pred, _ = model(img2)
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
35 ms ± 17.4 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)

In [60]: %timeit -n1 -r40 pred, _ = model(img2)
38.7 ms ± 16 ms per loop (mean ± std. dev. of 40 runs, 1 loop each)

In [61]: %timeit -n1 -r50 pred, _ = model(img2)
40.8 ms ± 14.9 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [62]: %timeit -n1 -r75 pred, _ = model(img2)
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
43.5 ms ± 17.1 ms per loop (mean ± std. dev. of 75 runs, 1 loop each)

In [63]: %timeit -n1 -r200 pred, _ = model(img2)
The slowest run took 5.82 times longer than the fastest. This could mean that an intermediate result is being cached.
47.5 ms ± 10.7 ms per loop (mean ± std. dev. of 200 runs, 1 loop each)

In [64]: %timeit -n1 -r500 pred, _ = model(img2)
The slowest run took 5.86 times longer than the fastest. This could mean that an intermediate result is being cached.
48.7 ms ± 9.01 ms per loop (mean ± std. dev. of 500 runs, 1 loop each)

Both the model and the image are in GPU memory (fp16 precision), and no time is spent on pre-/post-processing.
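For context, here is a minimal sketch of the setup we mean (load_model() is just a placeholder for our own loading code, and the 1x3x416x416 shape is the yolov3-tiny input size we use):

import torch

device = torch.device('cuda:0')

# load_model() is a placeholder for our own model-loading code
model = load_model().to(device).half()   # fp16 weights, resident on the GPU

# a single preprocessed frame, already on the GPU in fp16
img2 = torch.zeros(1, 3, 416, 416, dtype=torch.float16, device=device)

pred, _ = model(img2)                    # this call is what %timeit measures above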

Why does the speed jump over a roughly 6x range?

Hi b.a.kabakov, have you tried running “sudo jetson_clocks” to see if that helps?

Also, for comparison, have you tried running the TinyYOLO-v3 benchmark from here: https://devtalk.nvidia.com/default/topic/1050377/jetson-nano/deep-learning-inference-benchmarking-instructions/

If you don’t see the issue elsewhere, perhaps it is something inside of your application causing the indeterminism.

have you tried running “sudo jetson_clocks” to see if that helps?

I hadn't tried it before, but I did now and it did not help.

have you tried running TinyYOLO-v3 benchmark

Yes, but it will never show the fluctuating times, because it uses the following formula (from deepstream_reference_apps/yolo/apps/trt-yolo/trt-yolo-app.cpp:L190):

<< " Inference time per image : " << inferElapsed / imageList.size() << " ms"

where inferElapsed is just the cumulative sum of the individual inference times (same file, L120).
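In other words, the benchmark prints only the average over the whole image list, so any per-image jitter is averaged away. A tiny Python illustration of the same arithmetic (the numbers are made up for the example):

# per-image inference times in ms: half fast, half slow
infer_times = [15.0, 110.0] * 50

infer_elapsed = sum(infer_times)   # cumulative sum, like inferElapsed in trt-yolo-app.cpp
print('Inference time per image :', infer_elapsed / len(infer_times), 'ms')
# prints 62.5 ms - one stable-looking number that hides the 15 ms / 110 ms jumps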

Anyway, I think it is a problem with PyTorch on the Jetson; please see the code below:

import torch
import torchvision.models as models
import numpy as np

device = torch.device('cuda:0')

mobilenet = models.mobilenet_v2(pretrained=True)
mobilenet.to(device)

# create random image 
tmp = (np.random.standard_normal([1, 3, 416, 416]) * 255).astype(np.uint8)
img = torch.from_numpy(tmp.astype(np.float32)).to(device)

outs = mobilenet(img)


%timeit -n1 -r1 out = mobilenet(img)
>> 40.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2 out = mobilenet(img)
>> 40.5 ms ± 850 µs per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10 out = mobilenet(img)
>> 82.4 ms ± 32.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20 out = mobilenet(img)
>> The slowest run took 4.24 times longer than the fastest. This could mean that an intermediate result is being cached.
95.8 ms ± 34.6 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

As you can see from the code above, I used only PyTorch's standard models. PyTorch and torchvision were installed following the instructions from https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/post/5326262/

The code above causes no problems and no jumping inference times on a GeForce 1080 Ti with torch-nightly-1.2.0.dev20190530 and torchvision-nightly-0.2.3.

I also tried torch-1.1.0 and torchvision-0.3.0 on the GeForce 1080 Ti – everything is OK.

I have also measured inference times with TensorFlow, installed following the instructions from https://devtalk.nvidia.com/default/topic/1048776/official-tensorflow-for-jetson-nano-/ – no problems with fluctuating inference times there.

import tensorflow as tf
import numpy as np
import tensorflow.contrib.keras as K


config = tf.ConfigProto()
config.gpu_options.allow_growth = True

K.backend.clear_session()
K.backend.set_session(tf.Session(config=config))  # apply allow_growth to the Keras session
K.backend.set_learning_phase(0)
K.backend.set_floatx('float32')
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float32)
    outs = mobilenet(img)


%timeit -n1 -r1  outs = mobilenet(img)
>> 1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 1.08 s ± 8.53 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 1.07 s ± 5.48 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20  outs = mobilenet(img)
>> 1.09 s ± 22.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

Same results with the PyTorch wheels from pytorch/README.md at master · pytorch/pytorch · GitHub.

If you increase the number of runs, what happens? Can you print out the inference time of each run to assess any anomalies?

I added some code to the previous example:

import time

def one_inf():
    start = time.time()
    outs = mobilenet(img)
    end = time.time()
    # return elapsed time in ms
    return (end - start) * 1000

times = []
for i in range(50):
    tmp = one_inf()
    print(tmp)
    times.append(tmp)

This is the output:

41.43524169921875
39.82234001159668
38.2380485534668
49.45063591003418
108.7501049041748
108.68954658508301
108.67023468017578
108.56866836547852
108.70575904846191
108.77728462219238
55.8319091796875
161.68594360351562
108.66403579711914
57.85846710205078
159.58142280578613
108.57248306274414
72.20315933227539
145.62702178955078
108.64663124084473
94.12503242492676
123.34537506103516
108.6277961730957
95.72529792785645
121.85072898864746
108.57295989990234
103.2254695892334
114.29953575134277
108.66355895996094
106.93120956420898
110.40639877319336
108.76774787902832
107.11097717285156
110.24236679077148
108.70885848999023
108.79349708557129
108.79206657409668
108.68215560913086
108.76607894897461
108.71410369873047
108.69383811950684
108.64710807800293
108.77037048339844
108.76750946044922
108.77704620361328
108.69765281677246
108.66308212280273
108.69169235229492
108.8111400604248
108.82806777954102
108.67834091186523

I also asked the same question on the PyTorch GitHub: https://github.com/pytorch/pytorch/issues/21286

The PyTorch developers answered that the proper way to time it is:

torch.cuda.synchronize(); pred, _ = model(img2); torch.cuda.synchronize();
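Applied to the earlier per-run timing loop, that suggestion looks roughly like this (just a sketch; mobilenet and img are the objects from the example above):

import time
import torch

def one_inf_sync():
    torch.cuda.synchronize()       # make sure all previously queued GPU work is done
    start = time.time()
    outs = mobilenet(img)
    torch.cuda.synchronize()       # wait until this inference has actually finished
    end = time.time()
    return (end - start) * 1000    # elapsed time in ms

times = [one_inf_sync() for _ in range(50)]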

Our conversation on GitHub, in short form:

Q: May I conclude that torch.cuda.synchronize() is the proper way to measure the theoretical maximum of network inference speed?
A: Yes, if you are timing the GPU you should use torch.cuda.synchronize(). You need it on both the Jetson and the GeForce card. Otherwise you might only be measuring the time to enqueue the operations on the GPU and not the time to actually execute them (because of the CUDA API's asynchronous nature).

Q: Why do I see such different behavior on the Jetson Nano and the GeForce 1080 Ti? They both have the same API, don't they?
A: The Jetson and the GeForce 1080 Ti have the same API, but different CPUs and a different number of CUDA cores. Both of these affect timing. (Just because the 1080 Ti numbers seemed stable doesn't mean they were accurate.)

Does this seem appropriate from an NVIDIA developer's point of view?

Yes, that would make sense. There is a cudaEvent API that allows for asynchronous timing, which it seems PyTorch doesn't use here, so in lieu of that you can use the synchronization. You wouldn't want that extra synchronization in production code, but for profiling during development it should be OK.
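For completeness, PyTorch does expose CUDA events to user code via torch.cuda.Event, so event-based GPU timing can also be done by hand; a minimal sketch (again assuming the mobilenet/img objects from the examples above):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                         # recorded on the current CUDA stream
outs = mobilenet(img)
end.record()

torch.cuda.synchronize()               # wait until both events have completed
print(start.elapsed_time(end), 'ms')   # GPU time between the events, in milliseconds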