Strange jumping results on FPS and inference time

We have a custom network based on the yolov3-tiny model, implemented in PyTorch. It runs on the Jetson Nano without any problems, except that the FPS keeps jumping between roughly 7 and 30 in a kind of cycle. This looked suspicious, so I tested the inference speed in IPython using %timeit and got rather strange results that match what we see during video processing:

In [54]: %timeit -n1 -r1 pred, _ = model(img2)
23.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [55]: %timeit -n1 -r2 pred, _ = model(img2)
18.5 ms ± 4.15 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

In [56]: %timeit -n1 -r5 pred, _ = model(img2)
15.9 ms ± 3.53 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [57]: %timeit -n1 -r10 pred, _ = model(img2)
15.6 ms ± 2.45 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

In [58]: %timeit -n1 -r20 pred, _ = model(img2)
The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
28.6 ms ± 17.1 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

In [59]: %timeit -n1 -r30 pred, _ = model(img2)
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
35 ms ± 17.4 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)

In [60]: %timeit -n1 -r40 pred, _ = model(img2)
38.7 ms ± 16 ms per loop (mean ± std. dev. of 40 runs, 1 loop each)

In [61]: %timeit -n1 -r50 pred, _ = model(img2)
40.8 ms ± 14.9 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [62]: %timeit -n1 -r75 pred, _ = model(img2)
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
43.5 ms ± 17.1 ms per loop (mean ± std. dev. of 75 runs, 1 loop each)

In [63]: %timeit -n1 -r200 pred, _ = model(img2)
The slowest run took 5.82 times longer than the fastest. This could mean that an intermediate result is being cached.
47.5 ms ± 10.7 ms per loop (mean ± std. dev. of 200 runs, 1 loop each)

In [64]: %timeit -n1 -r500 pred, _ = model(img2)
The slowest run took 5.86 times longer than the fastest. This could mean that an intermediate result is being cached.
48.7 ms ± 9.01 ms per loop (mean ± std. dev. of 500 runs, 1 loop each)

Both the model and the image are in GPU memory (fp16 precision), and no time is spent on pre-/post-processing.
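For context, here is a minimal sketch of the setup we mean (load_model() is just a placeholder for our own loading code, and the 1x3x416x416 shape is the yolov3-tiny input size we use):

import torch

device = torch.device('cuda:0')

# load_model() is a placeholder for our own model-loading code
model = load_model().to(device).half()   # fp16 weights, resident on the GPU

# a single preprocessed frame, already on the GPU in fp16
img2 = torch.zeros(1, 3, 416, 416, dtype=torch.float16, device=device)

pred, _ = model(img2)                    # this call is what %timeit measures above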

Why does the speed jump over a roughly 6x range?

Hi b.a.kabakov, have you tried running “sudo jetson_clocks” to see if that helps?

Also, for comparison, have you tried running the TinyYOLO-v3 benchmark from here: https://devtalk.nvidia.com/default/topic/1050377/jetson-nano/deep-learning-inference-benchmarking-instructions/

If you don’t see the issue elsewhere, perhaps it is something inside of your application causing the indeterminism.

have you tried running “sudo jetson_clocks” to see if that helps?

I hadn't tried it before, but I did now and it did not help.

have you tried running TinyYOLO-v3 benchmark

Yes, but it will never show the fluctuating times, because it uses the following formula (from deepstream_reference_apps/yolo/apps/trt-yolo/trt-yolo-app.cpp:L190):

<< " Inference time per image : " << inferElapsed / imageList.size() << " ms"

where inferElapsed is just the cumulative sum of the individual inference times (same file, L120).
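In other words, the benchmark prints only the average over the whole image list, so any per-image jitter is averaged away. A tiny Python illustration of the same arithmetic (the numbers are made up for the example):

# per-image inference times in ms: half fast, half slow
infer_times = [15.0, 110.0] * 50

infer_elapsed = sum(infer_times)   # cumulative sum, like inferElapsed in trt-yolo-app.cpp
print('Inference time per image :', infer_elapsed / len(infer_times), 'ms')
# prints 62.5 ms - one stable-looking number that hides the 15 ms / 110 ms jumps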

Anyway, I think it is a problem with PyTorch on the Jetson; please see the code below:

import torch
import torchvision.models as models
import numpy as np

device = torch.device('cuda:0')

mobilenet = models.mobilenet_v2(pretrained=True)
mobilenet.to(device)

# create random image 
tmp = (np.random.standard_normal([1, 3, 416, 416]) * 255).astype(np.uint8)
img = torch.from_numpy(tmp.astype(np.float32)).to(device)

outs = mobilenet(img)


%timeit -n1 -r1 out = mobilenet(img)
>> 40.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2 out = mobilenet(img)
>> 40.5 ms ± 850 µs per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10 out = mobilenet(img)
>> 82.4 ms ± 32.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20 out = mobilenet(img)
>> The slowest run took 4.24 times longer than the fastest. This could mean that an intermediate result is being cached.
95.8 ms ± 34.6 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

As you can see from the code above, I used only PyTorch's standard models. PyTorch and torchvision were installed following the instructions from https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/post/5326262/

The code above causes no problems and no jumping inference times on a GeForce 1080 Ti with torch-nightly-1.2.0.dev20190530 and torchvision-nightly-0.2.3.

I also tried torch-1.1.0 and torchvision-0.3.0 on the GeForce 1080 Ti – everything is OK.

I have also measured inference times with TensorFlow, installed following the instructions from https://devtalk.nvidia.com/default/topic/1048776/official-tensorflow-for-jetson-nano-/ – no problems with fluctuating inference times there.

import tensorflow as tf
import numpy as np
import tensorflow.contrib.keras as K


config = tf.ConfigProto()
config.gpu_options.allow_growth = True

K.backend.clear_session()
K.backend.set_session(tf.Session(config=config))  # apply allow_growth to the Keras session
K.backend.set_learning_phase(0)
K.backend.set_floatx('float32')
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float32)
    outs = mobilenet(img)


%timeit -n1 -r1  outs = mobilenet(img)
>> 1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 1.08 s ± 8.53 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 1.07 s ± 5.48 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20  outs = mobilenet(img)
>> 1.09 s ± 22.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

Same results with the PyTorch wheels from pytorch/README.md at master · pytorch/pytorch · GitHub.

If you increase the number of runs, what happens? Can you print out the inference time of each run to assess any anomalies?

I added some code to the previous example:

import time

def one_inf():
    start = time.time()
    outs = mobilenet(img)
    end = time.time()
    # return elapsed time in ms
    return (end - start) * 1000

times = []
for i in range(50):
    tmp = one_inf()
    print(tmp)
    times.append(tmp)

This is the output:

41.43524169921875
39.82234001159668
38.2380485534668
49.45063591003418
108.7501049041748
108.68954658508301
108.67023468017578
108.56866836547852
108.70575904846191
108.77728462219238
55.8319091796875
161.68594360351562
108.66403579711914
57.85846710205078
159.58142280578613
108.57248306274414
72.20315933227539
145.62702178955078
108.64663124084473
94.12503242492676
123.34537506103516
108.6277961730957
95.72529792785645
121.85072898864746
108.57295989990234
103.2254695892334
114.29953575134277
108.66355895996094
106.93120956420898
110.40639877319336
108.76774787902832
107.11097717285156
110.24236679077148
108.70885848999023
108.79349708557129
108.79206657409668
108.68215560913086
108.76607894897461
108.71410369873047
108.69383811950684
108.64710807800293
108.77037048339844
108.76750946044922
108.77704620361328
108.69765281677246
108.66308212280273
108.69169235229492
108.8111400604248
108.82806777954102
108.67834091186523

I also asked the same question on the PyTorch GitHub: https://github.com/pytorch/pytorch/issues/21286

The PyTorch developers answered that the proper way to time it is:

torch.cuda.synchronize(); pred, _ = model(img2); torch.cuda.synchronize();
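Applied to the earlier per-run timing loop, that suggestion looks roughly like this (just a sketch; mobilenet and img are the objects from the example above):

import time
import torch

def one_inf_sync():
    torch.cuda.synchronize()       # make sure all previously queued GPU work is done
    start = time.time()
    outs = mobilenet(img)
    torch.cuda.synchronize()       # wait until this inference has actually finished
    end = time.time()
    return (end - start) * 1000    # elapsed time in ms

times = [one_inf_sync() for _ in range(50)]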

Our conversation on GitHub, in short form:

Q: May I conclude that torch.cuda.synchronize() is the proper way to measure the theoretical maximum of network inference speed?
A: Yes, if you are timing the GPU you should use torch.cuda.synchronize(). You need it on both the Jetson and the GeForce card. Otherwise you might only be measuring the time to enqueue the operations on the GPU and not the time to actually execute them (because of the CUDA API's asynchronous nature).

Q: Why do I see such different behavior on the Jetson Nano and the GeForce 1080 Ti? They both have the same API, don't they?
A: The Jetson and the GeForce 1080 Ti have the same API, but different CPUs and a different number of CUDA cores. Both of these affect timing. (Just because the 1080 Ti numbers seemed stable doesn't mean they were accurate.)

Does this seem appropriate from an NVIDIA developer's point of view?

Yes, that would make sense. There is a cudaEvent API that allows for asynchronous timing, which it seems PyTorch doesn't use here, so in lieu of that you can use the synchronization. You wouldn't want that extra synchronization in production code, but for profiling during development it should be OK.
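For completeness, PyTorch does expose CUDA events to user code via torch.cuda.Event, so event-based GPU timing can also be done by hand; a minimal sketch (again assuming the mobilenet/img objects from the examples above):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                         # recorded on the current CUDA stream
outs = mobilenet(img)
end.record()

torch.cuda.synchronize()               # wait until both events have completed
print(start.elapsed_time(end), 'ms')   # GPU time between the events, in milliseconds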