Inference time on Jetson Xavier compared with local host PC?

Hello, everyone. I compared the inference time on the Jetson Xavier and my local host PC. I found that inferring a 28×28-pixel grayscale MNIST image takes only 0.08 ms on my local host PC, while on the Jetson Xavier it takes 0.6 ms. I am confused about these comparison results.

1. I expected inference on the Jetson Xavier to be faster than on the host PC, but the results show the opposite.

2. NVIDIA only gives a simple example of model training in DIGITS (LeNet-5) with the TensorFlow framework, but it doesn't give a detailed procedure for running inference with the trained model (from DIGITS) on the Jetson Xavier.

3. I trained a LeNet-5 model, converted it to lenet5.pb, then converted that to lenet5.uff, but inference on the Jetson Xavier failed.

Does anyone have the same questions?
If you have solutions to the above questions, I would greatly appreciate your kind response.


Could you share how you measured the MNIST performance?

1. If you are using TensorFlow, it's recommended to run the model with TensorRT instead.
The TensorFlow implementation isn't optimized for the ARM and Jetson architecture, which may result in much lower performance.

2. We have lots of inference examples. You can start from these tutorials:

3. TensorRT has a TensorFlow-based LeNet-5 example. It may give you some useful information.

By the way, please remember to maximize the device performance before benchmarking:

sudo nvpmodel -m 0
sudo jetson_clocks



1. I followed the sampleUffMNIST sample, which infers 10 MNIST images. At the 15 W power mode, the average run time is about 0.6 ms; after maximizing the device performance, it decreased to about 0.4 ms.

2. Since I failed to modify the given sampleUffMNIST sample (e.g., to infer 100 images), I just take the 10-image result as the benchmark (0.4 ms per MNIST image).

3. On my host PC, I trained a LeNet-5 model on the MNIST dataset (45,002 training images, 14,998 test images). Here is the network:

from keras.layers import Input, Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.models import Model

img_rows, img_cols = 28, 28  # MNIST image size

def LeNet5(w_path=None):

    input_shape = (1, img_rows, img_cols)  # channels-first: 1 x 28 x 28
    img_input = Input(shape=input_shape)

    x = Conv2D(32, (3, 3), activation='relu', padding='same', name='conv1')(img_input)
    x = MaxPooling2D((3, 3), strides=(2, 2), name='pool1')(x)
    x = Conv2D(64, (3, 3), activation='relu', padding='same', name='conv2')(x)
    x = MaxPooling2D((2, 2), strides=(2, 2), name='pool2')(x)
    x = Dropout(0.25)(x)

    x = Flatten(name='flatten')(x)

    x = Dense(128, activation='relu', name='fc1')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu', name='fc2')(x)
    x = Dropout(0.5)(x)
    x = Dense(10, activation='softmax', name='predictions')(x)

    model = Model(img_input, x, name='LeNet5')
    if w_path:
        model.load_weights(w_path)  # optionally restore pretrained weights

    return model

Although my trained model may differ slightly from the sampleUffMNIST one, both are LeNet-5 networks. Inference on 60,000 MNIST images with my trained LeNet-5 takes only 5 s (that's about 0.08 ms per image).
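For reference, the per-image figure quoted here is just total time divided by image count. A minimal timing sketch (the helper name `per_image_latency_ms` is mine, not from the sample; `infer_fn` stands in for one call to the model):

```python
import time

def per_image_latency_ms(infer_fn, n_images):
    """Run infer_fn once per image and return the mean latency in ms."""
    start = time.perf_counter()
    for _ in range(n_images):
        infer_fn()
    elapsed = time.perf_counter() - start
    return elapsed / n_images * 1e3

# Sanity check of the arithmetic above:
# 5 s over 60,000 images -> 5 / 60000 * 1000 ≈ 0.083 ms per image.
```

For a fair comparison, `infer_fn` should be timed only after the model (or TensorRT engine) is fully built, so any one-time setup cost is excluded from the average.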

So I have doubts about the time comparison showing that host-PC inference is faster than the Jetson Xavier.

Is it because sampleUffMNIST needs to create the TensorRT engine, and since the sample only infers 10 images, the engine-creation time accounts for most of the measured time?

That is just my guess; I hope you have a reasonable explanation.


Could you share the detailed information about your host PC with us first?
Do you have a desktop GPU on the PC?


My host PC's detailed information:

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
total processors: 12

Yes, I have a desktop GPU on the PC: a GeForce GT 730. But I think inference with my trained model didn't use the GPU.
I ran another experiment, doing inference on my colleague's PC, which has no GPU. It still takes 5 s to infer 60,000 MNIST images, the same result as before.

Could you please compare the inference time on your side (taking the simple LeNet-5 as an example) to verify my results?
And could you find the reason why inference on the Jetson takes longer than on a host PC without a GPU?



Please see this page for the Xavier performance benchmark:

For Xavier + 15 W + MNIST (with AlexNet):
It should run at 299 images/sec (batch size = 1) up to 2,270 images/sec (batch size = 128).

A possible reason is that TensorRT requires compilation time when generating a TRT engine from a UFF model.
This may take several minutes, as TensorRT chooses the fastest kernels based on the model and the GPU architecture.

However, this is a one-time job, and it is only needed at the first launch.
After the first run, you can always create the TensorRT engine from the serialized file.

Is it possible that your performance score includes the model compiling time?


Hi, AastaLLL. I am not sure whether the performance score includes the model compiling time, since I directly used your sampleUffMNIST sample under TensorRT.
I just ran sampleUffMNIST; I think it has already created a TensorRT engine with the serialized file, so when I run sampleUffMNIST it won't create the TensorRT engine again. Am I right?

Thanks for your patient response.


sampleUffMNIST demonstrates how to create a TensorRT engine from a UFF file.
So it re-compiles the TensorRT engine from the MNIST UFF each time it runs.

To profile TensorRT, it's recommended to use our trtexec tool, located at /usr/src/tensorrt/bin/, instead.

sudo ./trtexec --uff=../data/mnist/lenet5.uff --uffInput=in,1,28,28 --output=out