Hello AI World - now supports Python and onboard training with PyTorch!

If you are using jetson-inference, it will automatically resize the incoming image to the resolution that the network expects (in this case, 224x224).

It will also perform other pre-processing for you, like putting the image into CHW layout, mapping pixel values from 0-255 → 0-1.0 (or whatever the expected pixel range is), and applying mean pixel subtraction and, where the model expects it, standard deviation normalization.
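As a rough illustration of those steps (not the library's actual CUDA code), the sketch below turns a tiny HWC image of 0-255 RGB pixels into a normalized CHW tensor. The function name `preprocess_hwc_to_chw` is hypothetical, the default mean/std are the ImageNet values commonly used with torchvision models, and the resize step is left out to keep it short.

```python
from typing import List, Sequence

Pixel = Sequence[int]            # (R, G, B), each 0-255
ImageHWC = List[List[Pixel]]     # rows -> columns -> pixel


def preprocess_hwc_to_chw(
    image: ImageHWC,
    mean: Sequence[float] = (0.485, 0.456, 0.406),
    std: Sequence[float] = (0.229, 0.224, 0.225),
) -> List[List[List[float]]]:
    """Return a CHW tensor: scale to 0-1, subtract mean, divide by std."""
    channels = len(mean)
    return [
        [
            [(pixel[c] / 255.0 - mean[c]) / std[c] for pixel in row]
            for row in image
        ]
        for c in range(channels)
    ]


# A 1x2 image: one black pixel and one white pixel.
tiny = [[(0, 0, 0), (255, 255, 255)]]
chw = preprocess_hwc_to_chw(tiny)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width -> 3 1 2
```

If the network was trained with different constants, the same function applies with other `mean` and `std` arguments.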

I tried setting up a basic PyTorch MNIST example (28x28 images) and exported the model to ONNX like this:

torch.save(model.state_dict(), 'mnist.pt')
dummy_input = torch.randn(1, 1, 28, 28).to(device)
torch.onnx.export(model, dummy_input, 'mnist.onnx', verbose=True)

Then I tried running this command line:

python3 imagenet-camera.py --model=/home/jetson/Notebooks/mnist.onnx --input_blob=input_0 --output_blob=output_0 --labels=/home/jetson/Notebooks/mnist_labels.txt

It gives me this error:

jetson.inference -- initializing Python 3.6 bindings...
jetson.inference -- registering module types...
jetson.inference -- done registering module types
jetson.inference -- done Python 3.6 binding initialization
jetson.utils -- initializing Python 3.6 bindings...
jetson.utils -- registering module functions...
jetson.utils -- done registering module functions
jetson.utils -- registering module types...
jetson.utils -- done registering module types
jetson.utils -- done Python 3.6 binding initialization
jetson.inference -- PyTensorNet_New()
jetson.inference -- PyImageNet_Init()
jetson.inference -- imageNet loading network using argv command line params
jetson.inference -- imageNet.__init__() argv[0] = 'imagenet-camera.py'
jetson.inference -- imageNet.__init__() argv[1] = '--model=/home/jetson/Notebooks/mnist.onnx'
jetson.inference -- imageNet.__init__() argv[2] = '--input_blob=input_0'
jetson.inference -- imageNet.__init__() argv[3] = '--output_blob=output_0'
jetson.inference -- imageNet.__init__() argv[4] = '--labels=/home/jetson/Notebooks/mnist_labels.txt'

imageNet -- loading classification network model from:
         -- prototxt     (null)
         -- model        /home/jetson/Notebooks/mnist.onnx
         -- class_labels /home/jetson/Notebooks/mnist_labels.txt
         -- input_blob   'input_0'
         -- output_blob  'output_0'
         -- batch_size   1

[TRT]   TensorRT version 5.1.6
[TRT]   loading NVIDIA plugins...
[TRT]   Plugin Creator registration succeeded - GridAnchor_TRT
[TRT]   Plugin Creator registration succeeded - NMS_TRT
[TRT]   Plugin Creator registration succeeded - Reorg_TRT
[TRT]   Plugin Creator registration succeeded - Region_TRT
[TRT]   Plugin Creator registration succeeded - Clip_TRT
[TRT]   Plugin Creator registration succeeded - LReLU_TRT
[TRT]   Plugin Creator registration succeeded - PriorBox_TRT
[TRT]   Plugin Creator registration succeeded - Normalize_TRT
[TRT]   Plugin Creator registration succeeded - RPROI_TRT
[TRT]   Plugin Creator registration succeeded - BatchedNMS_TRT
[TRT]   completed loading NVIDIA plugins.
[TRT]   detected model format - ONNX  (extension '.onnx')
[TRT]   desired precision specified for GPU: FASTEST
[TRT]   requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT]   native precisions detected for GPU:  FP32, FP16
[TRT]   selecting fastest native precision for GPU:  FP16
[TRT]   attempting to open engine cache file /home/jetson/Notebooks/mnist.onnx.1.1.GPU.FP16.engine
[TRT]   loading network profile from engine cache... /home/jetson/Notebooks/mnist.onnx.1.1.GPU.FP16.engine
[TRT]   device GPU, /home/jetson/Notebooks/mnist.onnx loaded
[TRT]   device GPU, CUDA engine context initialized with 2 bindings
[TRT]   binding -- index   0
               -- name    'input.1'
               -- type    FP32
               -- in/out  INPUT
               -- # dims  3
               -- dim #0  1 (CHANNEL)
               -- dim #1  28 (SPATIAL)
               -- dim #2  28 (SPATIAL)
[TRT]   binding -- index   1
               -- name    '31'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  1
[TRT]   warning -- unknown nvinfer1::DimensionType (127)
               -- dim #0  10 (UNKNOWN)
[TRT]   binding to input 0 input_0  binding index:  -1
[TRT]   binding to input 0 input_0  dims (b=1 c=0 h=0 w=0) size=0
[TRT]   failed to alloc CUDA mapped memory for tensor input, 0 bytes
[TRT]   failed to load /home/jetson/Notebooks/mnist.onnx
[TRT]   imageNet -- failed to initialize.
jetson.inference -- imageNet failed to load built-in network 'googlenet'
Traceback (most recent call last):
  File "imagenet-camera.py", line 47, in <module>
    net = jetson.inference.imageNet(opt.network, sys.argv)
Exception: jetson.inference -- imageNet failed to load network

Hi tiscone, I think the problem may be that you are not specifying the input/output layer names to torch.onnx.export. In your imagenet-camera log, TensorRT reports the binding names as “input.1” and “31”, so they haven’t been set.

So you could either try changing the arguments to imagenet-camera to --input_blob=input.1 --output_blob=31, or re-export the ONNX model like this:

torch.save(model.state_dict(), 'mnist.pt')
dummy_input = torch.randn(1, 1, 28, 28).to(device)

input_names = [ "input_0" ]
output_names = [ "output_0" ]

torch.onnx.export(model, dummy_input, 'mnist.onnx', verbose=True, input_names=input_names, output_names=output_names)
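To make the failure mode concrete, here is a minimal sketch of a name lookup that returns -1 when the requested blob name is not among the engine's bindings, which is what the "binding index: -1" line in the log suggests happened. `find_binding` is a hypothetical helper for illustration, not part of jetson-inference or TensorRT.

```python
from typing import Sequence


def find_binding(bindings: Sequence[str], name: str) -> int:
    """Return the index of `name` in `bindings`, or -1 if it is absent."""
    try:
        return list(bindings).index(name)
    except ValueError:
        return -1


# Binding names as reported in the log above (export without explicit names).
unnamed_export = ["input.1", "31"]
# Binding names expected after re-exporting with input_names/output_names.
named_export = ["input_0", "output_0"]

print(find_binding(unnamed_export, "input_0"))  # -1
print(find_binding(named_export, "input_0"))    # 0
```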

I forgot about those layer names it wants. I added them like you said and all the errors go away, but now I’m not sure what it’s doing.

All the code I’m using to train and generate the ONNX on the Jetson Nano is in this gist, if you want to play along: https://gist.github.com/jtiscione/2f7f614aba64d92354d674fcfd5e4305

It takes less than an hour to generate the mnist.onnx file. Once I have it I combine it with a ten-line file (“zero”, “one”, “two”… “nine”) called mnist_labels.txt, and I run imagenet-camera.py basically like I said:

python3 imagenet-camera.py --model=mnist.onnx --input_blob=input_0 --output_blob=output_0 --labels=mnist_labels.txt

It pops up a window showing my face and saying it’s 100% positive I’m a “two”. (There are no “tens” in the training data, see.)

I don’t know if this would be expected to work, but I grabbed a bunch of index cards and wrote digits with a black Sharpie on them. If I hold them up to the camera, it’s either 100% sure of a two or (once in a while) 100% sure of an eight. The lighting in the room isn’t that great but I’m not sure what’s going on.

When you trained your MNIST model in PyTorch, what was the reported accuracy? It would be good to add support in your training script for evaluating on a test set. Right now it appears to report accuracy over the training set, whereas you would typically validate on a separate set that the model wasn’t directly trained on.

You could then capture images from your camera of the index cards being held up, and see whether your PyTorch script classifies them correctly in evaluation/test mode. That would tell you whether your model is working well enough before trying it in TensorRT. Typically, if the accuracy isn’t high enough, the fix involves adding more images to the training set that are representative of the scenario you are using for inferencing.

At the end of training it was getting 98.76% accuracy on the training set; afterwards it checked the accuracy of the test set (see line 158) and got 98.82%.

That’s strictly in a 28x28 domain, though. Probably a better idea is preparing an image by hand and seeing what imagenet-console does with it.

I’m still confused about cropping; if the model is trained against a 28x28 image, how does this code present a 1280x720 image to it? Does it directly scale the image to 28x28 or does it crop 560 pixels off the edges first to match the aspect ratio of the model? Maybe I should be specifying identical width and height values as arguments?

(This is an unrelated problem, but no matter what width and height you specify, imagenet-camera opens a 3840x2160 black window with the image in the top left corner. If you try to adjust the size, the image gets cut off, because you can only reach the top edge of the window with your mouse.)

The jetson-inference library will downscale the input image directly to the resolution that the network expects (in this case 28x28). You might want to try the imagenet-console program on some test images at various resolutions first, before moving on to the camera app. That could allow you to identify if the pre-processing is an issue.
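Since the frame is scaled directly, a 1280x720 image is squashed horizontally on its way to 28x28. If you would rather preserve the aspect ratio, one option is to crop a centered square yourself before handing the image over. The sketch below only computes the crop rectangle; `center_square_crop` is a hypothetical helper, and how you apply the crop depends on your image type.

```python
from typing import Tuple


def center_square_crop(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# 1280 - 720 = 560 pixels removed in total, 280 from each side.
print(center_square_crop(1280, 720))  # (280, 0, 1000, 720)
```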

I can’t get it to work with even a simple MNIST model. If I draw a 28x28 image by hand of a digit and present it to imagenet-console.py, it reports it correctly, but anything else (like a 40x28 image) gets misclassified as “three” with 100% certainty, which makes no sense. I guess it must be looking at the 28x28 region in the top left corner or something.

It is downsampling to 28x28 with nearest-neighbor sampling. You can find the pre-processing code in jetson-inference/c/imageNet.cu if you require a different method for your model.
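For intuition about what nearest-neighbor sampling does to thin strokes, here is a small pure-Python sketch. The index formula is an assumption about how nearest-neighbor resizing is typically written, not a copy of the code in imageNet.cu.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def downsample_nearest(image: Gray, out_w: int, out_h: int) -> Gray:
    """Nearest-neighbor resize: each output pixel copies one source pixel."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A 6x6 white image with a one-pixel-wide black vertical stroke at column 1.
src = [[0 if x == 1 else 255 for x in range(6)] for _ in range(6)]
small = downsample_nearest(src, 3, 3)

# Only source columns 0, 2 and 4 are sampled, so the stroke disappears.
print(small)  # [[255, 255, 255], [255, 255, 255], [255, 255, 255]]
```

Because no averaging takes place, a stroke that is narrow relative to the scale factor can be skipped entirely, so hand-drawn digits at high resolution may need thick lines to survive the trip down to 28x28.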

What are the width and height arguments for in imagenet-console? The documentation explains them as “the width” and “the height” which is kind of obvious, but it doesn’t say what they’re the width or height of, the network or the image.

I have a 28x28 MNIST network analyzing 280x280 pictures (I just drew the digits by hand). I don’t know if I’m supposed to specify --width=28 --height=28, or --width=280 --height=280, or just leave them off.

I made three scripts (using different width/height arguments) and tested each one of them. They all produce identical output, so it looks like the width and height arguments to imagenet-console.py are just ignored.

So I’m using the command ./jetson-inference/build/aarch64/bin/imagenet-console.py --model=./mnist.onnx --labels=./mnist_labels.txt --input_blob=input_0 --output_blob=output_0 ./five.jpg ./five-classified.jpg

It classifies all digits correctly except for 0, 2, and 9, which are all classified as “three”.

It also reports 100% confidence on 0, 1, 2, 3, 4, 5, 6, 7, and 8. For 9, it’s only 99.999988% certain that the 9 is “three”, which looks like roundoff error from 100%. How can it be 100% certain for any input? It shouldn’t be reporting 100% certainty for anything.

When I trained the model with PyTorch, it only reported 98.82% accuracy on the test set (see gist above). The model’s output tensor is always a ten-element array of values between 0 and 1. There are no 1.000 values.

I suspected it might be a softmax problem at the output, but I checked and I did incorporate a final softmax layer into the model. So I’m puzzled as to what’s going on here.
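One way to see how a softmax output can read as exactly 100% is to look at what happens when one logit is far above the rest and the result is stored as a 32-bit float. The sketch below is only suggestive: the logit values are made up, and it does not show what the MNIST model actually produced. It also shows that 99.999988% is what a value one float32 step (2^-23) below 1.0 looks like when printed.

```python
import math
import struct
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# One logit far above the rest: the top probability rounds to 1.0 in float32.
probs = softmax([40.0] + [0.0] * 9)
print(to_float32(probs[0]) == 1.0)  # True

# A value 2**-23 below 1.0, printed as a percentage.
print("%.6f%%" % (to_float32(1.0 - 2.0 ** -23) * 100.0))  # 99.999988%
```

So a reported 100% does not have to mean the math is broken; it can mean the inputs pushed one logit far above the others, which is also what mismatched pre-processing would tend to do.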

imagenet-console doesn’t have width and height arguments, but imagenet-camera does, and they set the resolution of the camera feed. They do not change the resolution of the network itself; that is dictated by the model that is loaded.

I’ve seen near-certain output before on GoogLeNet/AlexNet/ResNet models, so it doesn’t seem indicative of an error. My guess is your issue has something to do with the pre-processing applied.

I recall MNIST uses different mean values applied for mean pixel subtraction than the ImageNet dataset values used in the code:


Check if your PyTorch training script for MNIST uses different mean and standard deviation values for the pre-processing, and if needed change them to match in the code above. Remember to re-run ‘make’ followed by ‘sudo make install’ if you make changes.
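To get a feel for how much the choice of constants matters, the sketch below normalizes a black and a white pixel with two sets of values: the ImageNet red-channel mean/std (0.485, 0.229) and the mean/std used in the official PyTorch MNIST example (0.1307, 0.3081). Whether the training script in this thread used those MNIST values is an assumption; check the script's transforms to be sure.

```python
def normalize(value_0_255: int, mean: float, std: float) -> float:
    """Map a 0-255 pixel value to (x / 255 - mean) / std."""
    return (value_0_255 / 255.0 - mean) / std


# (mean, std) pairs. The MNIST pair is an assumption about the training script.
IMAGENET_R = (0.485, 0.229)
MNIST = (0.1307, 0.3081)

for name, (mean, std) in (("imagenet", IMAGENET_R), ("mnist", MNIST)):
    black = normalize(0, mean, std)
    white = normalize(255, mean, std)
    print(f"{name}: black={black:.4f} white={white:.4f}")
```

A black pixel lands near -2.12 with the ImageNet constants but near -0.42 with the MNIST ones, so a network trained on one range sees quite different inputs when fed the other.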

I’m squinting at the calls to cudaPreImageNetNormMeanRGB() in the ONNX code block (line 414) and cudaPreImageNetMeanBGR() in the imageNet code block below it (line 427). There are a lot of hard-coded parameters being passed in but I can’t figure out what they are. I don’t see what mean and standard deviation it’s expecting for ImageNet.

The mean is on line 416 and the standard deviation is on line 417. These are the same constants used in the PyTorch ImageNet training script here:


Hi all, we’ve just posted a screencast tutorial for Hello AI World - check it out!

Realtime Object Detection in 10 Lines of Python Code on Jetson Nano

Hello Dusty,
Is there a way that I could run this on jupyter notebook?
I would like to see if this has better performance with headless mode.
Thank you!

Hi Luis, if you didn’t need the live video stream, it should be easy to run it within Jupyter notebook. To that end, you could run it in headless mode over SSH too. To view the camera video in Jupyter notebook, you could see the JetBot or JetRacer projects to see how they do it.

It runs OK in Jupyter. To see what’s in images, I’m converting to numpy and using matplotlib.

This is a test of coco-dog:


Actually I have a question, if you look at that notebook. When the things that detectnet is looking for are partially cut off by the edges of the image, it produces bounding boxes with some corners outside the image bounds. Is that its guess of where it would place the box if the image were simply wider?

I’m setting a min of zero and a max of width or height when I’m extracting the clip, but I don’t think that’s advisable; my goal here is to feed each clipped region to another (imagenet) classifier which has been trained against bounding-boxed images, and it would get confused by a truncated box.

I think the better route might be to toss the box entirely if it’s reported as out-of-bounds. The other alternative would be to pad it with zeroes, and include some partially zeroed-out images in the classifier’s training set.
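The two simpler options, clamping and discarding, can be sketched as follows. `Box`, `clamp` and `keep_or_drop` are hypothetical names for illustration (not the detectNet API), and the 1280x720 image size is an assumption.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def is_inside(box: Box, width: int, height: int) -> bool:
    """True if the whole box lies within the image bounds."""
    return (box.left >= 0 and box.top >= 0
            and box.right <= width and box.bottom <= height)


def clamp(box: Box, width: int, height: int) -> Box:
    """Clip the box to the image bounds (this truncates the object)."""
    return Box(max(box.left, 0.0), max(box.top, 0.0),
               min(box.right, float(width)), min(box.bottom, float(height)))


def keep_or_drop(box: Box, width: int, height: int) -> Optional[Box]:
    """Discard any detection that extends past the image edges."""
    return box if is_inside(box, width, height) else None


# A box hanging off the left edge of an assumed 1280x720 image.
box = Box(-59.8125, 22.2188, 163.375, 273.516)
print(clamp(box, 1280, 720))         # left becomes 0.0, the rest is unchanged
print(keep_or_drop(box, 1280, 720))  # None
```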

If you print out the detections when this occurs, are the coordinates negative?

Yes, check out the notebook, it prints them out:

<detectNet.Detection object>
   -- ClassID: 0
   -- Confidence: 0.650689
   -- Left: 761.688
   -- Top: -30.375
   -- Right: 1009.12
   -- Bottom: 204.891
   -- Width: 247.438
   -- Height: 235.266
   -- Area: 58213.5
   -- Center: (885.406, 87.2578)
<detectNet.Detection object>
   -- ClassID: 0
   -- Confidence: 0.882428
   -- Left: 358.562
   -- Top: -11.3203
   -- Right: 561.375
   -- Bottom: 243.422
   -- Width: 202.812
   -- Height: 254.742
   -- Area: 51664.9
   -- Center: (459.969, 116.051)
<detectNet.Detection object>
   -- ClassID: 0
   -- Confidence: 0.526221
   -- Left: -59.8125
   -- Top: 22.2188
   -- Right: 163.375
   -- Bottom: 273.516
   -- Width: 223.188
   -- Height: 251.297
   -- Area: 56086.3
   -- Center: (51.7812, 147.867)

Because of the clumsy way I took the picture, the regions are hitting the top left, so you see negative numbers. But Right > width and Bottom > height are also conditions to watch out for.
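If you go the zero-padding route instead, a crop that handles all four out-of-bounds conditions could look like the sketch below. It uses a nested-list grayscale image and integer coordinates to stay self-contained; `crop_with_zero_padding` is a hypothetical helper, and real detections would need their float coordinates rounded first.

```python
from typing import List

Gray = List[List[int]]


def crop_with_zero_padding(image: Gray, left: int, top: int,
                           right: int, bottom: int) -> Gray:
    """Extract rows top..bottom-1 and columns left..right-1, filling
    any pixel that falls outside the image with 0."""
    height, width = len(image), len(image[0])
    return [
        [
            image[y][x] if 0 <= y < height and 0 <= x < width else 0
            for x in range(left, right)
        ]
        for y in range(top, bottom)
    ]


# 3x3 image of ones; the box sticks out one pixel past the top-left corner.
img = [[1, 1, 1] for _ in range(3)]
print(crop_with_zero_padding(img, -1, -1, 2, 2))
# [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
```

The crop keeps the size the detector reported, so the downstream classifier always sees a full-sized box, with zeros where the image ran out.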