Jetson AGX Xavier Deep Learning Inference Benchmarks

Hi all, we’ve published a comprehensive set of deep learning inference performance and energy efficiency benchmarks with Jetson AGX Xavier.

See here for the results — [b][url]https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks[/url][/b]

NVIDIA will continue improving performance with software optimizations and feature enhancements in future releases of JetPack.
The results above also include estimates of future performance that incorporate these improvements.

Note: data from TX1/TX2 is available for comparison here — [b][url]https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/[/url][/b]

We’ve posted a technical blog on the Jetson AGX Xavier architecture, including an in-depth analysis of the benchmarking results.

Check it out here — [b][url]https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/[/url][/b]

The inference benchmarks page simply says to run ./trtexec (with options), which is not very helpful. Unless I missed something, or perhaps it was different in previous JetPack versions (I’m using 4.2), I had to figure out on my own where this executable is located.

I found /usr/src/tensorrt/samples/trtexec and ran “sudo make”, without first checking whether there was already a /usr/src/tensorrt/bin/. The make complained about:

../Makefile.config:5: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:8: CUDNN_INSTALL_DIR variable is not specified, using $CUDA_INSTALL_DIR by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.

but it finished the compile, and the executable now appears in /usr/src/tensorrt/bin/. So I don’t know whether it compiled correctly.

It would be great if this were pre-compiled (linked against the correct CUDA and cuDNN install paths) and already on the PATH, so that “trtexec” could be run from anywhere in the terminal, like nvpmodel and jetson_clocks are (at least they are for me in 4.2).
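In the meantime, a one-line workaround is to add that directory to the PATH in ~/.bashrc (this assumes the binary really does live in /usr/src/tensorrt/bin, which is where it ended up for me on 4.2):

# append the TensorRT sample binaries to the shell search path
export PATH=$PATH:/usr/src/tensorrt/bin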

EDIT: actually it looks like it doesn’t matter, because every time I try to run ./trtexec from /usr/src/tensorrt/bin/, it always says

Could not open file XXXX
CaffeParser: Could not parse deploy file

where I’ve tried, for example, "../data/googlenet.prototxt". I can definitely see some files in /usr/src/tensorrt/data/, but it doesn’t seem to be working.

Would be great if the inference benchmark documentation could be updated for JetPack 4.2. Or maybe now it’s supposed to be run from the Python API? Again, documentation …

The trtexec binary does come pre-compiled. It’s located in /usr/src/tensorrt/bin.

For JetPack 4.2, the correct path would be "../data/googlenet/googlenet.prototxt". For example:

$ cd /usr/src/tensorrt/bin
$ ./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --iterations=1000 --output=prob

Thanks!

So have I screwed up the pre-compiled binary by running the makefile in samples/trtexec, because of the error messages it showed?

It would be great if the instructions on the webpage simply showed where you needed to navigate to in order to execute trtexec.

And in JetPack 4.2 the file structure has obviously changed, because GoogleNet is the only .prototxt file located in ../data; there isn’t anything for ResNet18 FCN, ResNet50, etc. that are shown on the web page.
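(For what it’s worth, this is how I checked; the path is just my JetPack 4.2 install, so adjust if yours differs:)

# list every Caffe deploy file shipped with the TensorRT samples
$ find /usr/src/tensorrt/data -name "*.prototxt"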

Better yet, I was really just trying to run something to verify my Jetson was functioning properly out of the box. It would be great if there were a “jetson_startup_test” that could be run from anywhere in the terminal and that exercised pre-compiled code testing CUDA, cuDNN, and TensorRT.
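Something as simple as the following three checks, wrapped in a script, would already cover it (the paths are just my guesses at the stock JetPack 4.2 layout):

# quick sanity checks for CUDA, cuDNN, and TensorRT
$ /usr/local/cuda/bin/nvcc --version
$ ls /usr/lib/aarch64-linux-gnu/libcudnn.so*
$ cd /usr/src/tensorrt/bin && ./trtexec --deploy=../data/googlenet/googlenet.prototxt --output=prob --iterations=10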

OK, so at least for running the GoogleNet, how do I tell if I’m meeting the benchmark? The only output from running the command is something like:

“Average over 1 runs is 4.03xxxxxx ms (host walltime is 4.06xxxx ms, 99% percentile time is 4.03xxxx ms)”

That’s roughly the lowest the time goes regardless of parameters, if the batch is 1. Running more iterations or avgRuns just makes it take longer to finish.

Does this mean my Jetson is running 4x slower than it’s supposed to? Is this supposed to be run from a terminal before the GUI boots up?

Thanks!

If it didn’t compile, then the binary wouldn’t have been overwritten, so you should be fine.

To get images per second, divide 1000 by the time reported there. You should launch trtexec to measure over many runs (like in the example commands). The first run is typically slow because the clock frequency governor needs time to spin up the clocks (or you can run ‘sudo jetson_clocks’ beforehand to disable the frequency governor).
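As a quick worked example with the number you quoted (batch size 1, 4.03 ms average):

# images per second ≈ batch_size * 1000 / average_latency_ms
$ python3 -c "print(1 * 1000 / 4.03)"    # roughly 248, i.e. ~248 images/sec at batch 1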

The benchmark numbers on the website are reported from the GPU (INT8) and the two DLAs (FP16) running concurrently. Hence those trtexec commands can be run at the same time and their images per second added together.
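For illustration, concurrent runs would look something like the following, one per terminal. The flag names are from my memory of the TensorRT 5.x trtexec and the batch sizes are just placeholders, so check ./trtexec --help and the benchmark page for the exact settings:

# terminal 1: GPU at INT8
$ ./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --int8 --batch=8 --iterations=1000 --output=prob
# terminal 2: DLA core 0 at FP16
$ ./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=2 --iterations=1000 --output=prob --useDLACore=0 --allowGPUFallback
# terminal 3: DLA core 1 at FP16
$ ./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=2 --iterations=1000 --output=prob --useDLACore=1 --allowGPUFallback

Sum the images per second from the three runs to compare against the published combined figure.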

Hi dusty_nv, thanks for the replies and continued patience!

It did compile, which was the worrisome part for me. Not so much for CUDA, because the fallback in the warning message was in fact correct, but for cuDNN: the cuDNN libraries are not installed under /usr/local/cuda; rather, they’re in /usr/lib/aarch64-linux-gnu/

So I simply added the following to the bottom of my .bashrc in home:

export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/lib/aarch64-linux-gnu

I recompiled trtexec, and also compiled sampleGoogleNet. Both compiled fine, no warnings. And I used sudo jetson_clocks before running anything in bin/.
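For anyone else who hits the same warnings: the same variables can also be passed directly on the make command line instead of through .bashrc (the variable names are the ones from the Makefile.config warnings; the paths are from my JetPack 4.2 install):

# rebuild trtexec with explicit CUDA/cuDNN locations
$ cd /usr/src/tensorrt/samples/trtexec
$ sudo make CUDA_INSTALL_DIR=/usr/local/cuda CUDNN_INSTALL_DIR=/usr/lib/aarch64-linux-gnu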

trtexec definitely shows better performance now, though oddly enough it seems to take much longer to print to the terminal. I’m still just using avgRuns=1 because I wanted it to print quickly. When I set it to INT8 with 10 iterations, the runs were now averaging more like 1.8-1.9 ms.

However, sample_googlenet does not seem to be working correctly, even though I’m giving it the correct path to googlenet.prototxt. Its output just says:

Building and running a GPU inference engine for GoogleNet
Ran ./sample_googlenet with 
Input(s): data 
Output(s): prob 
Done.

which is quite different from what is shown on this page: https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#googlenet_sample

Any final suggestions?

Hi leka004, I get that same output with JetPack 4.2. I think the sample must have been updated for TensorRT 5.1.5, which is the version that those online docs reference. The version of TensorRT in JetPack 4.2 is TensorRT 5.0.6.

Thanks! Perhaps JP 4.2.1 will have TRT 5.1.5?

Yes, it actually has TRT 5.1.6.

Great! So to get JP 4.2.1 onto my Jetson AGX Xavier, I will need to re-flash it from a host PC running Ubuntu 18.04 (by the way, this is an unfortunate requirement, as many of my Linux machines are already past that … I hope newer versions of SDKM will accept newer Ubuntu releases). And I’ll lose anything I’ve got on there now, so I should back up any code or programs? Thanks again!

That’s correct, you’ll need to back up anything you want to save. We have plans to move to OTA in-place upgrades in the future, but for now re-flashing for a new JetPack is still required.
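One simple approach is to copy your home directory over the network to the host PC before flashing; a sketch (using rsync, assuming it is installed; the hostname and destination path are placeholders):

# run on the Xavier before re-flashing; substitute your own host and path
$ rsync -avh ~/ user@host-pc:~/xavier-home-backup/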

Hi dusty_nv,

When is JetPack 4.2.1 going to be released?
Can you please share a tentative timeline?
We’re waiting for INT8 DLA support.

Hi BMohit, we are aiming to release it next week, stay tuned.

Hi dusty_nv,

We got the following differences in results for the networks below:

We used data type FP16, input shape (1, 3, 224, 224), and torch2trt.

1.) AlexNet (our throughput: 146 vs. 565 on the NVIDIA website)
2.) SqueezeNet 1.0 (our throughput: 111 vs. 121 on the NVIDIA website)
3.) SqueezeNet 1.1 (our throughput: 115 vs. 125 on the NVIDIA website)
4.) ResNet18 (our throughput: 349 vs. 722 on the NVIDIA website)
5.) ResNet34 (our throughput: 249 vs. 396 on the NVIDIA website)
6.) ResNet50 (our throughput: 227 vs. 326 on the NVIDIA website)
7.) ResNet101 (our throughput: 84.7 vs. 175 on the NVIDIA website)
8.) ResNet152 (our throughput: 122 vs. 122 on the NVIDIA website)
9.) DenseNet121 (our throughput: 162 vs. 76.6 on the NVIDIA website)

Setup details:

We are using a Jetson AGX Xavier with JetPack 4.2.2 and TensorRT 5.1.6.

Thanks,

Hi bajpai9, were you running your Xavier in MAX-N mode (sudo nvpmodel -m 0) and had you also run the jetson_clocks script beforehand?
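For reference, the sequence we recommend before benchmarking is roughly the following (nvpmodel -q just queries the active mode, if you want to double-check):

$ sudo nvpmodel -m 0    # select the MAX-N power mode
$ sudo jetson_clocks    # lock the clocks at their maximum frequencies
$ sudo nvpmodel -q      # confirm which mode is active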

I would open an issue on the torch2trt Issues tab so the maintainer of that project can respond, thanks. This thread is about the official benchmarks.