Jetson AI Lab - ML DevOps, Containers, Core Inferencing

JETSON AI LAB RESEARCH GROUP

  • Project - ML DevOps, Containers, Core Inferencing
  • Team Leads - @dusty_nv, Mieszko Syty

This thread is to serve as updates and feedback for the jetson-containers build system, which provides the GPU-enabled packages for running ML/AI inference locally on Jetson. These containers not only power Jetson AI Lab, but are leveraged by many users across the ecosystem. Given the complexity, scale, and pace of modern ML/AI tools in the open-source community, ML DevOps is an important topic that underpins the rest of our development and the deployability of these systems into the field.

1 Like

4/2/24 - Rebuild the Jetson AI/ML stack for latest CUDA, Python, etc.

  • NVIDIA has started providing CUDA, cuDNN (and soon TensorRT) installers for Jetson that are downloadable through our website and independent from the versions that ship with JetPack. There were many architectural enhancements in L4T and JetPack 6.0 to enable these out-of-band updates (such as upstreaming Orin SoC support into the mainline Linux kernel), ultimately enabling users to keep up to date with the latest CUDA versions and transitively leverage enhancements in the downstream packages that build on them (such as PyTorch).

  • Upon installing a newer CUDA version (or Python version), the entire downstream stack also needs to be rebuilt - which is tedious, time-consuming, and error-prone to do manually. For example, if you install the latest CUDA 12.4 and cuDNN 9.0 on top of JetPack 6.0 (which ships with CUDA 12.2 and cuDNN 8.9), PyTorch and everything after it needs to be recompiled against that new version of CUDA/cuDNN. And sometimes you may need to build a development branch of these packages where support for the newer CUDA is WIP (for example, PyTorch 2.2 won’t build against cuDNN 9.0, but PyTorch 2.3-rc does).

  • jetson-containers has been enhanced so that it will automatically rebuild all needed packages against the desired CUDA version, resolving dependencies so that the correct versions of downstream packages are selected (for example, the system knows to use PyTorch 2.3 when CUDA 12.4 is selected). This behavior can be enabled by setting environment variables like CUDA_VERSION, CUDNN_VERSION, and PYTHON_VERSION:

    CUDA_VERSION=12.4 PYTHON_VERSION=3.11 \
    jetson-containers/build.sh --name my_cu124_container pytorch torchvision transformers ...
    

    New CUDA/cuDNN/TensorRT package versions still need to be defined in the container configurations (like here for CUDA, so it knows what installers to download). You can also set the default PYTORCH_VERSION like this, and it will do the reverse and select the correct CUDA version for you.
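As an illustration of that two-way version pairing, here is a hypothetical sketch. The mapping below only reflects the examples mentioned in this post (JetPack 6.0's CUDA 12.2 stack vs. CUDA 12.4 with PyTorch 2.3) and is not the actual jetson-containers resolution logic, which is driven by its package configurations:

```python
# Hypothetical sketch of the two-way CUDA <-> PyTorch version pairing.
# The pairs below come from the examples in this post, not from the
# real jetson-containers configuration files.
CUDA_TO_PYTORCH = {
    "12.2": "2.2",  # JetPack 6.0 default stack (assumed pairing)
    "12.4": "2.3",  # newer CUDA selects the newer PyTorch
}

def resolve_pytorch(cuda_version: str) -> str:
    """Selecting CUDA_VERSION picks the compatible PyTorch to build."""
    return CUDA_TO_PYTORCH[cuda_version]

def resolve_cuda(pytorch_version: str) -> str:
    """Pinning PYTORCH_VERSION does the reverse and picks the CUDA version."""
    for cuda, torch in CUDA_TO_PYTORCH.items():
        if torch == pytorch_version:
            return cuda
    raise ValueError(f"no known CUDA pairing for PyTorch {pytorch_version}")
```

So setting PYTORCH_VERSION=2.3 would map back to CUDA 12.4, and vice versa.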

4/2/24 - Pip wheel server cache

  • In connection with the points above, the number of container combinations grows combinatorially, which impacts the build system’s ability to automatically redistribute binaries so that not every user spends hours or days recompiling packages that are often tricky to build correctly.

  • jetson-containers now caches the pip wheels that it builds on a custom pip server, which is used not only to install these packages into the deployment containers, but can also be used by any Jetson user to install these packages natively, even outside of a container.

  • A prototype version of this pip server is running at http://jetson.webredirect.org/ , with wheels available for multiple CUDA versions dating back to JetPack 4.6 and CUDA 10.2. This index is automatically populated by the build farm of Jetsons that I run locally.

  • You can have pip install these CUDA-enabled packages by setting --index-url or $PIP_INDEX_URL to your desired CUDA version (or by setting it persistently in your user’s pip.conf file). For now, --trusted-host or $PIP_TRUSTED_HOST also needs to be set:

    export PIP_INDEX_URL=http://jetson.webredirect.org/jp6/cu122
    export PIP_TRUSTED_HOST=jetson.webredirect.org
    
    pip3 install torch torchvision torchaudio  # no more compiling of torchvision/torchaudio needed :)
    pip3 install transformers   # the correct PyTorch (with CUDA) will automatically be installed
    
  • This custom pip server mirrors the upstream PyPI server, so packages that aren’t in it will automatically be pulled from PyPI. However, it shadows the packages that jetson-containers builds with CUDA, so that when installing packages that depend on these CUDA-enabled packages (like how Transformers depends on PyTorch, but Transformers itself doesn’t require CUDA compilation), the correct version of that CUDA-enabled package is installed from our Jetson-specific index.

  • Before using anything that depends on PyTorch, run sudo apt-get install libopenblas-dev libopenmpi-dev, because the PyTorch wheels are built with USE_DISTRIBUTED=on (so that Jetson is able to run upstream PyTorch code that references the torch.distributed module - a common occurrence in open-source AI/ML projects, even when running only one Jetson ‘node’)
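For the persistent pip.conf option mentioned above, a minimal sketch might look like this (standard pip configuration keys; the index path assumes the JetPack 6 / CUDA 12.2 wheels from the example):

```ini
# ~/.config/pip/pip.conf (user-level; /etc/pip.conf for system-wide)
[global]
index-url = http://jetson.webredirect.org/jp6/cu122
trusted-host = jetson.webredirect.org
```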

2 Likes

4/2/24 - local_llm migration to NanoLLM project

  • The local_llm container provided support for the most optimized LLM inferencing APIs (such as MLC/TVM), in addition to many of the advanced demos and multimodal agents on Jetson AI Lab (such as llamaspeak and Live Llava).

  • However, the size of its codebase and the need for more detailed documentation outgrew being hosted directly inside jetson-containers, so its source was moved to https://github.com/dusty-nv/NanoLLM (and, now that it supports SLMs and mini-VLMs, it was renamed to NanoLLM for consistency with our other libraries for Nano, like NanoOWL, NanoSAM, NanoDB, etc.)

  • jetson-containers still provides the dockerfiles and container builds for NanoLLM, and the code/containers for the legacy local_llm will remain up for a while as the roll-out of NanoLLM progresses (although local_llm is now deprecated, and any new features/fixes will go into NanoLLM)

  • There is also improved documentation, API references, and examples for local_llm / NanoLLM now that it’s transitioning out of its experimental phase. Find the docs here: dusty-nv.github.io/NanoLLM

1 Like

4/2/24 - TensorRT-LLM support on Jetson

  • As detailed in the posts above, now that we have access to the latest CUDA and the ability to rebuild all the other downstream packages we need, we may be able to build mainline TensorRT-LLM (hopefully without much patching required). This is an ongoing effort in coordination with the TensorRT team that we are excited about, as it would provide edge-to-cloud compatibility with other NVIDIA production workflows, NeMo Megatron models, and deploying NIM microservices to the edge.

  • TensorRT-LLM will be integrated into NanoLLM as another API backend, in addition to MLC. MLC/TVM already achieves greater than 95% of peak Orin performance/efficiency on Llama (as shown in the Benchmarks on Jetson AI Lab), so performance-wise we’re already in a great place - however, TensorRT-LLM will be good to have for the aforementioned compatibility reasons and production-grade support. For now, continue using the NanoLLM APIs to ensure a seamless transition to TensorRT-LLM once it’s enabled, and to gain all the support for multimodality and I/O streaming in NanoLLM.

  • All of this regarding TensorRT-LLM is subject to change depending on the outcomes of these ongoing engineering efforts. Once TensorRT 10 becomes available for Jetson, I will begin work on compiling the latest TensorRT-LLM for Jetson against CUDA 12.4 and TRT10. Assuming success, binaries can then be provided through jetson-containers and the pip server, and further integration work with NanoLLM and other projects can proceed.

Hi everyone 👋

@dusty_nv, I must say, your work is fantastic!

I’ve been tinkering with my own git & CI/CD MLOps setup at home, constantly refining jetson-containers locally to achieve similar architecture. That’s why you’ve seen a flurry of PRs from me lately!

I’ve already begun testing the dev branch on my device and starting tomorrow, I’ll have even more time to delve deeper. Already, I’ve identified areas where we can make improvements and fix some broken Dockerfiles.

The mention of TensorRT-LLM support on Jetson is incredibly exciting. I’ve been dabbling with it myself here, but the idea of having pre-built wheels instead of building from scratch sounds like a lot more fun! 😄

1 Like

Thank you @narandill for all your involvement and PRs! I appreciate having someone else to work with on this stuff - I just merged the jetson-containers dev branch into master (so the pip cache is now live in the system)

The next two weeks I’ll be heads-down on hopefully updating our previous TRT-LLM attempts to the latest code, preparing for JetPack 6.0 GA, finishing the migration of local_llm → NanoLLM, and finding time to integrate containers for ollama and streaming Piper TTS (the ollama one sounds like it should be straightforward now, if anyone wants to take a crack at it)

Hi @narandill, were you actually able to run the TensorRT-LLM container you built against TensorRT 9.3.0.1 SBSA? I am not able to run tensorrt-9.3.0.1.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz; it hangs whenever TRT is used, because it was built for ARM64 SBSA (not Jetson)

Last night I managed to get TensorRT-LLM main mostly compiling against CUDA 12.4, but it needed patches and failed with linker errors at the end. For CUDA 12.2 I don’t have a compatible TRT9, and the TRT-LLM repo has not been officially updated for CUDA 12.4 yet - I am checking about it internally, but we might have to wait a bit longer. Regardless, I have merged both of our dockerfile approaches in commit c05d51b

One cool thing to come from this experiment is in commit 6d78a9: I have adapted the 'requires' package configuration so that it can specify not only the required L4T version, but the CUDA and Python versions too - so you start seeing things like requires=['==r36.*', '>=cu124'] to apply the correct constraints.
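As a rough sketch of how such constraints might be evaluated (a hypothetical helper for illustration only - the actual jetson-containers implementation differs - assuming '==' entries are glob matches on the L4T version and '>=' compares the numeric part of a CUDA tag):

```python
import fnmatch
import re

def check_requirement(spec: str, value: str) -> bool:
    """Evaluate one hypothetical 'requires' entry against a platform value."""
    if spec.startswith("=="):
        # glob match, e.g. '==r36.*' against the L4T version 'r36.2.0'
        return fnmatch.fnmatch(value, spec[2:])
    if spec.startswith(">="):
        # numeric comparison on the trailing digits, e.g. '>=cu124' vs 'cu122'
        want = int(re.sub(r"\D", "", spec[2:]))
        have = int(re.sub(r"\D", "", value))
        return have >= want
    raise ValueError(f"unsupported constraint: {spec}")

# e.g. requires=['==r36.*', '>=cu124'] on an L4T r36.2.0 / CUDA 12.4 system:
satisfied = (check_requirement("==r36.*", "r36.2.0")
             and check_requirement(">=cu124", "cu124"))
```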

Hi @dusty_nv, I just tried to run the basic cpp & python test benchmarks from the tensorrt-llm repo. TensorRT 9.3.0.1 is installed using the tensorrt-llm install script, with small changes on my side to adjust it for Jetson.

The logs from the tensorrt-llm container are gone from my local environment, but if it would be helpful, I could run the build one more time to check whether it works, @dusty_nv?

It’s a really nice thing you did with 'requires' :) Nice direction.

Did those python/cpp benchmarks complete successfully and build the engine? Because anytime I try to run trtexec from TRT 9.3.0.1 (the same one that the install script downloads), it hangs on any test onnx model right when it starts profiling kernels (which isn’t a surprise, because it was built for SBSA). But when I switch to the non-SBSA TRT, trtexec runs fine.

Another thing related to ‘requires’: when you set CUDA_VERSION now, it will automatically append that version to the container tag (like tensorrt_llm:0.9.dev-r36.2.0-cu124) to help keep them all straight
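The tag naming described above could be sketched like this (hypothetical helper; the real logic lives in jetson-containers' build scripts):

```python
from typing import Optional

def container_tag(package: str, version: str, l4t: str,
                  cuda: Optional[str] = None) -> str:
    """Compose an image tag like 'tensorrt_llm:0.9.dev-r36.2.0-cu124'.

    The CUDA suffix is only appended when CUDA_VERSION was overridden,
    mirroring the behavior described above.
    """
    tag = f"{package}:{version}-{l4t}"
    if cuda:
        tag += f"-{cuda}"
    return tag
```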

Yes, all tests were running fine. What is important is to run the python tests first, because they generate the aforementioned engine for later use by the cpp tests, so in the container Dockerfile the order should be as follows:

# test: [test_python_benchmark.sh, test_cpp_benchmark.sh]

@dusty_nv I’m running a fresh tensorrt-llm build from my Dockerfile; if you can hold on for a bit, I will post the logs here when it’s done.

I like that consistency with the new ‘requires’. This could potentially give us more flexibility :)

OK, so weird - if your rebuild completes and still works, I will then clone your fork directly and build it here (and then, assuming that yours works on my Jetson, dig into what’s different). Also, if you have a dockerhub account and could push that image, that would be helpful for comparison purposes.

If you could also try running trtexec in your rebuilt TRT-LLM container on an ONNX model (this resnet-18 checkpoint is the one I’ve been using, but any should suffice with trtexec --onnx=model.onnx), that would help determine whether the trtexec sanity check fails but TRT-LLM somehow still works 😂🤯

For now I’m gonna shift gears back to merging updates for ollama, onnxruntime, and piper.

Ok @dusty_nv … so I deep dived into this topic and now I’m not so sure about my “successful” attempts 🫣

Built the container, and the tests got stuck :D Now I wonder if I just skipped the tests last time by building with --skip-tests 🤔

If you could also try running trtexec in your rebuilt TRT-LLM container on an ONNX model

Did that without success; it’s stuck at this:

trtexec output
$ docker run -it ms1design/tensorrt_llm:r36.2.0-tensorrt_llm /bin/bash
...
root@75bc0fad1c1c:/test# python3 -c "import tensorrt; print('TensorRT version:', tensorrt.__version__)"
TensorRT version: 9.3.0.post12.dev1
root@75bc0fad1c1c:/test# /usr/local/tensorrt/bin/trtexec --onnx=model.onnx                             
&&&& RUNNING TensorRT.trtexec [TensorRT v9300] # /usr/local/tensorrt/bin/trtexec --onnx=model.onnx
[04/05/2024-19:37:58] [I] === Model Options ===
[04/05/2024-19:37:58] [I] Format: ONNX
[04/05/2024-19:37:58] [I] Model: model.onnx
[04/05/2024-19:37:58] [I] Output:
[04/05/2024-19:37:58] [I] === Build Options ===
[04/05/2024-19:37:58] [I] Max batch: explicit batch
[04/05/2024-19:37:58] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[04/05/2024-19:37:58] [I] minTiming: 1
[04/05/2024-19:37:58] [I] avgTiming: 8
[04/05/2024-19:37:58] [I] Precision: FP32
[04/05/2024-19:37:58] [I] LayerPrecisions: 
[04/05/2024-19:37:58] [I] Layer Device Types: 
[04/05/2024-19:37:58] [I] Calibration: 
[04/05/2024-19:37:58] [I] Refit: Disabled
[04/05/2024-19:37:58] [I] Weightless: Disabled
[04/05/2024-19:37:58] [I] Version Compatible: Disabled
[04/05/2024-19:37:58] [I] ONNX Native InstanceNorm: Disabled
[04/05/2024-19:37:58] [I] TensorRT runtime: full
[04/05/2024-19:37:58] [I] Lean DLL Path: 
[04/05/2024-19:37:58] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/05/2024-19:37:58] [I] Exclude Lean Runtime: Disabled
[04/05/2024-19:37:58] [I] Sparsity: Disabled
[04/05/2024-19:37:58] [I] Safe mode: Disabled
[04/05/2024-19:37:58] [I] Build DLA standalone loadable: Disabled
[04/05/2024-19:37:58] [I] Allow GPU fallback for DLA: Disabled
[04/05/2024-19:37:58] [I] DirectIO mode: Disabled
[04/05/2024-19:37:58] [I] Restricted mode: Disabled
[04/05/2024-19:37:58] [I] Skip inference: Disabled
[04/05/2024-19:37:58] [I] Save engine: 
[04/05/2024-19:37:58] [I] Load engine: 
[04/05/2024-19:37:58] [I] Profiling verbosity: 0
[04/05/2024-19:37:58] [I] Tactic sources: Using default tactic sources
[04/05/2024-19:37:58] [I] timingCacheMode: local
[04/05/2024-19:37:58] [I] timingCacheFile: 
[04/05/2024-19:37:58] [I] Enable Compilation Cache: Enabled
[04/05/2024-19:37:58] [I] errorOnTimingCacheMiss: Disabled
[04/05/2024-19:37:58] [I] Heuristic: Disabled
[04/05/2024-19:37:58] [I] Preview Features: Use default preview flags.
[04/05/2024-19:37:58] [I] MaxAuxStreams: -1
[04/05/2024-19:37:58] [I] BuilderOptimizationLevel: -1
[04/05/2024-19:37:58] [I] Calibration Profile Index: 0
[04/05/2024-19:37:58] [I] Input(s)s format: fp32:CHW
[04/05/2024-19:37:58] [I] Output(s)s format: fp32:CHW
[04/05/2024-19:37:58] [I] Input build shapes: model
[04/05/2024-19:37:58] [I] Input calibration shapes: model
[04/05/2024-19:37:58] [I] === System Options ===
[04/05/2024-19:37:58] [I] Device: 0
[04/05/2024-19:37:58] [I] DLACore: 
[04/05/2024-19:37:58] [I] Plugins:
[04/05/2024-19:37:58] [I] setPluginsToSerialize:
[04/05/2024-19:37:58] [I] dynamicPlugins:
[04/05/2024-19:37:58] [I] ignoreParsedPluginLibs: 0
[04/05/2024-19:37:58] [I] 
[04/05/2024-19:37:58] [I] === Inference Options ===
[04/05/2024-19:37:58] [I] Batch: Explicit
[04/05/2024-19:37:58] [I] Input inference shapes: model
[04/05/2024-19:37:58] [I] Iterations: 10
[04/05/2024-19:37:58] [I] Duration: 3s (+ 200ms warm up)
[04/05/2024-19:37:58] [I] Sleep time: 0ms
[04/05/2024-19:37:58] [I] Idle time: 0ms
[04/05/2024-19:37:58] [I] Inference Streams: 1
[04/05/2024-19:37:58] [I] ExposeDMA: Disabled
[04/05/2024-19:37:58] [I] Data transfers: Enabled
[04/05/2024-19:37:58] [I] Spin-wait: Disabled
[04/05/2024-19:37:58] [I] Multithreading: Disabled
[04/05/2024-19:37:58] [I] CUDA Graph: Disabled
[04/05/2024-19:37:58] [I] Separate profiling: Disabled
[04/05/2024-19:37:58] [I] Time Deserialize: Disabled
[04/05/2024-19:37:58] [I] Time Refit: Disabled
[04/05/2024-19:37:58] [I] NVTX verbosity: 0
[04/05/2024-19:37:58] [I] Persistent Cache Ratio: 0
[04/05/2024-19:37:58] [I] Optimization Profile Index: 0
[04/05/2024-19:37:58] [I] Inputs:
[04/05/2024-19:37:58] [I] === Reporting Options ===
[04/05/2024-19:37:58] [I] Verbose: Disabled
[04/05/2024-19:37:58] [I] Averages: 10 inferences
[04/05/2024-19:37:58] [I] Percentiles: 90,95,99
[04/05/2024-19:37:58] [I] Dump refittable layers:Disabled
[04/05/2024-19:37:58] [I] Dump output: Disabled
[04/05/2024-19:37:58] [I] Profile: Disabled
[04/05/2024-19:37:58] [I] Export timing to JSON file: 
[04/05/2024-19:37:58] [I] Export output to JSON file: 
[04/05/2024-19:37:58] [I] Export profile to JSON file: 
[04/05/2024-19:37:58] [I] 
[04/05/2024-19:37:58] [I] === Device Information ===
[04/05/2024-19:37:58] [I] Available Devices: 
[04/05/2024-19:37:58] [I]   Device 0: "Orin" UUID: GPU-bbbffbc5-6199-53c5-9e2e-968c02f36da7
[04/05/2024-19:37:58] [I] Selected Device: Orin
[04/05/2024-19:37:58] [I] Selected Device ID: 0
[04/05/2024-19:37:58] [I] Selected Device UUID: GPU-bbbffbc5-6199-53c5-9e2e-968c02f36da7
[04/05/2024-19:37:58] [I] Compute Capability: 8.7
[04/05/2024-19:37:58] [I] SMs: 16
[04/05/2024-19:37:58] [I] Device Global Memory: 62841 MiB
[04/05/2024-19:37:58] [I] Shared Memory per SM: 164 KiB
[04/05/2024-19:37:58] [I] Memory Bus Width: 256 bits (ECC disabled)
[04/05/2024-19:37:58] [I] Application Compute Clock Rate: 1.3 GHz
[04/05/2024-19:37:58] [I] Application Memory Clock Rate: 1.3 GHz
[04/05/2024-19:37:58] [I] 
[04/05/2024-19:37:58] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/05/2024-19:37:58] [I] 
[04/05/2024-19:37:58] [I] TensorRT version: 9.3.0
[04/05/2024-19:37:58] [I] Loading standard plugins
[04/05/2024-19:37:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 16, GPU 34899 (MiB)
^C
root@75bc0fad1c1c:/test#  

I guess we need to wait a bit as you suggested before 🕯️

OK yep yep @narandill, no worries - thanks for confirming it on your end. At least we aren’t both going crazy now!

Let’s table this for a bit until the next version of TRT-LLM comes out, which I believe should explicitly support CUDA 12.4. Currently the TRT-LLM repo is on CUDA 12.2, which we don’t have TRT9/10 for (we do for CUDA 12.4). Ahhh, compatibility matrices!

1 Like

4/8/2024 - New Container Images and Tool Usage

This past week these containers and images have been added, thanks everyone!

  • ollama
    • dustynv/ollama:r35.4.1
    • dustynv/ollama:r36.2.0
  • piper-tts
    • dustynv/piper-tts:r35.4.1
    • dustynv/piper-tts:r36.2.0
  • homeassistant-core
    • dustynv/homeassistant-core:r35.4.1
    • dustynv/homeassistant-core:r36.2.0

There is now an install script and a jetson-containers launcher with improved syntactic sugar:

# install python requirements and link autotag and launcher under /usr/local/bin
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

# run or build containers from any directory
jetson-containers run $(autotag ollama)
jetson-containers build --name=my_container l4t-text-generation

Also, documentation has been added for changing the CUDA version, along with other package versions and the role of the pip cache in accelerating builds.