Successful kernel tweaks to support Kubernetes and Calico on the Jetson Orin Nano

Hi all, I picked up a Jetson Orin Nano 8GB Dev Kit about 6 months ago, and it’s been sitting neglected in the box until this week when I finally decided to integrate it into my Raspberry Pi cluster.

It wasn’t as straightforward as I thought it would be, and from my research I can see quite a few other people have struggled with Kubernetes (or k3s, minikube, etc.) on the Orin Nano. I’ve been successful in my approach, so I thought I’d share my process.

First here’s a bit of info about the cluster so you can see what I’m working with:

Hardware:

  • Control Plane: Raspberry Pi 5 16GB
  • Worker Nodes: 4 x Raspberry Pi 4 8GB
  • NFS Node: 1 x Raspberry Pi 5 8GB with NVMe hat (running 2 x 1TB SSDs)
  • Router: Mikrotik Hex S
  • Switch: Digitus Gigabit PoE Switch
  • Jetson: Orin Nano 8GB Dev Kit - 64GB microSD, no NVMe yet :(

Software

  • k8s Version: Kubernetes v1.34.1
  • Gitops: ArgoCD
  • CNI: Calico
  • Ingress: nginx
  • Load balancer: metallb
  • Network Proxy: kube-proxy
  • Storage: Longhorn
  • Certs: cert-manager + Cloudflare
  • Metrics: prometheus + grafana

Essentially… ArgoCD is my GitOps controller and lets me version all my configuration in a git repo. Calico handles inter-pod communication, while kube-proxy handles service routing (TCP, UDP etc). MetalLB hands out LoadBalancer IPs on the local network, cert-manager issues certs via Cloudflare, and ingress is handled by a simple nginx config. Longhorn lets me mount persistent volumes backed by the NFS node from any of the other nodes. Prometheus collects metrics across the cluster and Grafana is used to visualise them.

Anyway, onto the relevant stuff: getting the Orin Nano integrated. I’m afraid this is more of a report than a really in-depth guide. I won’t be giving you every exact command to run, mainly because I don’t have the patience to poke about in multiple bash history files, but there’s lots here, so hopefully this information will still be useful.

Jetson Initial Setup

My Jetson was BNIB, but it shipped with r36 firmware so I was able to jump straight into JetPack 6.2 without doing the 5 → 6 upgrade path. Disabling swap is a requirement for Kubernetes; on the Jetson that means disabling zram swap. From there I was able to run the join command and saw the node appear on the cluster. CPU, RAM and disk pressure were being reported, and I started seeing a few daemonsets popping up on the node! When I launched Grafana, I could already see CPU temps being accurately reported in my dashboards!
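
For reference, here’s roughly what the swap change looks like on JetPack 6 - a minimal sketch, assuming the stock nvzramconfig service is what provides the zram swap:

# turn off all active swap, including the zram devices
sudo swapoff -a
# stop JetPack's zram service from recreating them on boot
sudo systemctl disable --now nvzramconfig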

Too easy! Or so I thought.

CNI and Calico

Then the problem appeared - my Calico pods were failing. It wasn’t the usual suspects - permissions, resources, bad volume mappings etc. - it was an issue with the host node, not something wrong with the kube config. After much swearing and furious research it turned out to be a problem other people had run into: the Tegra kernel is missing the modules Calico needs to manage iptables and ipset. Urgh. I could run a pod via a NodeSelector, but without Calico the node would be severely limited.
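
If you want to confirm this on your own node, ask the kernel for one of the modules Calico needs:

# on a stock Tegra kernel this fails outright - the module isn't shipped
sudo modprobe ip_set
# and nothing ipset-related shows up in the loaded module list
lsmod | grep ip_set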

I looked at Calico alternatives. Cilium would suffer from the same issue, and while Flannel doesn’t need iptables it does far less than Calico, so I’d need lots of other services to make up the shortfall. The cluster had been running wonderfully on Calico for a long time, so I didn’t really want to migrate away from it.

It would probably be less hassle to recompile the Linux kernel, I thought. So I did! And it was!

Enabling kernel modules in Linux_for_Tegra

I am a software engineer but I’m no kernel expert, and I’m certainly no Tegra expert. So this was relatively uncharted waters for me. I have to give it up to gpt-5.2-codex for holding my hand through some of this, particularly in identifying which parts of the kernel I needed to add. So don’t get it twisted, I am not some eldritch wizard. I know some spells, but so do you, and if I can do this you can too.

So the good news is that if you’re running the latest firmware you probably don’t have to recompile the whole kernel; you can just compile the missing kernel modules and then activate them. You don’t even need to restart! The process is as follows:

  1. Download the source code for your firmware. If you’re running the latest firmware this is most likely Jetson Linux 36.4.4. You might be running a more recent patch release (I’m actually running 36.4.7), but this is the latest source release and good enough for our purposes.
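
    The sources ship as public_sources.tbz2 alongside the BSP on the Jetson Linux download page, so grabbing them looks something like this:

    tar xjf public_sources.tbz2
    cd Linux_for_Tegra/source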

  2. Extract it and you should have a Linux_for_Tegra folder. I’d recommend compiling on the Jetson itself, that way you don’t have to account for differing architectures, so copy it over to the Jetson and make sure you have the prerequisites installed. Following the developer guide, we still need the sources synced, so make sure you do that.
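
    Roughly, that means the usual kernel build packages, then the sync script from the source folder. The package list is per the developer guide, and the release tag here is a guess for 36.4.4 - check the guide for the one matching your release:

    sudo apt install git build-essential bc flex bison libssl-dev
    ./source_sync.sh -k -t jetson_36.4.4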

  3. Once the sources have synced you should be able to cd down into something like /source/kernel/kernel-jammy-src/ - this is where we will make some minor config changes that tell the compiler to build the modules we need.

  4. You need a good config to start. If certain values in your config don’t match the running kernel, the modules we compile will be rejected and won’t load. The best way I found to handle this was to copy the config of the running kernel:

    zcat /proc/config.gz > Linux_for_Tegra/source/kernel/kernel-jammy-src/.config

    We can then apply a few changes on top of that to minimise any mismatches.

  5. With that config in place we can toggle some modules. Updating these options will enable the ipset and netfilter modules we need for Calico:

    scripts/config --enable IP_SET
    scripts/config --module IP_SET_HASH_IP
    scripts/config --module IP_SET_HASH_NETPORTNET
    scripts/config --module IP_SET_HASH_NET
    scripts/config --module NETFILTER_XT_TARGET_CT
    scripts/config --module NETFILTER_XT_MATCH_RPFILTER
    scripts/config --module IP_NF_MATCH_RPFILTER
    scripts/config --module IP6_NF_MATCH_RPFILTER
    scripts/config --module NETFILTER_NETLINK_LOG
    
  6. In addition, I had some issues getting the release tag to match up. My firmware wanted the modules compiled with the tag `5.15.148-tegra`, but mine were coming out under `5.15.148-prod`. I updated the config like so:

    scripts/config --disable LOCALVERSION_AUTO
    scripts/config --set-str LOCALVERSION "-tegra"
    
  7. We should now be able to apply our updated config and compile the modules:

    make olddefconfig
    make -j"$(nproc)" modules

    olddefconfig should print a message confirming the config has been applied before the build starts.
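
    A quick way to confirm the release tag came out right is to ask one of the freshly built modules for its vermagic (run from kernel-jammy-src):

    modinfo -F vermagic net/netfilter/ipset/ip_set.ko

    This should start with 5.15.148-tegra rather than 5.15.148-prod.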

  8. With the modules compiled, we can install them into a staging folder and then cherry-pick the bits we need into the running system:

    sudo make INSTALL_MOD_PATH=/tmp/tegra-mods modules_install

    That sticks them under /tmp rather than touching the live /lib/modules tree.

  9. That should generate a load of files with the following directory structure: /tmp/tegra-mods/lib/modules/5.15.148-tegra/kernel/… You can see the release tag we set in step 6 in these file paths. Our modules need to move into /lib/modules/5.15.148-tegra/kernel/… where the running kernel resides. Here’s a list of all the new modules you’ll need:

    /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset/ip_set.ko
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset/ip_set_hash_ip.ko
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset/ip_set_hash_netportnet.ko
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset/ip_set_hash_net.ko
    
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/xt_CT.ko
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/xt_NFLOG.ko
    /lib/modules/5.15.148-tegra/kernel/net/netfilter/nfnetlink_log.ko
    
    /lib/modules/5.15.148-tegra/kernel/net/ipv4/netfilter/ipt_rpfilter.ko
    /lib/modules/5.15.148-tegra/kernel/net/ipv6/netfilter/ip6t_rpfilter.ko
    

    Copy them from /tmp/tegra-mods/ into /lib/modules/ and run sudo depmod -A.
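
    In practice that boils down to a handful of copies - a sketch, noting the ipset directory has to be created first since the stock kernel doesn’t ship it:

    cd /tmp/tegra-mods/lib/modules/5.15.148-tegra/kernel
    sudo mkdir -p /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset
    sudo cp net/netfilter/ipset/ip_set.ko \
            net/netfilter/ipset/ip_set_hash_ip.ko \
            net/netfilter/ipset/ip_set_hash_netportnet.ko \
            net/netfilter/ipset/ip_set_hash_net.ko \
            /lib/modules/5.15.148-tegra/kernel/net/netfilter/ipset/
    sudo cp net/netfilter/xt_CT.ko \
            net/netfilter/xt_NFLOG.ko \
            net/netfilter/nfnetlink_log.ko \
            /lib/modules/5.15.148-tegra/kernel/net/netfilter/
    sudo cp net/ipv4/netfilter/ipt_rpfilter.ko /lib/modules/5.15.148-tegra/kernel/net/ipv4/netfilter/
    sudo cp net/ipv6/netfilter/ip6t_rpfilter.ko /lib/modules/5.15.148-tegra/kernel/net/ipv6/netfilter/
    sudo depmod -A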

  10. If everything has worked, the modules will be activated and your Calico pods should no longer fail. Use modprobe to load the modules and lsmod to confirm they’re present. Pods can now be scheduled on the Jetson! If it doesn’t work, try swearing a lot and asking your favourite LLM for help. It’s probably a version or release-tag mismatch that is giving you grief.
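
    To load and verify the modules by hand:

    sudo modprobe -a ip_set ip_set_hash_ip ip_set_hash_net ip_set_hash_netportnet
    sudo modprobe -a xt_CT xt_NFLOG ipt_rpfilter ip6t_rpfilter
    lsmod | grep -E 'ip_set|rpfilter|xt_CT'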

  11. To load these modules on boot I created /etc/modules-load.d/calico-netfilter.conf, which just contains a list of all the modules we need. Some of these were already in the right place, just not activated, IIRC:

    ip_set_hash_ip
    ip_set_hash_netportnet
    iptable_raw
    iptable_filter
    iptable_mangle
    iptable_nat
    x_tables
    xt_set
    xt_conntrack
    xt_comment
    xt_mark
    xt_CT
    xt_NFLOG
    nfnetlink_log
    ipt_rpfilter
    ip6t_rpfilter
    ip_set_hash_net
    
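    systemd reads this file at boot; to pick it up immediately without a reboot, restarting the loader service should do it (assuming the stock systemd setup, which JetPack uses):

    sudo systemctl restart systemd-modules-load.service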

NVIDIA Device Plugin

With pods schedulable on the Jetson, the next step was getting the NVIDIA device plugin running. I deployed this via Argo as a DaemonSet. It also required some changes on the Jetson: increasing the inotify limits and setting the default runtime to nvidia in /etc/containerd/config.toml. You also need to make sure nvidia-container-toolkit is installed and available.
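
Neither change is Jetson-specific. A rough sketch - the inotify values here are just sensible-looking numbers rather than the exact ones I used, and nvidia-ctk (shipped with nvidia-container-toolkit) handles the containerd edit for you:

# raise inotify limits and persist them across reboots
sudo sysctl -w fs.inotify.max_user_instances=512 fs.inotify.max_user_watches=524288
printf 'fs.inotify.max_user_instances=512\nfs.inotify.max_user_watches=524288\n' | sudo tee /etc/sysctl.d/90-inotify.conf

# make nvidia the default containerd runtime, then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd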

It works!

But it’s not without issue.

Kube sees the Jetson as a single GPU resource available for scheduling, so a greedy pod will sit on it indefinitely. The jetson-copilot deployment does this, and any other containers that need the GPU can’t get scheduled unless I scale it back. This is a kube thing rather than a Jetson thing, and I’ll be looking to use containers that can handle GPU workloads as Jobs rather than evergreen pods.
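
As a sketch of that direction, a one-shot Job holds the nvidia.com/gpu resource only while it runs and then frees it for the next workload. The image and command here are placeholders, not my actual workload:

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-oneshot                          # placeholder name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: work
        image: nvcr.io/nvidia/l4t-base:r36.2.0   # placeholder image
        command: ["sh", "-c", "echo run your CUDA workload here"]
        resources:
          limits:
            nvidia.com/gpu: 1                # released when the Job completes
EOF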

I also desperately need to add NVMe storage. The 64GB microSD isn’t cutting it, and it’s definitely limiting the work I can do with larger models and samples. One for when the bank balance allows.

Benchmark

Here are some results from a little test container I got ChatGPT to write. We have some TFLOPS, just not that many.

I’d be very interested in running some more suitable benchmarks created by knowledgeable humans, so if anyone has any suggestions I’d love to hear them!

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:08:11_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
device=Orin
sm=8.7 driver=12060 runtime=12020
global_mem_MB=7620
axpy: repeat 1/3
axpy: repeat 2/3
axpy: repeat 3/3
axpy: N=16777216 iters=400 warmup=10 repeats=3 time_ms=957.397 bandwidth_GBps=84.11
sgemm: repeat 1/3
sgemm: repeat 2/3
sgemm: repeat 3/3
sgemm: M=4096 N=4096 K=4096 iters=30 warmup=5 repeats=3 time_ms=3385.707 TFLOPS=1.22
sgemm2: repeat 1/3
sgemm2: repeat 2/3
sgemm2: repeat 3/3
sgemm2: M=3072 N=3072 K=3072 iters=40 warmup=5 repeats=3 time_ms=1935.897 TFLOPS=1.20
tf32: repeat 1/3
tf32: repeat 2/3
tf32: repeat 3/3
tf32: M=6144 N=6144 K=6144 iters=40 warmup=5 repeats=3 time_ms=4471.029 TFLOPS=4.15

Here’s some red hot pics for your viewing pleasure:

The lack of an NVMe drive is very apparent here!

Thermals over 24 hours. The Jetson is the yellow line - the first spikes are some simple CUDA tests, the second sustained increase is jetson-copilot doing some RAG training on mxbai-embed-large. I’m happy with those CPU temps under load; the GPU temp was stable at about 61°, I just haven’t added it to prom/graf yet.

Thanks for reading, let me know any questions or recommendations you might have!

Here’s a pic of the cluster itself, for posterity. ;)

The Jetson is the 3U unit above the router. The little router sitting on top is the main gateway for the house.

And yes, the little gap under the brush plate really bugs me.

Hi,
Thanks for sharing. We have a suggestion in
Kubernetes on Jetson Orin Nanos - #15 by AastaLLL

It looks like your approach does not need servicelb disabled. Really appreciate you sharing.