K3S agent on Orin AGX

Hi,

I’m running JetPack 6.2.1 on an Orin AGX Devkit.

It’s part of a K3s cluster, using containerd 1.7.27 as the container runtime, along with the NVIDIA Container Toolkit.

Most of my pods are working correctly. I’d like to use Azure Monitor though, and some of its pods report the error below.

kubectl logs ama-logs-rrjdh -n kube-system
Defaulted container "addon-token-adapter" out of: addon-token-adapter, ama-logs, ama-logs-prometheus
MICROSOFT SOFTWARE LICENSE TERMS
MICROSOFT Azure Arc for Kubernetes
__________________________________
This software is licensed to you as part of your or your company's subscription license for Microsoft Azure Services. You may only use the software with Microsoft Azure Services and subject to the terms and conditions of the agreement under which you obtained Microsoft Azure Services. If you do not have an active subscription license for Microsoft Azure Services, you may not use the software. Microsoft Azure Legal Information: https://azure.microsoft.com/en-us/support/legal/
F1030 15:26:27.503782       1 main.go:61] error modifying iptable rules: error adding rules to custom chain: running [/usr/sbin/iptables -t nat -A aad-metadata -p tcp -d 169.254.169.254 --dport 80 -j DNAT --to-destination 127.0.0.1:8421 --wait]: exit status 4: Warning: Extension tcp revision 0 not supported, missing kernel module?
Warning: Extension DNAT revision 0 not supported, missing kernel module?
iptables v1.8.10 (nf_tables):  RULE_APPEND failed (No such file or directory): rule in chain aad-metadata

Would you have any fix on that? I’ve seen there were similar issues with Docker that got solved, but couldn’t find anything with containerd.
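
In case it helps, this is the kind of check that can be run on the host to see which iptables backend is in use and whether the usual netfilter modules are around (a rough sketch; I’m assuming the tcp match and DNAT target come from xt_tcpudp and xt_nat via nft_compat):

# which backend the host iptables uses (nf_tables vs legacy)
iptables -V
# are the compat/NAT modules present in the running kernel?
lsmod | grep -E 'nf_tables|nft_compat|xt_nat|xt_tcpudp'
# kernel config, if /proc/config.gz is exposed by this kernel
zcat /proc/config.gz 2>/dev/null | grep -E 'NFT_COMPAT|NFT_NAT|NETFILTER_XTABLES'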

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll attempt to provide a comprehensive answer to the question.

The question is about an issue with Azure Monitor on a Kubernetes (K3s) cluster using containerd as the container runtime on a Jetson AGX Orin Devkit running Jetpack 6.2.1. The user is experiencing issues with some pods reporting errors related to Azure Monitor, and they are looking for a solution.

After reviewing the provided context, I found some relevant information that might be helpful in resolving the issue:

  1. Similar issues with Docker: The user mentioned that similar issues with Docker were solved, but they couldn’t find anything related to containerd. This suggests that the issue might be related to the container runtime, and a solution might be found by comparing the Docker and containerd configurations.
  2. NVIDIA Container Toolkit: The user is using the NVIDIA Container Toolkit, which is a set of tools for building and running GPU-accelerated containers. This toolkit might be relevant to the issue, as it provides a way to manage GPU resources and optimize container performance.
  3. Containerd version: The user is using containerd 1.7.27, which is a relatively recent version. However, it’s possible that there are known issues or bugs in this version that could be contributing to the problem.

To troubleshoot the issue, I would recommend the following steps:

  1. Check the containerd logs: Review the containerd logs to see if there are any error messages related to Azure Monitor or the NVIDIA Container Toolkit (see the sketch after this list for where to find them on a k3s node).
  2. Compare Docker and containerd configurations: Compare the Docker and containerd configurations to see if there are any differences that could be contributing to the issue.
  3. Check the NVIDIA Container Toolkit documentation: Review the NVIDIA Container Toolkit documentation to see if there are any known issues or limitations related to Azure Monitor or containerd.
  4. Try a different container runtime: If possible, try using a different container runtime, such as Docker, to see if the issue persists.
  5. Check the K3s cluster configuration: Review the K3s cluster configuration to ensure that it is properly configured and that there are no issues with the cluster that could be contributing to the problem.
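
For step 1, on a k3s agent node the runtime logs are usually reachable like this (a sketch that assumes k3s runs its embedded containerd; paths differ if containerd is installed separately):

# k3s agent service logs (the unit is just "k3s" on a server node)
journalctl -u k3s-agent -f
# containerd log written by k3s when it manages containerd itself
tail -f /var/lib/rancher/k3s/agent/containerd/containerd.log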

In terms of a specific fix, I couldn’t find any information in the provided context that would suggest a straightforward solution. However, by following the troubleshooting steps outlined above, the user may be able to identify the root cause of the issue and find a solution.

It’s also worth noting that the user may want to consider reaching out to NVIDIA support or the K3s community for further assistance, as they may have more expertise and knowledge about the specific issue.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

I’ve already opened a ticket on Azure, to have their POV on that issue.

I’ve already tried those steps.

Note: I’ve spent some time researching this issue, reading plenty of docs and the forum; I’m not sure that replacing one LLM with another is the best shot at solving it…

Hi,

Yes, this is a known issue, as nf_tables doesn’t work on JetPack 6.2.
We tried to fall back iptables to legacy mode, but that fails as well, so servicelb needs to be disabled in the end.
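
For example, with k3s installed as a systemd service, servicelb can typically be disabled through the k3s config file and a restart (a sketch; adjust to how k3s was installed, the --disable flag on the server command line works as well):

# /etc/rancher/k3s/config.yaml on the server node
disable:
  - servicelb

# then restart the server
sudo systemctl restart k3s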

Please find more information in the link below:

Thanks.

I tried that, and I’m still getting the same error. Do you know if it works with containerd as the container runtime? The only change is that svclb-traefik is gone now.

kubectl get pods -n kube-system
NAME                                                  READY   STATUS             RESTARTS        AGE
ama-logs-jxs8w                                        3/3     Running            1 (75m ago)     79m
ama-logs-qdlpc                                        2/3     CrashLoopBackOff   34 (61s ago)    79m
ama-logs-rs-6d4cb76db6-pxt7m                          1/2     CrashLoopBackOff   20 (71s ago)    79m
ama-metrics-569bb9f4c7-2wtg5                          1/2     CrashLoopBackOff   24 (85s ago)    79m
ama-metrics-569bb9f4c7-nx484                          1/2     CrashLoopBackOff   24 (116s ago)   79m
ama-metrics-ksm-6c9746f9bb-znv9b                      1/1     Running            0               79m
ama-metrics-node-kfbp2                                2/2     Running            1 (77m ago)     79m
ama-metrics-node-nczst                                1/2     CrashLoopBackOff   43 (21s ago)    79m
ama-metrics-operator-targets-6856cb96f5-66z98         2/2     Running            3 (77m ago)     79m
azuremonitor-metrics-prometheus-node-exporter-jjw7z   1/1     Running            0               79m
azuremonitor-metrics-prometheus-node-exporter-l75pf   1/1     Running            0               79m
coredns-64fd4b4794-v7zp5                              1/1     Running            0               92m
helm-install-traefik-crd-24b8d                        0/1     Completed          0               92m
helm-install-traefik-zphcn                            0/1     Completed          1               92m
local-path-provisioner-774c6665dc-swvld               1/1     Running            0               92m
metrics-server-7bfffcd44-5xdpw                        1/1     Running            0               92m
nvidia-device-plugin-daemonset-5cs8z                  1/1     Running            0               87m
nvidia-device-plugin-daemonset-mnbdt                  1/1     Running            0               87m
traefik-c98fdf6fb-4mf4l                               1/1     Running            0               92m

It didn’t change anything for the other pods; I’m still getting this log for all the ones in CrashLoopBackOff.

12:12:03.691821       1 main.go:61] error modifying iptable rules: error adding rules to custom chain: running [/usr/sbin/iptables -t nat -A aad-metadata -p tcp -d 169.254.169.254 --dport 80 -j DNAT --to-destination 127.0.0.1:8421 --wait]: exit status 4: Warning: Extension tcp revision 0 not supported, missing kernel module?
Warning: Extension DNAT revision 0 not supported, missing kernel module?
iptables v1.8.10 (nf_tables):  RULE_APPEND failed (No such file or directory): rule in chain aad-metadata

Could recompiling the kernel with the flags mentioned there solve my issues?
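
One way to narrow it down might be to replay the exact failing rule directly on the host: if the same "missing kernel module?" warnings appear outside the container, then it’s the host kernel that lacks the pieces (a rough sketch using a throwaway chain that mirrors the rule from the log above):

# create a scratch chain in the nat table and try the same DNAT rule
sudo iptables -t nat -N aad-metadata-test
sudo iptables -t nat -A aad-metadata-test -p tcp -d 169.254.169.254 --dport 80 -j DNAT --to-destination 127.0.0.1:8421
# clean up the test chain afterwards
sudo iptables -t nat -F aad-metadata-test
sudo iptables -t nat -X aad-metadata-test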

Some progress: I recompiled the kernel with these options

CONFIG_NFT_MASQ=m
CONFIG_NFT_NAT=m
CONFIG_NFT_COMPAT=m
CONFIG_NETFILTER_XTABLES=m
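
To double-check that the rebuilt kernel actually exposes them after a reboot, something along these lines should work (a sketch; it assumes the modules were installed under /lib/modules/$(uname -r), and that the tcp/DNAT extensions map to xt_tcpudp and xt_nat as usual):

# are the module files there?
find /lib/modules/$(uname -r) -name 'nft_compat.ko*' -o -name 'xt_nat.ko*' -o -name 'xt_tcpudp.ko*'
# can they be loaded?
sudo modprobe nft_compat
sudo modprobe xt_nat
sudo modprobe xt_tcpudp
lsmod | grep -E 'nft_compat|xt_nat|xt_tcpudp'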

Getting some improvements

kubectl get pods -n kube-system
NAME                                                  READY   STATUS             RESTARTS       AGE
ama-logs-2kq8v                                        2/3     CrashLoopBackOff   7 (15h ago)    16h
ama-logs-jxs8w                                        3/3     Running            4 (19h ago)    22h
ama-logs-rs-6d4cb76db6-tznnj                          2/2     Running            0              16h
ama-metrics-569bb9f4c7-njk5r                          2/2     Running            0              16h
ama-metrics-569bb9f4c7-qkt2s                          2/2     Running            0              16h
ama-metrics-ksm-6c9746f9bb-m58ll                      1/1     Running            0              16h
ama-metrics-node-bdwl2                                1/2     CrashLoopBackOff   14 (15h ago)   16h
ama-metrics-node-kfbp2                                2/2     Running            3 (19h ago)    22h
ama-metrics-operator-targets-6856cb96f5-9wn92         2/2     Running            2 (16h ago)    16h
azuremonitor-metrics-prometheus-node-exporter-l44r8   1/1     Running            0              16h
azuremonitor-metrics-prometheus-node-exporter-l75pf   1/1     Running            1 (19h ago)    22h
coredns-64fd4b4794-v7zp5                              1/1     Running            1 (19h ago)    22h
helm-install-traefik-crd-24b8d                        0/1     Completed          0              22h
helm-install-traefik-zphcn                            0/1     Completed          1              22h
local-path-provisioner-774c6665dc-swvld               1/1     Running            2 (19h ago)    22h
metrics-server-7bfffcd44-5xdpw                        1/1     Running            1 (19h ago)    22h
nvidia-device-plugin-daemonset-6s2tz                  1/1     Running            0              16h
nvidia-device-plugin-daemonset-mnbdt                  1/1     Running            1 (19h ago)    22h
traefik-c98fdf6fb-4mf4l                               1/1     Running            1 (19h ago)    22h

The pods that are still in CrashLoopBackOff display

F1111 09:07:37.358691       1 main.go:61] error modifying iptable rules: error adding rules to custom chain: running [/usr/sbin/iptables -t nat -A aad-metadata -p tcp -d 169.254.169.254 --dport 80 -j DNAT --to-destination 127.0.0.1:8421 --wait]: exit status 4: Warning: Extension tcp revision 0 not supported, missing kernel module?
Warning: Extension DNAT revision 0 not supported, missing kernel module?
iptables v1.8.10 (nf_tables):  RULE_APPEND failed (No such file or directory): rule in chain aad-metadata

If you have any ideas, I’m keen to try them.

Hi,

If you have another service that requires iptables, could you try falling it back to legacy mode or disabling that functionality?
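
On the Ubuntu-based JetPack rootfs, the host-side switch to the legacy backend is usually done with update-alternatives (a sketch; note that the iptables failing in the logs appears to be the one shipped inside the ama-* container images, so this only affects services that use the host binaries):

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy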

Thanks.

That doesn’t seem to be possible. After some more time spent troubleshooting, I don’t think it’s a Jetson issue, but an Azure-related one.

Thank you for your time
