Bad JetPack 5 NVMe write performance compared with JetPack 4

Hi folks,
I migrated to JetPack 5 (kernel 5.10.104), and NVMe write performance is very bad compared with the previous JetPack 4 (kernel 4.9.253). Writing to the NVMe drive is specifically very slow.

JP4 Linux 4.9.253
# dd if=/dev/zero of=/media/nvme/output bs=8k count=10k
10240+0 records in
10240+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 0.0733814 s, 1.1 GB/s

JP5 Linux 5.10.104
# dd if=/dev/zero of=/media/nvme/output bs=8k count=10k
10240+0 records in
10240+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 0.268189 s, 313 MB/s

Does anybody have any idea?
I found this forum thread, Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing - Michael Marod, which says the problem was solved in kernel 5.17, but according to the NVIDIA roadmap even JetPack 6 will not ship that kernel.

I observed the same thing. Is there any solution for that?

Please try the suggestion in:
New SSD , slow write rate - #8 by DaneLLL

You can enable the O_DIRECT flag in your application.

Hi @DaneLLL , thank you for your suggestion.
I did the test with the program you mentioned and these are the results:

#JP4 Linux 4.9.254
# /tmp/nvme 
ret = 0
Direct read: total_bytes_read=2147483648 time=869 ms throughput=2356.731876
ret = 0
Buffered read: total_bytes_read=2147483648 time=1387 ms throughput=1476.568133
Direct write: total_bytes_writen=64424509440.000000  **time=37842 ms** throughput=1623.592833
Buffered write: total_bytes_writen=64424509440.000000 **time=45639 ms** throughput=1346.217051

#JP5 Linux 5.10.104
# /tmp/nvme 
ret = 0
Direct read: total_bytes_read=2147483648 time=1380 ms throughput=1484.057971
ret = 0
Buffered read: total_bytes_read=2147483648 time=1440 ms throughput=1422.222222
Direct write: total_bytes_writen=64424509440.000000  **time=37916 ms** throughput=1620.424095
Buffered write: total_bytes_writen=64424509440.000000 time=**100047 ms** throughput=614.111368

OK, I agree direct access performs better, but it creates a risk of data loss in case of a sudden power loss.
Also, O_DIRECT can "only" be used in my own applications. For example, when using tcpdump to record a PCAP file, there is also a wild difference in CPU usage and NVMe write speed compared with JP4 (kernel 4.9).
One more piece of information: with the latest JetPack 5.1.2 (kernel 5.10.120), the performance is even worse. I'm running on a Jetson AGX Xavier.

Direct access seems to help only with big chunks of data:

dd if=/dev/urandom of=/media/nvme/output **bs=1024k count=1k** oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.17123 s, **257 MB/s**


dd if=/dev/urandom of=/media/nvme/output **bs=8k count=128k** oflag=direct
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.012 s, **89.4 MB/s**

Do you (or anybody else) have any other insight?
Best regards, Diogo

Do you know if the fix can be applied to K5.10 for a try? Our hardware can achieve the expected performance if the upstream kernel behaves well.

Or please use the JetPack 4 release. K4.9 should not have the issue.

The point is that I don't have the fix yet. I'm assuming, as described in the thread above, that this issue is gone in kernel 5.17. I also can't simply use the mainline kernel, as NVIDIA applies a lot of changes on top of it. Is the work-in-progress Linux kernel for JetPack 6 already available? I can't find it here

As for downgrading to JetPack 4: I can't, as it is reaching end of life (JetPack 4 Reaches End of Life) and I need the latest libraries, such as VPI 2.2, for the perception stack.

As mentioned in the NVIDIA roadmap, JetPack 6 is coming with kernel 5.15. What about a simple comparison between JP4 and JP6 running a basic dd command against the NVMe? How does it perform on JetPack 6? Does NVIDIA have any workaround to fix this?

Could you help check K5.17 and share which commits may be the potential fix for the issue? It would be great if you could share more information with us so that we can discuss it with our teams. Ideally we would like to keep the upstream kernel as is.

Hi @DaneLLL , I have more test results to share.
I built multiple kernel versions from kernel/git/stable/linux.git - Linux kernel stable tree, flashed them, and ran them on the board.

Kernels 5.7, 5.8, 5.9, 5.17, 5.18, 5.19: all show the bad NVMe write performance. Kernels below 5.7 do not boot on my target (AGX Xavier).
Kernel 6.0: bad NVMe write performance as well.
I used tegra194-p2972-0000.dtb as the device tree and the standard arch/arm64/configs/defconfig.

I'm also trying to build and run a mainline 4.x kernel, but I'm still not able to boot the target: sometimes I get a kernel panic, sometimes it just freezes.

I double-checked the performance of kernel 6.0 vs. 4.9.253, using the same NVMe drive, the same partition table, etc.:

# 6.0.0
dd if=/dev/zero of=/media/nvme/output bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 9.26979 s, 232 MB/s

# 4.9.253
dd if=/dev/zero of=/media/nvme/output bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.665 s, 1.3 GB/s

Thanks for sharing. The results look similar to what we observe while exploring new kernel versions. It seems the issue is claimed to be fixed in 5.17, but the test results don't show it.

Hi @DaneLLL , thank you again for your reply.

I did more tests focused on PCIe, because I also noticed bad performance under high network traffic, and my Ethernet interface sits behind PCIe. My test was to send UDP packets with iperf3 --udp --client --bitrate 1000M.
When running kernel 4.9 (JP4), iperf -s uses around 50% of a single core.
When running kernel 5.10.104 (JP5), iperf -s uses around 82% of a single core.
These results made me suspect a possible PCIe driver issue, as both NVMe and Ethernet run over PCIe.

Checking the code provided in the Driver Package (BSP) Sources, I could see a PCIe driver in kernel/nvidia/drivers/pci/dwc/pcie-tegra.c that is pretty much the same one used in JetPack 4 (kernel 4.9). The problem is that this driver is not compatible with kernel 5.10. Instead, the kernel builds its PCIe driver from kernel/kernel-5.10/drivers/pci/controller/dwc/pcie-tegra194.c, which is the "generic" driver coming from mainline.

Both kernels are running with power mode MAXN and clocks set to maximum via `jetson_clocks`.

Basically, network and NVMe performance are both worse on JP5, and both run over PCIe.
Are you sure this "generic" driver performs as well as the old one from the nvidia/drivers/ folder? Do you have PCIe benchmarks comparing JP4 and JP5?
Since you/NVIDIA are observing the same results, how is NVIDIA handling this performance issue? Are the new JetPacks being released with this worse performance?

We don't have different handling for direct IO and buffered IO, but throughput is very different in the two cases. As of now, we think it is due to a security mechanism for buffered IO in the upstream kernel. We tried removing the mechanism from K5.10, but the system misbehaves, so it looks like the mechanism is a must-have for K5.10.

We would suggest using direct IO to achieve optimal throughput. If you have further findings, please share them with us.

Removing the cgroup configuration and tracing options from the kernel config improved things a bit, but not enough: the dd command now writes at about 500 MB/s to the NVMe.
It didn't solve the problem. In the end, no solution was found and we temporarily downgraded to JP4.
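For reference, disabling cgroups and tracing in a kernel config amounts to lines along these in .config; this is only an illustration of the kind of change meant above, not the exact diff that was tested, and the available symbols vary by kernel version:

```
# CONFIG_CGROUPS is not set
# CONFIG_BLK_CGROUP is not set
# CONFIG_FTRACE is not set
```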