We are using the Jetson AGX Xavier and Orin for automotive product development.
At our customer's request, we would like to transfer RAW data over the PCIe interface with low latency and low system resource consumption.
We tried transferring the data via a virtual network over PCIe, but the performance requirement could not be met.
Therefore, we’d like to get your recommendations:
What data type and data transfer protocol do you most recommend for the PCIe Gen4 interface?
For routing the data from the camera input to the NVIDIA SOM, is there an effective way to transmit the data along this path? Below is our current camera architecture stack flow.
This isn't an answer, but it describes some things related to your question.
If this requires functional safety, then you will need something with an ARM Cortex-R, and there are variants of the Jetson AGX Orin and Xavier that might be better suited. I don't know if they are available yet, but I think the IGX Orin might have something beyond the usual AGX Orin for functional safety. Full-sized automobiles might instead use something in the DRIVE line. Both the hardware and software differ.
Whenever you wish to maximize communications on an ordinary Jetson, you'll want to select the performance model (which is just a constraint on the range of clock speeds and on which CPU cores are active, for saving power), e.g., via "sudo nvpmodel -m 0". That makes the full performance range available. Then, once that is selected, make sure you have maximized the clocks within that performance model with "sudo jetson_clocks".
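As a quick sanity check, both tools can report their current state (a sketch assuming the standard L4T utilities; the exact output format varies by release):

sudo nvpmodel -q
# prints the currently selected power model
sudo jetson_clocks --show
# prints the current versus maximum clock settings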
Sometimes autosuspend can get in the way as well, although this is more of a USB issue; if PCIe is working at any speed at all, then autosuspend is not the problem.
You'll also want to be sure the interface is actually operating in PCIe v4 mode. If you run "lspci", you'll see a slot number at the left, and you can limit the query to just that slot. You'll want to post the fully verbose lspci output for that device, which you can get via (I'll pretend the slot is "01:00.0", but adjust that to your situation): sudo lspci -s 01:00.0 -vvv 2>&1 | tee log_lspci.txt
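For example, to see just the negotiated link speed and width (again pretending the slot is "01:00.0"; this is only a sketch):

sudo lspci -s 01:00.0 -vvv | grep -i 'lnkcap\|lnksta'

In the LnkSta line, Gen4 shows up as "Speed 16GT/s"; if you instead see 8GT/s or lower, the link has trained down to Gen3 or below even if the slot is Gen4-capable.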
I had already enabled maximum power with the nvpmodel command and maximum frequency with the jetson_clocks command before posting this ticket, so I don't think that is the root cause of the problem.
Also, I can confirm that the PCIe link actually runs in Gen4 mode, because a component connects to the NVIDIA SOM over the PCIe bus and its vendor provided me a tool to check the PCIe link status.
Anyway, my requirement is to improve the raw data transmission rate over the PCIe bus, and my current test case is video capture streamed as GStreamer packets over an IP network.
I am not sure whether this is limited by network management in NVIDIA's original design (for example, QoS), or whether it is just an IP protocol limitation, since the maximum length of an IP packet is only 64 KB.
Or is this simply a current limitation of the NVIDIA platform?
Please give me some ideas or suggestions, thank you very much.
I can't give you a good answer. I will suggest that sometimes the limitation is the CPU core itself being a bottleneck, so if DMA is possible, that could help. Also, there might be particular data transfer sizes which work better than others. For example, with networking, the MTU can change transfer efficiency by a large amount. For networking I suggest you post the current ifconfig output, and for PCIe I suggest you post the fully verbose sudo lspci -vvv. It sounds like you've tried things related to packet size, but I don't think anyone here really knows the exact details of what you've tried so far.
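To illustrate the MTU part (a sketch assuming the interface is eth0 and that every device on the path supports jumbo frames; adjust to your setup):

ip link show eth0
# the current MTU is printed on the first line
sudo ip link set dev eth0 mtu 9000
# jumbo frames; the far end and any switch in between must be set to match
ifconfig eth0
# confirm the new MTU took effect

Larger frames mean fewer packets (and fewer interrupts) per gigabit of data, which can matter a lot when a single CPU core is servicing the NIC.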
QoS won't really help if your device is already the only one on the network (or the only device with significant traffic). We don't know if it is TCP or UDP, and so on. I don't even know if it is possible to meet the requirements, but if you have details on exactly what the data requirements are for throughput and latency, it might help whoever does know. For the case of camera data, knowing things like sensor size, frame rate, and color depth would help, but right now you are the only one who knows those details.
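To show why those details matter, as an illustration only (the numbers are assumptions since I don't know your sensor): a 1920x1080 sensor at 30 fps with 16 bits per pixel of RAW data is 1920 × 1080 × 30 × 16 ≈ 1.0 Gbit/s per stream before any protocol overhead, so only a handful of such streams will saturate a 10G link once packet headers and software overhead are added.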
Actually, my requirement is to transmit a large amount of data (over 10 Gbps) through the PCIe bus, and in my test case I used 12 raw video streams to simulate that load. If my understanding is correct, the received video should stay almost synchronized with the camera capture (very low latency) as long as the transmission rate is sufficient.
Under this assumption, I expected the video latency and quality over a 10G Ethernet channel to be worse than over PCIe Gen4, but that was not the case at all, so I am posting this ticket to get some ideas and suggestions from NVIDIA.
Someone from NVIDIA will have to answer. This is only "barely" useful for your case, but if you have the RT kernel installed (and this can be a major thing to install and can cause other problems), you could maybe increase the priority of the ksoftirqd processes. This only helps if the driver is split between a hardware IRQ handler (the actual ethernet device driver) and a software IRQ handler (usually a larger driver does as little as possible in the non-preemptable hardware IRQ, and then triggers a software IRQ, which is preemptable, for things like checksums that are not hardware-assisted in the NIC). It's much easier to change the priority of user-space processes, but I saw this ksoftirqd suggestion for an unrelated ethernet chipset: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/587925/linux-am4379-how-to-change-interrupt-priority-for-ethernet
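To make that concrete (a sketch only; the priority value 50 is an arbitrary example, <PID> is a placeholder, and this is really only meaningful on a preemptible/RT kernel):

ps -e | grep ksoftirqd
# one ksoftirqd thread per CPU core; note the PID of the one on the core servicing the NIC
sudo chrt --fifo -p 50 <PID>
# switch that thread to SCHED_FIFO at priority 50
chrt -p <PID>
# confirm the new policy and priority

Be careful: setting the priority too high can starve other kernel threads and make things worse overall.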
Note that if you examine “cat /proc/interrupts” (or “watch -n 1 cat /proc/interrupts”, perhaps with a grep to filter for “eth” or “qos”) you are looking at hardware IRQs which probably have no choice but to run on CPU0 due to wiring. What you see when examining the processes of ksoftirqd are drivers which can migrate to any CPU core. It isn’t unusual to want to keep both hardware and software IRQs on the same core to take advantage of cache hits, but in the case of large data transfers it is possible there is no need to try to worry about cache hits/misses.
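As an example of inspecting and attempting to move an IRQ (the IRQ number 199 is made up; take the real one from /proc/interrupts):

grep eth /proc/interrupts
# shows the IRQ number and per-core interrupt counts for the ethernet device
cat /proc/irq/199/smp_affinity
# bitmask of CPU cores allowed to service IRQ 199
echo 2 | sudo tee /proc/irq/199/smp_affinity
# request CPU1 only (bit 1); on hardware wired to CPU0 the write may not actually take effect

Note the tee: a plain "sudo echo 2 > …" would fail because the redirect runs without root.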
Overall, you can try to tune network packet sizes, and you may still reach the limit of the core which is servicing the network device. If you can't move the IRQ to its own core (due to physical wiring and the lack of an IO APIC), then all you can do is try to make the data movement as efficient as possible, or try to use DMA (though I have no idea whether that is possible on the integrated ethernet). My thought is that you won't achieve what you want without some very out-of-the-box thinking, and maybe not even then, due to bottlenecking on a core.
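One more thing worth checking on the efficiency side (a sketch, again assuming eth0; the integrated Jetson ethernet driver may not support all of these):

ethtool -g eth0
# current versus maximum RX/TX ring buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096
# enlarge the rings if the driver allows; larger rings tolerate longer IRQ servicing delays
ethtool -k eth0 | grep -iE 'checksum|scatter|segmentation'
# shows which offloads are active, i.e., which work the NIC does instead of the CPU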
NVIDIA would need to comment on any registers, but if there are ways to tune this, then I would think a lot of people would be happy about it.