Increasing DMA channel priority for PCIe device

I have a PCIe device that is streaming data from the Xavier (host). These are DMA reads of a fixed location in host memory, with no interaction from the host. This channel has high-bandwidth, low-latency requirements that are not being met – the streaming reads are getting starved under high CPU and memory load on the host.

Is there a way to increase the priority of the DMA channel serving the PCIe device reads? Or an SMMU setting that can be adjusted?

Thanks for your help,

Kurt

Hi Kurt,

Can you please help us with a couple of questions?

  1. Are you using Xavier DMA or some other endpoint (EP) DMA to initiate the reads?
  2. What is the memory controller configuration? What is the DRAM frequency? Is ECC enabled?
  3. What is the CPU + memory load bandwidth other than the PCIe reads? How much of the load is CPU reads?
  4. What are your expected memory bandwidth and latency? Which PCIe controller is being used?

These are DMA reads of a fixed location in host memory
What do you mean by “fixed location” here? Is the start location a fixed addr?

Hi Cory,

Are you using Xavier DMA or some other endpoint (EP) DMA to initiate the reads?

An FPGA configured as a PCIe endpoint is performing the memory read transactions.

What is the memory controller configuration? What is the DRAM frequency? Is ECC enabled?

This is a standard 8GB or 16GB Xavier (the problem happens in both, but I’m focusing on the 8GB module right now). With the EMC frequency fixed at 1333 MHz and everything else set to maximum via nvpmodel and jetson_clocks, the problem still occurs, albeit less frequently than at lower EMC frequencies.
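For reference, the clock setup described above can be reproduced roughly as follows (a sketch: the nvpmodel mode number and the debugfs path for the EMC rate vary by module and JetPack release, so treat both as assumptions to verify on your board):

```shell
# Select the max-performance power model (mode 0 is MAXN on many Xaviers)
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to their maximum for the selected model
sudo jetson_clocks

# Verify the EMC rate; this debugfs path is typical for Xavier on
# JetPack 4.x but is release-dependent
sudo cat /sys/kernel/debug/bpmp/debug/clk/emc/rate
```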

What is the CPU + memory load bandwidth other than the PCIe reads? How much of the load is CPU reads?

I can easily aggravate the issue by lowering the EMC clock to 666 MHz and then running stress-ng --stream 2 (~15 GB/s), which contends with the PCIe reads enough for the PCIe stream to start missing deadlines periodically.
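The load-generator half of that reproduction can be sketched as below (the `--timeout` and `--metrics-brief` flags are added assumptions for a bounded, measurable run; the EMC cap itself is set separately, as described above):

```shell
# Two STREAM-style memory-bandwidth workers (~15 GB/s aggregate on this
# platform), run for 60 s; these contend with the PCIe reads for DRAM
# bandwidth and trigger the missed deadlines
stress-ng --stream 2 --timeout 60s --metrics-brief
```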

It’s not at all surprising that this kind of aggregate load is about the maximum the memory system can handle at this rate – that’s the point. No matter what the rest of the system is doing, I’d like the PCIe transactions to take priority by some means, or perhaps be treated like an isochronous (ISO) client (such as the display), which seems to be resilient to this sort of contention.

What are your expected memory bandwidth and latency? Which PCIe controller is being used?

In these tests, I’m moving about 2 GB/s over Gen 3 x4 on C5. The load on the host is going to be fairly unpredictable, so I just want some guarantee that the PCIe traffic is not starved by other memory clients.

These are DMA reads of a fixed location in host memory
What do you mean by “fixed location” here? Is the start location a fixed addr?

Yes, it’s a video framebuffer, with a fixed start and size that are not moving during these tests. The FPGA is streaming this data through a short fifo and out to its own display (hence the deadlines).

Thanks for your help,

Kurt

Our team is checking it. We will update you once we have a solution.

Thanks for your patience.

Hi Kurt,

In these tests, I’m moving about 2 GB/s over Gen 3 x4 on C5. The load on the host is going to be fairly unpredictable, so I just want some guarantee that the PCIe traffic is not starved by other memory clients.

Could you also share

  1. What is the bandwidth that you are actually getting here?
  2. What is the ISO (Display/Camera) load in the system?

Hi Wayne,

What is the bandwidth that you are actually getting here?

The stream must sustain at least 2 GB/s and cannot be totally starved for more than 32 µs. Under normal conditions, this is no problem, but under heavy memory load from other clients these requirements are not met.
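Those two numbers imply a minimum buffer on the FPGA side. A quick back-of-the-envelope check (using decimal units, 1 GB = 1e9 bytes, as an assumption about how the rate is specified):

```python
# Required sustained rate and maximum tolerated starvation window
rate_bytes_per_s = 2e9   # 2 GB/s (decimal)
max_gap_s = 32e-6        # 32 microseconds

# Bytes the FPGA's FIFO must absorb to ride out one full starvation gap
fifo_bytes = round(rate_bytes_per_s * max_gap_s)
print(fifo_bytes)        # 64000 bytes, i.e. ~64 KB of buffering
```

So the "short fifo" mentioned below has to cover on the order of 64 KB per worst-case gap; any starvation longer than 32 µs overruns it regardless of average bandwidth.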

What is the ISO (Display/Camera) load in the system?

Zero. The memory load that will interfere with PCIe can come solely from the CPU (stress-ng --stream, for example).

Kurt

Hi Kurt,

I also received your feedback from Arrow. According to your comment, you could still hit this issue even at the highest EMC frequency (1333 MHz) on the 8GB Xavier, just less frequently. Is that correct?

It will still occur immediately and repeatedly at 1333 MHz on the 8GB module with the right memory-intensive workload – one that does not end up CPU bound. Such workloads are fewer on the 8GB module than on the 16GB module, because more work becomes CPU bound, but the problem is still there.