User space general purpose DMA

I’m working with a Jetson AGX Xavier, using some legacy code that makes a lot of calls to memcpy (or similar), each copying many MBs.
The code runs in userspace (in different threads) and takes a lot of CPU time to handle all these copies.
Is there a general-purpose DMA on the SoC that can somehow be utilized (via a driver, perhaps) to queue these memory copy calls, or perform them instead of the CPU, in some async fashion?
Having worked before at an SoC company, I know there were DMAs in the design to avoid wasting CPU.

You can use DMA on Tegra as well, through the generic DMA calls. You can browse through topics on DMA in the devtalk forum like

Thank you for your prompt response!

I tried the DMA test and saw it working on dma0chan19 and some other channels.
A single thread can copy memory at up to ~120 MB/s.

I didn’t find a datasheet with more information on the GPC DMA. Is it physically capable of more? I saw many channels but only dma0 works, and I assume the channels are queued.
I.e., I need to DMA more in parallel using multiple HW devices (e.g. 550 MB/s would be much nicer).
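As a rough check on the two figures above (about 120 MB/s measured per thread, 550 MB/s wanted), the sketch below estimates how many copy engines would be needed. It assumes the engines are equally fast and scale perfectly in parallel, which this thread does not establish; `engines_needed` is a hypothetical helper name.

```python
import math


def engines_needed(target_mb_s: float, per_engine_mb_s: float) -> int:
    """Smallest number of equally fast, perfectly parallel engines reaching the target rate.

    Assumption: throughput adds up linearly across engines, which real
    hardware sharing one memory system may not deliver.
    """
    if target_mb_s <= 0 or per_engine_mb_s <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(target_mb_s / per_engine_mb_s)


if __name__ == "__main__":
    measured = 120.0  # MB/s, single-thread figure quoted in the post
    target = 550.0    # MB/s, desired rate quoted in the post
    print(engines_needed(target, measured))  # prints 5
```

So even under ideal scaling, roughly five independent transfers at the measured rate would have to run concurrently to reach the target.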

Also, can you refer me to some documentation on it?

Hi Itaig,

Yes, there are 5 DMA instances present on Tegra SoC which can be used to avoid CPU waste.
Since you are concerned with user data, its most suitable to use the GPC-DMA instance. Others are used from other R5 cores.

GPC-DMA is one of these DMA controllers. It has 32 independent channels and can be programmed by different masters.

The DMA controller follows a round-robin arbitration scheme between the channels, starting with channel 0.
A channel can transfer a specified range of data.
Each channel can have an independent burst transfer size programmed (1/2/4/8/16 words).
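To make the arbitration description above concrete, here is a toy software model of round-robin arbitration starting at channel 0, with a per-channel burst size. It is only a sketch: it assumes each channel is granted exactly one burst per turn, which is an assumption and not something taken from the TRM, and `round_robin_grants` is a hypothetical name.

```python
ALLOWED_BURSTS = (1, 2, 4, 8, 16)  # burst sizes in words, as listed above


def round_robin_grants(pending_words, burst_words):
    """Return the (channel, words) grants in arbitration order.

    pending_words: dict channel -> words still to transfer
    burst_words:   dict channel -> programmed burst size in words
    Assumption: one burst per channel per turn, lowest channel first.
    """
    for ch, burst in burst_words.items():
        if burst not in ALLOWED_BURSTS:
            raise ValueError(f"channel {ch}: unsupported burst size {burst}")
    remaining = dict(pending_words)
    grants = []
    while any(n > 0 for n in remaining.values()):
        for ch in sorted(remaining):  # each pass starts again at the lowest channel
            if remaining[ch] > 0:
                words = min(burst_words[ch], remaining[ch])
                grants.append((ch, words))
                remaining[ch] -= words
    return grants


if __name__ == "__main__":
    order = round_robin_grants({0: 32, 1: 8, 2: 16}, {0: 16, 1: 4, 2: 8})
    print(order)
    # [(0, 16), (1, 4), (2, 8), (0, 16), (1, 4), (2, 8)]
```

In this model a channel with a larger burst size moves more data per turn, so burst size is one of the few knobs that changes how the shared bandwidth is divided.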

The Xavier TRM has detailed descriptions of how to program and enable the GPC-DMA controller [Chapter: General Purpose Direct Memory Access (DMA) Engines].

Please let us know if you still have any issues during configuration.

Thanks & Regards,

Thank you Sandipan,

“its most suitable to use the GPC-DMA instance”
I tested the GPC DMA successfully, and can use it. But I have an image stream of ~550MB/s, and each image is copied. I would like to offload more from the CPUs if possible.

“there are 5 DMA instances present on Tegra SoC …Others are used from other R5 cores.”
Found it in the TRM, thanks. I see DMAs for AON, BPMP, SCE and RCE. Is there a way to control them from the Carmel CPU (assuming they are mem->mem capable)?
Maybe through an indirect write somehow?

I don’t think that is possible.
Mostly those are dedicated DMA controllers for the R5.

But it may be possible if access is configured. I say this because the DMA controllers are independent peripherals.
I will check and confirm from the Tegra manual whether the controller address is available. In that case it is a regular peripheral and can be accessed from anywhere if the right CBB access is provided.

For now, can I suggest configuring the DMA and its channels? It might help to some extent.

Thanks & Regards,

AFAIK, channels are used for request/completion control. They will not parallelize the transfer to give a bandwidth exceeding the full speed of a single transaction.

Hi Itaig,

Yes, channels, as you understand, are used for request/completion control. But what can be tried is to allocate as many channels as possible, so that the maximum number of channels are not engaged serving other devices and are not idle until the data copy (mem-mem) completes. It might help in saving switching time from the round-robin arbitration scheme of the channels. This way you can make sure that your process is using the maximum possible bandwidth of the GPC-DMA.
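If several channels are allocated as suggested above, one large copy has to be divided among them. The sketch below only computes the (offset, length) pieces; it does not submit anything to a DMA engine and does not imply any speedup. The 64-byte alignment and the name `split_copy` are assumptions for illustration.

```python
def split_copy(total_bytes, n_channels, align=64):
    """Split [0, total_bytes) into at most n_channels contiguous chunks.

    Returns a list of (offset, length) pairs. Every offset is a multiple
    of `align` (assumed value, not taken from the TRM); only the last
    chunk may have a length that is not a multiple of `align`.
    """
    if total_bytes < 0 or n_channels < 1 or align < 1:
        raise ValueError("invalid arguments")
    per_chunk = -(-total_bytes // n_channels)   # ceiling division
    per_chunk = -(-per_chunk // align) * align  # round up to the alignment
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(per_chunk, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    for offset, length in split_copy(1000, 4):
        print(offset, length)
    # 0 256
    # 256 256
    # 512 256
    # 768 232
```

Each pair could then be handed to one channel as a separate mem-to-mem request, with the caller waiting for all completions before reusing the buffers.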

Thanks & Regards,

Thank you Sandipan, I will find a way.