PCIe DMA doesn't work for L4T 24.1

Is the FPGA card off-the-shelf? We may want to get one to run the same test. If the issue is confirmed, we don’t mind switching from Xilinx to Altera and buying IP if necessary.

The problem was with my HDL PCIe controller.

I have to thank Ron for pointing this out.

There is a part of the PCIe specification that is essentially optional: when you request data from the host computer, you are supposed to specify the byte enables (byte mask) in the read request. The root port can optionally check this mask. The desktop computer’s root port didn’t check it; the TX1’s did.
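For anyone hitting the same corner of the spec: the byte enables live in the second DWORD of the memory read request TLP header. A rough sketch of how that word is packed (field layout per the PCIe spec; the function name is just illustrative):

	#include <stdint.h>

	/* Pack DW1 of a PCIe Memory Read Request TLP header:
	 * [31:16] Requester ID, [15:8] Tag,
	 * [7:4] Last DW byte enables, [3:0] First DW byte enables.
	 * For a multi-DW read both BE fields should normally be 0xF;
	 * leaving them 0 is what a lenient root port tolerates but
	 * the TX1 root port rejected. */
	static uint32_t tlp_mrd_dw1(uint16_t requester_id, uint8_t tag,
	                            uint8_t last_be, uint8_t first_be)
	{
	    return ((uint32_t)requester_id << 16) |
	           ((uint32_t)tag << 8) |
	           ((uint32_t)(last_be & 0xF) << 4) |
	           (uint32_t)(first_be & 0xF);
	}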

Thanks for all the feedback.

Dave

@Leon The Xilinx IP isn’t to blame; it works really well. The problem was with my IP.

Hi yahoo2016,
Have you tried running the script that increases the TX1 clock frequencies?

Hi Vidyas,

I tried increasing the CPU clock but got no improvement. The DMA test programs from both FPGA vendors are user-space programs.

I identified the bottleneck as the “copy_to_user” function in the kernel.

It seems there is a bug in the implementation of “copy_to_user”; 200 MB/s is too slow.

So, this is an issue with your driver then…! Please update the perf numbers after it is fixed.

“copy_to_user” on the TX1 is at least 4 times slower than on Intel CPUs.

We have used the same driver on Intel CPUs for years without DMA throughput issues.

The Intel CPU/kernel must have a more efficient way to implement “copy_to_user”.

For a TX1 running at a CPU clock above 1 GHz with a 32-bit(?)-wide SDRAM interface, it should not be only 200 MB/s.

The “copy_to_user” function is a kernel function, not part of the driver.

Drivers use this kernel function to copy data from the kernel-space DMA buffer to user space.

I tested user-space “memcpy” on the TX1 under L4T 24.1 and got 3.5 GB/s for a transfer size of 4 MB.

The kernel function “copy_to_user” is more than 10 times slower than the user-space “memcpy” function.

The CPU clock was at its maximum of 1.734 GHz, and the EMC (memory) clock was at its maximum of 1.6 GHz.

It seems there is a bug in the “copy_to_user” function in L4T 24.1.

Can Nvidia test the throughput of the “copy_to_user” kernel function in L4T 24.1?

Thanks

Can you describe how you profiled copy_to_user() and memcpy()? Did the test run in atomic context?

Were the user-space buffers already populated (i.e., were the PTEs filled in)?
If not, there can be some page-fault delay.

Can you profile access_ok() as well? Other than access_ok(), copy_to_user() is the same as memcpy().
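For reference, the structure being described looks roughly like this in the uaccess headers of that era (simplified; __copy_to_user is the arch-specific copy loop):

	static inline unsigned long copy_to_user(void __user *to,
	                                         const void *from,
	                                         unsigned long n)
	{
	    /* Validate the destination range, then do a plain copy.
	     * Returns the number of bytes that could NOT be copied. */
	    if (access_ok(VERIFY_WRITE, to, n))
	        n = __copy_to_user(to, from, n);
	    return n;
	}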

For user-space “memcpy” I used the “clock” function at the start and end of a loop that calls “memcpy” on “src” and “dst” 10,000 times with a size of 4 MB, and calculated throughput from that.
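Roughly, the test looks like this (a reconstruction, not the exact code; buffer handling details are illustrative):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define SIZE  (4UL * 1024 * 1024)   /* 4 MB per copy */
	#define ITERS 10000

	int main(void)
	{
	    char *src = malloc(SIZE), *dst = malloc(SIZE);
	    if (!src || !dst)
	        return 1;
	    memset(src, 0xA5, SIZE);   /* touch pages so PTEs are populated */
	    memset(dst, 0x5A, SIZE);

	    clock_t t0 = clock();
	    for (int i = 0; i < ITERS; i++)
	        memcpy(dst, src, SIZE);
	    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

	    printf("%.0f MB/s\n", (SIZE / 1048576.0) * ITERS / secs);
	    free(src);
	    free(dst);
	    return 0;
	}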

For the kernel “copy_to_user” function, the only thing I control is the transfer size passed to “copy_to_user”. I need to transfer at least 16 bytes to user space to verify the header of the image data. I kept the DMA size fixed (4 MB) but varied the size passed to “copy_to_user”. When the “copy_to_user” size was 16 bytes, I got a DMA throughput of about 700 MB/s. When the “copy_to_user” size was the same as the DMA size (4 MB), throughput dropped to 200 MB/s. That’s why I identified “copy_to_user” as the bottleneck.
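From the user-space side, the sweep looks roughly like this (the device node name is hypothetical; the driver is assumed to do a fixed 4 MB DMA per read() and hand back “size” bytes with “copy_to_user”):

	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	#define DMA_SIZE (4UL * 1024 * 1024)

	static double now_sec(void)
	{
	    struct timespec ts;
	    clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall time, since read() blocks */
	    return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	/* DMA-stream throughput (MB/s) for a given copy_to_user size. */
	static double sweep(int fd, size_t copy_size, int iters)
	{
	    static char buf[DMA_SIZE];
	    double t0 = now_sec();
	    for (int i = 0; i < iters; i++)
	        if (read(fd, buf, copy_size) < 0)   /* one 4 MB DMA per read */
	            return 0;
	    return (DMA_SIZE / 1048576.0) * iters / (now_sec() - t0);
	}

	int main(void)
	{
	    int fd = open("/dev/fpga_dma", O_RDONLY);   /* hypothetical node */
	    if (fd < 0)
	        return 1;
	    printf("16 B copies: %.0f MB/s\n", sweep(fd, 16, 1000));
	    printf("4 MB copies: %.0f MB/s\n", sweep(fd, DMA_SIZE, 1000));
	    return 0;
	}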

The user space buffers are allocated and set to zeros.

The tests were single-threaded, run right after the TX1 was powered up, with no other process accessing the buffers. System Monitor was used to watch CPU and memory usage. When the DMAs started, the usage of one CPU rose from 0 to 60%.

I hope it is not asking too much for Nvidia to run some throughput tests on “copy_to_user”, since Nvidia knows the TX1 best.

We will check it and get back to you on this.

Thanks for the effort.

We work with another very reputable FPGA vendor whose DMA driver does not use “copy_to_user” but instead (quote)


pins the user-mode buffer in memory and (with the help of the kernel) builds a scatter-gather description of it.

This should be a zero-copy operation, but it will not be zero-copy if the kernel has to do “bounce buffering”.


So the low DMA throughput (200 MB/s) for the second vendor’s high-end Virtex-7 card was not due to “copy_to_user” but to something else (e.g., “bounce buffering”).

Again, the same FPGA card/driver/software tested on an Intel CPU achieved at least 4 times the DMA throughput of the TX1.
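For reference, the pin-and-build-a-scatter-gather approach the vendor describes usually looks something like this on the kernel side (a sketch under assumptions, not their code; error unwinding and page unpinning are omitted for brevity):

	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/scatterlist.h>
	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	/* Pin a user buffer and describe it to the DMA layer. */
	static int map_user_buffer(struct device *dev, unsigned long uaddr,
	                           size_t len, struct sg_table *sgt)
	{
	    unsigned int offset = uaddr & ~PAGE_MASK;
	    int n = DIV_ROUND_UP(len + offset, PAGE_SIZE);
	    struct page **pages = kcalloc(n, sizeof(*pages), GFP_KERNEL);

	    if (!pages)
	        return -ENOMEM;

	    /* Pin the pages so they can't move or be swapped during DMA. */
	    if (get_user_pages_fast(uaddr, n, 1 /* write */, pages) != n)
	        goto err;

	    /* One scatter-gather table over the (non-contiguous) pages. */
	    if (sg_alloc_table_from_pages(sgt, pages, n, offset, len,
	                                  GFP_KERNEL))
	        goto err;

	    /* This is the step where "bounce buffering" kicks in if the
	     * device cannot address the pages directly. */
	    if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE))
	        goto err;

	    kfree(pages);
	    return 0;
	err:
	    kfree(pages);
	    return -EFAULT;
	}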

Ok
So our understanding at this point is that copy_to_user is not the bottleneck.
Could you please reiterate what your expectation is, and what you want us to do?

From my posts:
(1). The DMA driver of the first FPGA vendor uses “copy_to_user”, which is the bottleneck. I have asked Nvidia to test “copy_to_user”. This has not changed.
(2). The DMA driver of the second FPGA vendor does not use “copy_to_user”; that vendor believes the bottleneck for their driver is “bounce buffering” on the TX1.
There are multiple issues limiting DMA throughput to user space on the TX1 that do not appear on Intel CPUs.
If Nvidia can solve either of these bottlenecks, we can use the driver from that vendor.
I thought Nvidia might already be aware of the “bounce buffering” issue, so I mentioned it; if not, please test the “copy_to_user” function. There are other applications that may need “copy_to_user”.

We have profiled the copy_to_user() API and could get > 1 GB/s throughput.
We would like to know some more information from you w.r.t. the throughput issue you are observing:
→ What is the source memory of the copy_to_user() API in your driver? Is it the PCIe device’s BAR?
→ If it is allocated in the kernel, can you please tell us the API used to allocate that memory? Also, is it possible to take a look at the endpoint driver in question here?

The first vendor’s driver does the following:

(1). calls dma_alloc_coherent with a size of 4 MB:

	CPUBufferAddress = dma_alloc_coherent(&dev->dev, BufferLength, &DMABufferAddress, GFP_KERNEL);

(2). uses DMABufferAddress as the DMA destination address:

	iowrite32(DMABufferAddress, CDMA_DA);

(3). the “read” driver function uses “copy_to_user” to copy image data from the kernel-space “CPUBufferAddress” to the user-space “buffer”:

	copy_to_user(buffer, CPUBufferAddress, BufferLength);
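Put together as a minimal, self-contained sketch (the CDMA_DA register name and 4 MB size come from the fragments above; everything else, including error handling, is illustrative):

	#include <linux/kernel.h>
	#include <linux/dma-mapping.h>
	#include <linux/fs.h>
	#include <linux/io.h>
	#include <linux/uaccess.h>

	#define BUFFER_LENGTH (4UL * 1024 * 1024)

	static void *CPUBufferAddress;        /* kernel virtual address */
	static dma_addr_t DMABufferAddress;   /* bus address for the FPGA */
	static void __iomem *CDMA_DA;         /* mapped destination register */

	static int dma_setup(struct device *dev)
	{
	    CPUBufferAddress = dma_alloc_coherent(dev, BUFFER_LENGTH,
	                                          &DMABufferAddress, GFP_KERNEL);
	    if (!CPUBufferAddress)
	        return -ENOMEM;
	    /* Point the CDMA engine at the coherent buffer (low 32 bits). */
	    iowrite32(lower_32_bits(DMABufferAddress), CDMA_DA);
	    return 0;
	}

	static ssize_t fpga_read(struct file *f, char __user *buf,
	                         size_t len, loff_t *off)
	{
	    size_t n = min_t(size_t, len, BUFFER_LENGTH);

	    /* copy_to_user() returns the number of bytes NOT copied. */
	    if (copy_to_user(buf, CPUBufferAddress, n))
	        return -EFAULT;
	    return n;
	}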

We can’t release the vendor’s driver source without permission.

Can you post your code for profiling “copy_to_user” so we can run it and compare with the vendor’s driver?

Thanks

Unlike x86 systems, Tegra doesn’t have support for IO data coherency, i.e., data in the CPU cache doesn’t get updated automatically when the corresponding data in main memory is updated by an IO device.
It is for this reason that dma_alloc_coherent() returns a buffer from an uncached region on Tegra, whereas on an x86 system it would come from a cached region; hence the perf difference.
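The usual alternative on a non-coherent SoC is streaming DMA: allocate ordinary cached memory and manage coherency explicitly around each transfer. A minimal sketch (names and sizes illustrative):

	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	static void *buf;            /* cached kernel memory */
	static dma_addr_t handle;

	static int setup(struct device *dev, size_t size)
	{
	    buf = kmalloc(size, GFP_KERNEL);
	    if (!buf)
	        return -ENOMEM;
	    handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
	    if (dma_mapping_error(dev, handle)) {
	        kfree(buf);
	        return -EIO;
	    }
	    return 0;
	}

	/* After the device finishes a DMA write into buf: */
	static void after_dma(struct device *dev, size_t size)
	{
	    /* Invalidate stale cache lines so the CPU sees the new data;
	     * the subsequent copy_to_user() then runs from cached memory. */
	    dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);
	    /* ... copy_to_user(user_buf, buf, size) ... */
	    dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);
	}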

That makes sense. Is there a solution that lets the device DMA into Tegra memory while the CPU accesses the DMA data at the same time, without the performance degradation?