PCIe DMA doesn't work for L4T 24.1

Is the FPGA card off-the-shelf? We may want to get one to run the same test. If the issue is confirmed, we don’t mind switching from Xilinx to Altera and buying IP if necessary.

The problem was with my HDL PCIe controller.

I have to thank Ron for pointing this out.

There is a part of the PCIe specification that is essentially optional: when you request data from the host computer, you are supposed to specify the byte enables (byte mask) in the read request. The root port can optionally check this mask. The desktop computer’s root port didn’t check it; the TX1’s did.
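For anyone hitting the same corner of the spec: the byte enables live in the second DWORD of the memory read request TLP header. A rough sketch of how that word is packed (field layout per the PCIe spec; the function name is just illustrative):

	#include <stdint.h>

	/* Pack DW1 of a PCIe Memory Read Request TLP header:
	 * [31:16] Requester ID, [15:8] Tag,
	 * [7:4] Last DW byte enables, [3:0] First DW byte enables.
	 * For a multi-DW read both BE fields should normally be 0xF;
	 * leaving them 0 is what a lenient root port tolerates but
	 * the TX1 root port rejected. */
	static uint32_t tlp_mrd_dw1(uint16_t requester_id, uint8_t tag,
	                            uint8_t last_be, uint8_t first_be)
	{
	    return ((uint32_t)requester_id << 16) |
	           ((uint32_t)tag << 8) |
	           ((uint32_t)(last_be & 0xF) << 4) |
	           (uint32_t)(first_be & 0xF);
	}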

Thanks for all the feedback.

Dave

@Leon The Xilinx IP isn’t to blame; it works really well. The problem was with my IP.

Hi yahoo2016,
Have you tried running the script that increases the TX1 clock frequencies?

Hi Vidyas,

I tried increasing the CPU clock but got no improvement. The DMA test programs from both FPGA vendors are user-space programs.

I identified the bottleneck as the “copy_to_user” function in the kernel.

It seems there is a bug in the implementation of “copy_to_user”; 200 MB/s is too slow.

So, this is an issue with your driver then…! Please update the perf numbers after it is fixed.

“copy_to_user” on the TX1 is at least 4 times slower than on Intel CPUs.

We have used the same driver on Intel CPUs for years without DMA throughput issues.

The Intel CPU/kernel must have a more efficient way to implement “copy_to_user”.

For a TX1 running at a CPU clock above 1 GHz with a 32-bit(?)-wide SDRAM interface, it should not be only 200 MB/s.

The “copy_to_user” function is a kernel function, not part of the driver.

Drivers use this kernel function to copy data from the kernel-space DMA buffer to user space.

I tested user-space “memcpy” on the TX1 under L4T 24.1 and got 3.5 GB/s for a transfer size of 4 MB.

The kernel function “copy_to_user” is more than 10 times slower than the user-space “memcpy” function.

The CPU clock was at its maximum of 1.734 GHz, and the EMC (memory) clock was at its maximum of 1.6 GHz.

It seems there is a bug in the “copy_to_user” function in L4T 24.1.

Can Nvidia test the throughput of the “copy_to_user” kernel function in L4T 24.1?

Thanks

Can you describe how you profiled copy_to_user() and memcpy()? Did the test run in atomic context?

Were the user-space buffers already populated (i.e., were the PTEs filled in)?
If not, there can be some page-fault delay.

Can you profile access_ok() as well? Other than access_ok(), copy_to_user() is the same as memcpy().
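For reference, the structure being described looks roughly like this in the uaccess headers of that era (simplified; __copy_to_user is the arch-specific copy loop):

	static inline unsigned long copy_to_user(void __user *to,
	                                         const void *from,
	                                         unsigned long n)
	{
	    /* Validate the destination range, then do a plain copy.
	     * Returns the number of bytes that could NOT be copied. */
	    if (access_ok(VERIFY_WRITE, to, n))
	        n = __copy_to_user(to, from, n);
	    return n;
	}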

For user-space “memcpy” I used the “clock” function at the start and end of a loop that calls “memcpy” on “src” and “dst” 10,000 times with a size of 4 MB, and calculated throughput from that.
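Roughly, the test looks like this (a reconstruction, not the exact code; buffer handling details are illustrative):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define SIZE  (4UL * 1024 * 1024)   /* 4 MB per copy */
	#define ITERS 10000

	int main(void)
	{
	    char *src = malloc(SIZE), *dst = malloc(SIZE);
	    if (!src || !dst)
	        return 1;
	    memset(src, 0xA5, SIZE);   /* touch pages so PTEs are populated */
	    memset(dst, 0x5A, SIZE);

	    clock_t t0 = clock();
	    for (int i = 0; i < ITERS; i++)
	        memcpy(dst, src, SIZE);
	    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

	    printf("%.0f MB/s\n", (SIZE / 1048576.0) * ITERS / secs);
	    free(src);
	    free(dst);
	    return 0;
	}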

For the kernel “copy_to_user” function, the only thing I control is the transfer size passed to “copy_to_user”. I need to transfer at least 16 bytes to user space to verify the header of the image data. I kept the DMA size fixed (4 MB) but varied the size passed to “copy_to_user”. When the “copy_to_user” size was 16 bytes, I got a DMA throughput of about 700 MB/s. When the “copy_to_user” size was the same as the DMA size (4 MB), throughput dropped to 200 MB/s. That’s why I identified “copy_to_user” as the bottleneck.
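From the user-space side, the sweep looks roughly like this (the device node name is hypothetical; the driver is assumed to do a fixed 4 MB DMA per read() and hand back “size” bytes with “copy_to_user”):

	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	#define DMA_SIZE (4UL * 1024 * 1024)

	static double now_sec(void)
	{
	    struct timespec ts;
	    clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall time, since read() blocks */
	    return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	/* DMA-stream throughput (MB/s) for a given copy_to_user size. */
	static double sweep(int fd, size_t copy_size, int iters)
	{
	    static char buf[DMA_SIZE];
	    double t0 = now_sec();
	    for (int i = 0; i < iters; i++)
	        if (read(fd, buf, copy_size) < 0)   /* one 4 MB DMA per read */
	            return 0;
	    return (DMA_SIZE / 1048576.0) * iters / (now_sec() - t0);
	}

	int main(void)
	{
	    int fd = open("/dev/fpga_dma", O_RDONLY);   /* hypothetical node */
	    if (fd < 0)
	        return 1;
	    printf("16 B copies: %.0f MB/s\n", sweep(fd, 16, 1000));
	    printf("4 MB copies: %.0f MB/s\n", sweep(fd, DMA_SIZE, 1000));
	    return 0;
	}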

The user space buffers are allocated and set to zeros.

The tests were single-threaded, run right after the TX1 was powered up, with no other process accessing the buffers. System Monitor was used to watch CPU and memory usage. When the DMAs started, the usage of one CPU rose from 0 to 60%.

I hope it is not asking too much for Nvidia to run some throughput tests on “copy_to_user”, since Nvidia knows the TX1 best.

We will check it and get back to you on this.

Thanks for the effort.

We work with another very reputable FPGA vendor whose DMA driver does not use “copy_to_user” but instead (quote)


pins the user-mode buffer in memory and (with the help of the kernel) builds a scatter-gather description of it.

This should be a zero-copy operation, but it will not be zero-copy if the kernel has to do “bounce buffering”.


So the low DMA throughput (200 MB/s) for the second vendor’s high-end Virtex-7 card was not due to “copy_to_user” but to something else (e.g., “bounce buffering”).

Again, the same FPGA card/driver/software tested on an Intel CPU achieved at least 4 times the DMA throughput of the TX1.
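For reference, the pin-and-build-a-scatter-gather approach the vendor describes usually looks something like this on the kernel side (a sketch under assumptions, not their code; error unwinding and page unpinning are omitted for brevity):

	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/scatterlist.h>
	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	/* Pin a user buffer and describe it to the DMA layer. */
	static int map_user_buffer(struct device *dev, unsigned long uaddr,
	                           size_t len, struct sg_table *sgt)
	{
	    unsigned int offset = uaddr & ~PAGE_MASK;
	    int n = DIV_ROUND_UP(len + offset, PAGE_SIZE);
	    struct page **pages = kcalloc(n, sizeof(*pages), GFP_KERNEL);

	    if (!pages)
	        return -ENOMEM;

	    /* Pin the pages so they can't move or be swapped during DMA. */
	    if (get_user_pages_fast(uaddr, n, 1 /* write */, pages) != n)
	        goto err;

	    /* One scatter-gather table over the (non-contiguous) pages. */
	    if (sg_alloc_table_from_pages(sgt, pages, n, offset, len,
	                                  GFP_KERNEL))
	        goto err;

	    /* This is the step where "bounce buffering" kicks in if the
	     * device cannot address the pages directly. */
	    if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE))
	        goto err;

	    kfree(pages);
	    return 0;
	err:
	    kfree(pages);
	    return -EFAULT;
	}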

Ok
So our understanding at this point is that copy_to_user is not the bottleneck.
Could you please reiterate what your expectation is, and what you want us to do?

From my posts:
(1). The DMA driver of the first FPGA vendor uses “copy_to_user”, which is the bottleneck. I have asked Nvidia to test “copy_to_user”. This has not changed.
(2). The DMA driver of the second FPGA vendor does not use “copy_to_user”; that vendor believes the bottleneck for their driver is “bounce buffering” on the TX1.
There are multiple issues limiting DMA throughput to user space on the TX1 that do not appear on Intel CPUs.
If Nvidia can solve either of these bottlenecks, we can use the driver from that vendor.
I thought Nvidia might already be aware of the “bounce buffering” issue, so I mentioned it; if not, please test the “copy_to_user” function. There are other applications that may need “copy_to_user”.

We have profiled the copy_to_user() API and could get > 1 GB/s throughput.
We would like to know some more information from you w.r.t. the throughput issue you are observing:
→ What is the source memory of the copy_to_user() API in your driver? Is it the PCIe device’s BAR?
→ If it is allocated in the kernel, can you please tell us the API used to allocate that memory? Also, is it possible to take a look at the endpoint driver in question here?

The first vendor’s driver does the following:

(1). calls dma_alloc_coherent with a size of 4 MB:

	CPUBufferAddress = dma_alloc_coherent(&dev->dev, BufferLength, &DMABufferAddress, GFP_KERNEL);

(2). uses DMABufferAddress as the DMA destination address:

	iowrite32(DMABufferAddress, CDMA_DA);

(3). the “read” driver function uses “copy_to_user” to copy image data from the kernel-space “CPUBufferAddress” to the user-space “buffer”:

	copy_to_user(buffer, CPUBufferAddress, BufferLength);
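Put together as a minimal, self-contained sketch (the CDMA_DA register name and 4 MB size come from the fragments above; everything else, including error handling, is illustrative):

	#include <linux/kernel.h>
	#include <linux/dma-mapping.h>
	#include <linux/fs.h>
	#include <linux/io.h>
	#include <linux/uaccess.h>

	#define BUFFER_LENGTH (4UL * 1024 * 1024)

	static void *CPUBufferAddress;        /* kernel virtual address */
	static dma_addr_t DMABufferAddress;   /* bus address for the FPGA */
	static void __iomem *CDMA_DA;         /* mapped destination register */

	static int dma_setup(struct device *dev)
	{
	    CPUBufferAddress = dma_alloc_coherent(dev, BUFFER_LENGTH,
	                                          &DMABufferAddress, GFP_KERNEL);
	    if (!CPUBufferAddress)
	        return -ENOMEM;
	    /* Point the CDMA engine at the coherent buffer (low 32 bits). */
	    iowrite32(lower_32_bits(DMABufferAddress), CDMA_DA);
	    return 0;
	}

	static ssize_t fpga_read(struct file *f, char __user *buf,
	                         size_t len, loff_t *off)
	{
	    size_t n = min_t(size_t, len, BUFFER_LENGTH);

	    /* copy_to_user() returns the number of bytes NOT copied. */
	    if (copy_to_user(buf, CPUBufferAddress, n))
	        return -EFAULT;
	    return n;
	}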

We can’t release the vendor’s driver source without permission.

Can you post your code for profiling “copy_to_user” so we can run it and compare with the vendor’s driver?

Thanks

Unlike x86 systems, Tegra doesn’t have support for IO data coherency, i.e., data in the CPU cache doesn’t get updated automatically when the corresponding data in main memory is updated by an IO device.
It is for this reason that dma_alloc_coherent() returns a buffer from an uncached region on Tegra, whereas on an x86 system it would come from a cached region; hence the perf difference.
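The usual alternative on a non-coherent SoC is streaming DMA: allocate ordinary cached memory and manage coherency explicitly around each transfer. A minimal sketch (names and sizes illustrative):

	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	static void *buf;            /* cached kernel memory */
	static dma_addr_t handle;

	static int setup(struct device *dev, size_t size)
	{
	    buf = kmalloc(size, GFP_KERNEL);
	    if (!buf)
	        return -ENOMEM;
	    handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
	    if (dma_mapping_error(dev, handle)) {
	        kfree(buf);
	        return -EIO;
	    }
	    return 0;
	}

	/* After the device finishes a DMA write into buf: */
	static void after_dma(struct device *dev, size_t size)
	{
	    /* Invalidate stale cache lines so the CPU sees the new data;
	     * the subsequent copy_to_user() then runs from cached memory. */
	    dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);
	    /* ... copy_to_user(user_buf, buf, size) ... */
	    dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);
	}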

That makes sense. Is there a solution that lets the device DMA into Tegra memory while the CPU accesses the DMA data at the same time, without the performance degradation?