PCIe FPGA access woes

sktpin · October 17, 2018, 9:54am

Hey,

Up front: I’m not sure this is the right forum for this. But perhaps there are some folks here who know something about the hardware structure of the Tegra K1, which may be less the case where I asked before (e.g. Xilinx), so maybe someone can throw some new insights at me…
Or points me to another (sub-)forum ;)

What’s this about:

I have an Avionic Design “Meerkat” board with the NVIDIA Tegra K1 on it.
It runs L4T Linux, kernel 3.10.105, I think plus some Avionic-specific drivers from their GitHub.
An FPGA devboard (Artix-7) is connected to the ARM CPU on the TK1 via PCI Express.
The FPGA design defines some register blocks and a 64 KB memory area, all of which are memory-mapped.

I have attempted to communicate with the FPGA in two ways:

First, via the /dev/mem driver, doing two mmap() calls: one for the registers, one for the memory area used for the main data transfer.
The Xilinx page called “Accessing+BRAM+In+Linux” claims that opening /dev/mem with the O_SYNC flag causes the mapping of the physical (FPGA/PCIe) addresses into virtual memory to be non-cached, and the very slow access would seem to confirm this.
Copying the 64K a few tens of times yields a transfer rate of ~2 MByte/s, which I would consider extremely slow for PCIe (not to mention 4 lanes). It’s done with a for loop over 32-bit words, though, as memcpy freezes.
But: I see updates on the memory-mapped registers (used e.g. for handshaking) reliably only when I have a GDB breakpoint in the CPU code, right before reading the register.
From my perspective of limited knowledge of these things, this still sounds cache-related?
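
For reference, the /dev/mem approach described above could be sketched roughly as follows. This is only a sketch under assumptions: the helper names `map_phys` and `mmio_copy32` are made up for illustration, and the physical address has to come from the BAR assignment on the actual system. The pointers are `volatile`-qualified so that the compiler re-reads the device on every access; without that, an optimizing compiler is allowed to turn a polling loop into a single read, which is one possible explanation, unrelated to caches, for a register that only seems to update under a debugger.

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t on 32-bit ARM, for high physical addresses */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of physical address space starting at the page-aligned
 * address `phys`. Returns NULL on failure. `dev_path` is normally "/dev/mem". */
volatile uint32_t *map_phys(const char *dev_path, off_t phys, size_t len)
{
    int fd = open(dev_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
    close(fd);                      /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Copy `nwords` 32-bit words from a device window into ordinary memory,
 * one aligned 32-bit access at a time (the "for loop" variant). */
void mmio_copy32(uint32_t *dst, const volatile uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```

A handshake register would then be read as `regs[index]` through the pointer returned by `map_phys`, never through a non-volatile copy of that pointer.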
Second, via the Xilinx XDMA driver: I compiled the driver provided in Xilinx_Answer_65444_Linux against the L4T kernel source, for ARM gnueabihf.
I am aware that, in the given form, the driver is spec’d for x86 systems only. I was told that the limitation is probably due to non-x86 systems, such as (maybe?) the ARM Cortex-A on the TK1, potentially being non-cache-coherent, resulting in the CPU still accessing old data from its cache while the DMA engine writes directly to RAM. Can someone confirm this is the case on the TK1?
So the driver probably needs to be modified. Since that is a road block for me, not being a kernel dev so far, I first tried out what happens as-is.

So, while blindly copying the 64K block a few times via DMA (using /dev/xdma0_c2h0) yields a usable 200 MB/s, my handshaking registers (accessed via /dev/xdma0_user, which reaches the registers without DMA, again using mmap() to map them into user space) still have the same problem:
It only works when there is a breakpoint.
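
To make the DMA read path concrete, here is a minimal sketch of pulling a fixed-size block out of a character device with plain read(). The helper name `read_full` is hypothetical, and it is an assumption that the XDMA device node behaves like an ordinary file here; whether it ever returns short reads is not something this thread establishes, so the loop simply tolerates them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes from `fd`, retrying on short reads and EINTR.
 * Returns the number of bytes actually read (less than `len` only on
 * end of file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, try again */
            return -1;
        }
        if (n == 0)
            break;                  /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Usage would be along the lines of opening /dev/xdma0_c2h0 with open(..., O_RDONLY) and calling `read_full(fd, buf, 64 * 1024)` once per block.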

Now if I’d only see this with the “x86 only!” xdma driver, I’d not be surprised, for the reasons outlined above. But with /dev/mem, which is a part of Linux itself, too?
I don’t have much of an idea what to further look for.

Does this make sense to anyone here?

Can someone point me to the proper diagnostic tools to use, for this kind of scenario? At the moment it’s a bit like poking at an alien space probe with a stick.
(I have no kernel dev experience, only done some bare metal MCU work)

Regards,
SK

PS. Sorry for not posting relevant weblinks, but when my first post at another forum had links, it was marked as spam in the blink of an eye.

vidyas · October 18, 2018, 2:57pm

The first one looks to be some issue with caching. But I’m not sure whether the O_SYNC flag would really make the memory uncached.
As far as the second option is concerned, I’m assuming the Xilinx XDMA driver is a PCIe client driver which runs on the host, in which case there shouldn’t be any dependency on the architecture of the host platform. Isn’t that driver using the dma_map_* APIs to allocate memory in Tegra’s system memory (which takes care of coherency issues between the CPU and the PCIe IP)? Is it possible to share this driver offline?

sktpin · October 18, 2018, 3:12pm

Thanks for your reply.
I have searched the driver source code for dma_map, and not found anything.

The source code is here, at the very bottom, Xilinx_Answer_65444_Linux_Files:
[url]https://www.xilinx.com/support/answers/65444.html[/url]

sktpin · October 22, 2018, 9:09am

Oh, by the way, what comes to mind: while inconsistent data may be expected when using DMA memory, the problem observed for now already occurs when reading registers, through the part of the driver which does not employ DMA. E.g. in the Xilinx driver that is the /dev/xdma0_user file, opened and then also mapped with mmap.
So DMA isn’t even the issue, not yet anyway.

Forget that part. That was me being misled by a false similarity I saw with an earlier scenario, and thus not noticing a bug in the current state of the code.
I do get the register content I wanted (read without DMA) with the value expected.

What remains now is getting wrong bulk data via DMA, for a part of the buffer anyway.
Could something else still be wrong?
Otherwise, I guess that’s the part where I need to modify the driver, that’s gonna be fun ;)

vidyas · October 26, 2018, 7:38am

Is it always the case that the wrong data starts from a specific offset, and is that offset related to the system’s page size in any way? Also, is the driver doing any page-level attribute changes?
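
One way to check this from user space is to compare the received buffer against the expected contents and see where the first difference falls relative to the page size. The sketch below assumes the expected data is known (for example a test pattern generated by the FPGA); the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Return the offset of the first differing byte, or -1 if the buffers match. */
long first_mismatch(const uint8_t *got, const uint8_t *want, size_t len)
{
    assert(got != NULL && want != NULL);
    for (size_t i = 0; i < len; i++)
        if (got[i] != want[i])
            return (long)i;
    return -1;
}

/* Print where the first mismatch is and whether it sits on a page boundary. */
void report_mismatch(const uint8_t *got, const uint8_t *want, size_t len)
{
    long off = first_mismatch(got, want, len);
    long page = sysconf(_SC_PAGESIZE);

    if (off < 0) {
        printf("buffers match\n");
        return;
    }
    if (page <= 0) {
        printf("first mismatch at offset %ld (page size unknown)\n", off);
        return;
    }
    printf("first mismatch at offset %ld (page size %ld, %s)\n",
           off, page, off % page == 0 ? "page-aligned" : "not page-aligned");
}
```

If the mismatch offset is consistently a multiple of the page size across runs, that would point toward a mapping or scatter-gather problem rather than random corruption.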

sktpin · November 2, 2018, 11:36am

Currently, wrong data is not a problem. If there is a potential caching issue, that’s not what is currently showing. The other stuff was due to bugs.
But now I found a Xilinx employee mentioning that the XDMA driver does not really implement the user interrupts (those would apparently be exposed via file reads from special device files, which would sleep until an event occurred), so currently I’m polling, eating 85% CPU, which is not a viable solution ;)
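
As a stopgap while interrupt-driven waiting is unavailable, the polling loop can sleep between register reads instead of spinning. This is only a sketch: `wait_for_bits` is a hypothetical helper, and the register pointer is assumed to come from an existing mmap() of the register block. It trades handshake latency (up to one sleep interval, plus scheduler jitter) for much lower CPU load.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Poll *reg until any bit in `mask` is set or `timeout_ms` has elapsed,
 * sleeping `sleep_us` microseconds between polls (sleep_us must be
 * below 1000000). Returns 0 when a bit was seen, -1 on timeout. */
int wait_for_bits(const volatile uint32_t *reg, uint32_t mask,
                  long timeout_ms, long sleep_us)
{
    assert(reg != NULL);
    struct timespec start, now;
    struct timespec pause = { .tv_sec = 0, .tv_nsec = sleep_us * 1000L };

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        if (*reg & mask)            /* volatile: re-read on every pass */
            return 0;

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms)
            return -1;

        nanosleep(&pause, NULL);    /* yield the CPU instead of spinning */
    }
}
```

A call such as `wait_for_bits(&regs[index], 0x1, 1000, 200)` would check roughly every 200 microseconds for up to one second; the right interval depends on how quickly the handshake has to be noticed.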

sktpin · November 5, 2018, 10:57am

So, I’ve found another interesting driver (+ FPGA IP counter part):
https://github.com/maltevesper/JetStream-driver/blob/master/driver/fpga_driver.c

It appears to use two x86-specific functions, set_memory_uc and set_memory_wb, at the bottom:

static int mmap(struct file* file, struct vm_area_struct* vma) {
	int ret;
	int slot     = (uintptr_t)PDE_DATA(__parent(__parent(file->f_path.dentry))->d_inode);
	int channel  = (uintptr_t)proc_get_parent_data(file->f_inode);
	int bufferNo = (uintptr_t)PDE_DATA(file->f_inode);

	struct fpga_board* board = &FPCI3.boards[slot];
	struct buffer* buffer = &board->channel[channel].buffer[bufferNo];

	//printk("DATA Board id %d:%d:%d\n", slot, channel, buffer);

	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
	//vma->vm_page_prot = 0;
	vma->vm_page_prot   = pgprot_noncached(vma->vm_page_prot);

	//return ENOMEM;
	set_memory_uc((uintptr_t)buffer->memoryAddress, getPageCount(BUF_SIZE));
	ret=dma_mmap_attrs(&board->pcidev->dev, vma, buffer->memoryAddress, buffer->dmaAddress, BUF_SIZE, 0);
	set_memory_wb((uintptr_t)buffer->memoryAddress, getPageCount(BUF_SIZE));
	return ret;
}

At least it refuses to compile because of those. The article linked below mentions that they were introduced for x86, and I have not found anything about them in combination with “ARM”:
https://lwn.net/Articles/183225/

Can the intended behavior be replicated on the ARM CPU on NVIDIA TK1, with other means?
