PCIe FPGA access woes


Up front: I’m not sure this is the right forum for this. But there may be some folks here who know something about the hardware structure of the Tegra K1, more so than where I asked before (e.g. Xilinx), so maybe someone can throw some new insights at me…
Or points me to another (sub-)forum ;)

What’s this about:

I have an Avionic Design “Meerkat” board with the NVIDIA Tegra K1 on it.
It runs L4T Linux, kernel 3.10.105, plus (I think) some Avionic-specific drivers from their GitHub.
An FPGA devboard (Artix7) is connected to the ARM CPU on the TK1 via PCI Express.
The FPGA design defines some register blocks and a 64 KB memory area, all of which are memory-mapped.

I have attempted to communicate with the FPGA in two ways:

  1. Via the /dev/mem driver, doing two mmap() calls: one for the registers, one for the memory area used for the main data transfer.
    The Xilinx page called “Accessing+BRAM+In+Linux” claims that opening the /dev/mem file with the O_SYNC flag makes the mapping of the physical (FPGA/PCIe) addresses into virtual memory non-cached, and the very slow access I see would be consistent with that.
    Copying the 64 KB a few dozen times yields a transfer rate of ~2 MB/s, which I would consider extremely slow for PCIe (not to mention 4 lanes). The copy is done with a for loop in 32-bit words, though, because memcpy freezes.
    But: I see updates on the memory-mapped registers (used e.g. for handshaking) reliably only when I have a GDB breakpoint in the CPU code, right before reading the register.
    From my perspective of limited knowledge of these things, this still sounds cache-related?

  2. I compiled the Xilinx XDMA driver, provided in Xilinx_Answer_65444_Linux, against the L4T kernel source, for ARM gnueabihf.
    I am aware that, in the given form, the driver is spec’d for x86 systems only. I was told that the limitation is probably because non-x86 systems, such as maybe (?) the ARM Cortex-A on the TK1, are potentially not cache-coherent, so the CPU may still read old data from its cache while the DMA engine writes directly to RAM. Can someone confirm this is the case on the TK1?
    So the driver probably needs to be modified. Since that is a roadblock for me (I’m not a kernel dev so far), I first tried what happens with the driver as-is.
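For the /dev/mem route in item 1, here is a minimal user-space sketch of the mapping and the 32-bit word loop. It is not the code used on the board. The function names (map_window, copy_from_window, wait_for_reg) are made up for illustration, and the physical base address you pass in is whatever your BAR happens to be. The point of the sketch is the volatile qualifier: without it the compiler is allowed to read a handshake register once and reuse the value, which can look exactly like "only works with a breakpoint" even when the mapping itself is uncached.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes at page-aligned offset `phys` of `path` (e.g. "/dev/mem").
 * Returns a volatile pointer so the compiler cannot cache or drop accesses,
 * or NULL on failure. */
static volatile uint32_t *map_window(const char *path, off_t phys, size_t len)
{
	int fd = open(path, O_RDWR | O_SYNC);
	if (fd < 0)
		return NULL;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
	close(fd); /* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Copy `words` 32-bit words out of the window, one aligned load per word. */
static void copy_from_window(uint32_t *dst, const volatile uint32_t *src,
			     size_t words)
{
	for (size_t i = 0; i < words; i++)
		dst[i] = src[i];
}

/* Re-read a register until (reg & mask) == want, or give up after
 * max_polls reads. Returns 0 on match, -1 if the value never showed up. */
static int wait_for_reg(const volatile uint32_t *reg, uint32_t mask,
			uint32_t want, unsigned long max_polls)
{
	while (max_polls--) {
		if ((*reg & mask) == want)
			return 0;
	}
	return -1;
}
```

Whether memcpy freezes because it issues wide or unaligned accesses to device memory is only a guess on my side. The word loop above at least keeps every access 32-bit and aligned.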

So, while blindly copying the 64 KB block a number of times via DMA (using /dev/xdma0_c2h0) yields a usable 200 MB/s, my handshaking registers (using /dev/xdma0_user, which accesses registers without DMA and is again mapped into user space with mmap()) still have the same problem:
It only works when there is a breakpoint.

Now if I only saw this with the “x86 only!” XDMA driver, I would not be surprised, for the reasons outlined above. But with /dev/mem, which is part of Linux itself, too?
I don’t have much of an idea what to look for next.

Does this make sense to anyone here?

Can someone point me to the proper diagnostic tools to use, for this kind of scenario? At the moment it’s a bit like poking at an alien space probe with a stick.
(I have no kernel dev experience, only done some bare metal MCU work)


PS. Sorry for not posting the relevant web links, but when my first post at another forum had links, it was marked as spam in the blink of an eye.

The first one looks like some issue with caching. But I’m not sure the O_SYNC flag would really make the memory uncached.
As far as the second option is concerned, I’m assuming the Xilinx XDMA driver is a PCIe client driver that runs on the host, in which case there shouldn’t be any dependency on the architecture of the host platform. Isn’t that driver using the dma_map_* APIs to allocate memory in Tegra’s system memory (which takes care of the coherency issue between the CPU and the PCIe IP)? Is it possible to share this driver offline?

Thanks for your reply.
I have searched the driver source code for dma_map and have not found anything.

The source code is here, at the very bottom, Xilinx_Answer_65444_Linux_Files:

Oh, by the way, what comes to mind: inconsistent data may be expected when using DMA memory, but the problem observed for now is already with reading registers, through the part of the driver that does not employ DMA. In the Xilinx driver, for example, that is the /dev/xdma0_user file, opened and then also mapped with mmap.
So DMA isn’t even the issue, not yet anyway.

Forget that part. That was me being misled by a false similarity with an earlier scenario, and thus not noticing a bug in the current state of the code.
I do get the register content I wanted (read without DMA), with the expected value.

What remains now is getting wrong bulk data via DMA, for a part of the buffer anyway.
Could something else still be wrong?
Otherwise, I guess that’s the part where I need to modify the driver. That’s gonna be fun ;)

Is it always the case that the wrong data starts at a specific offset, and is that offset related to the system’s page size in any way? Also, is the driver doing any page-level attribute changes?

Currently, wrong data is not a problem. If there is a potential caching issue, that’s not what is currently showing. The other stuff was due to bugs.
But now I found a Xilinx employee mentioning that the XDMA driver does not really implement the user interrupts (these would apparently be exposed via reads from special device files, which would sleep until an event occurred), so currently I’m polling, eating 85% CPU. Not a viable solution ;)
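For reference, this is roughly what the user-space side of a sleeping wait would look like, as opposed to spinning on a register. It is only a sketch under assumptions: wait_for_event is a made-up helper name, the 32-bit event word is an assumed format, and it only helps if the driver actually implements poll and a blocking read on its event device file, which is exactly the part said to be missing.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Sleep until `fd` becomes readable or `timeout_ms` expires.
 * Returns 1 and stores the 32-bit event word in *events on success,
 * 0 on timeout, -1 on error. */
static int wait_for_event(int fd, int timeout_ms, uint32_t *events)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	/* Restart the wait if a signal interrupts it. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;              /* 0 = timeout, -1 = poll error */
	if (!(pfd.revents & POLLIN))
		return -1;              /* POLLERR / POLLHUP without data */

	return read(fd, events, sizeof *events) == (ssize_t)sizeof *events ? 1 : -1;
}
```

While the process sits in poll() it uses no CPU, so the 85% would go away if the kernel side wakes the file up from the interrupt handler.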

So, I’ve found another interesting driver (plus its FPGA IP counterpart):

It appears to use two x86-specific functions, set_memory_uc and set_memory_wb, at the bottom:

static int mmap(struct file* file, struct vm_area_struct* vma) {
	int ret;
	int slot     = (uintptr_t)PDE_DATA(__parent(__parent(file->f_path.dentry))->d_inode);
	int channel  = (uintptr_t)proc_get_parent_data(file->f_inode);
	int bufferNo = (uintptr_t)PDE_DATA(file->f_inode);

	struct fpga_board* board = &FPCI3.boards[slot];
	struct buffer* buffer = &board->channel[channel].buffer[bufferNo];

	//printk("DATA Board id %d:%d:%d\n", slot, channel, buffer);

	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
	//vma->vm_page_prot = 0;
	vma->vm_page_prot   = pgprot_noncached(vma->vm_page_prot);

	//return ENOMEM;
	set_memory_uc((uintptr_t)buffer->memoryAddress, getPageCount(BUF_SIZE));
	ret=dma_mmap_attrs(&board->pcidev->dev, vma, buffer->memoryAddress, buffer->dmaAddress, BUF_SIZE, 0);
	set_memory_wb((uintptr_t)buffer->memoryAddress, getPageCount(BUF_SIZE));
	return ret;
}

At least it refuses to compile because of those two functions. Here it’s mentioned that they were introduced for x86, and I have not found anything about them in combination with “ARM”:

Can the intended behavior be replicated on the ARM CPU on NVIDIA TK1, with other means?
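A possible direction, as a kernel-side sketch only: as far as I understand the ARM DMA mapping code, memory obtained from dma_alloc_coherent() is already mapped suitably for a non-coherent device, and dma_mmap_coherent() picks the matching page protection for the user mapping, so the set_memory_uc/set_memory_wb pair should not be needed. This assumes the buffer really was allocated with dma_alloc_coherent(), which I have not checked in that driver. The struct demo_buffer and both function names are hypothetical, and none of this has been built or run on the TK1.

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pci.h>

/* Hypothetical per-buffer bookkeeping; field names are illustrative only. */
struct demo_buffer {
	struct pci_dev *pcidev;
	void *cpu_addr;       /* kernel address from dma_alloc_coherent() */
	dma_addr_t dma_addr;  /* bus address to program into the FPGA */
	size_t size;
};

/* Allocate a buffer that CPU and device see consistently. */
static int demo_buffer_alloc(struct demo_buffer *buf, struct pci_dev *pdev,
			     size_t size)
{
	buf->cpu_addr = dma_alloc_coherent(&pdev->dev, size, &buf->dma_addr,
					   GFP_KERNEL);
	if (!buf->cpu_addr)
		return -ENOMEM;
	buf->pcidev = pdev;
	buf->size = size;
	return 0;
}

/* mmap handler; assumes open() stored the buffer in file->private_data. */
static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct demo_buffer *buf = file->private_data;

	/* No set_memory_uc()/set_memory_wb() here: the DMA API selects the
	 * page protection for the user mapping on this architecture. */
	return dma_mmap_coherent(&buf->pcidev->dev, vma, buf->cpu_addr,
				 buf->dma_addr, buf->size);
}
```

If that holds, the x86-only calls could simply be dropped (or put behind an #ifdef CONFIG_X86) rather than replaced one to one.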