Larger pinned cudaMemcpy calls causing PCIE to wait too long

We are on a TX1 with JetPack 3.2. On our system, we have an FPGA writing data over PCIE to DMA memory while the GPU is continuously doing work: kernel launches and pinned H <-> D transfers.
At lower data rates, the system works fine. However, at higher data rates (~3 Gbps over PCIE), the PCIE link has problems if cudaMemcpy calls are happening at the same time.

Without any cudaMemcpy calls, the system works as expected.

If I split each cudaMemcpy call (of a few MB) into ~20 smaller ones, the system still works fine.
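
For readers who want to try the same workaround, the splitting could be sketched roughly as below. This is a minimal, host-only illustration, not the code used in this thread: `copy_fn`, `host_copy` and `chunked_copy` are hypothetical names, and the callback stands in for whatever wraps `cudaMemcpy` in a real build.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: copies `bytes` from src to dst and returns 0 on
 * success. In a CUDA build this would wrap cudaMemcpy; it is kept generic here
 * so that the sketch compiles as plain C. */
typedef int (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Stand-in copy routine for trying the sketch on the host only. */
int host_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
    return 0;
}

/* Copy `total` bytes as a sequence of calls of at most `chunk` bytes each.
 * Returns the number of calls issued, or -1 if any call fails. */
long chunked_copy(void *dst, const void *src, size_t total,
                  size_t chunk, copy_fn copy)
{
    assert(chunk > 0);
    long calls = 0;
    size_t done = 0;
    while (done < total) {
        size_t n = total - done;
        if (n > chunk)
            n = chunk;  /* last chunk may be shorter */
        if (copy((char *)dst + done, (const char *)src + done, n) != 0)
            return -1;
        done += n;
        calls++;
    }
    return calls;
}
```

The chunk size is the tuning knob: smaller chunks give other DRAM clients more gaps between copies, at the cost of more per-call overhead.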

I find that larger cudaMemcpy calls cause the PCIE controller to wait for too long and overrun buffers.

This leads me to believe that the copy engine in the GPU hogs memory bandwidth when performing a transfer, and that PCIE cannot afford to wait too long for access to DRAM.
Is there a reason that GPU and PCIE are not sharing the bandwidth nicely?
Does the GPU transfer take priority over PCIE requests?
Can I assign priority to PCIE over GPU?

Akmal

"I find that larger cudaMemcpy calls cause the PCIE controller to wait for too long and overrun buffers"
How are these buffer overruns detected? At the PCIe level, we already have a flow control mechanism, right? Or are you saying that at the source (i.e. the FPGA), the buffers are not emptied in time (because of the PCIe bandwidth issue) and so get overrun with new data?

Our case is:
"At the source (i.e. FPGA), since the buffers are not emptied in time (as there is a bandwidth issue for PCIe), buffers get overrun with new data"

The FPGA has some small buffers in which it stores data. It can handle some back pressure: the buffer starts filling up, then empties once back pressure is no longer applied.

However, I am finding that when I perform cudaMemcpy calls in larger chunks instead of smaller ones (for the same amount of data), back pressure is applied for longer, causing the FPGA’s buffers to overrun.
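
As a rough illustration of why the length of each stall matters: at ~3 Gbps the FPGA produces about 375 bytes per microsecond, so the buffering it needs grows linearly with how long back pressure lasts. The sketch below just does that arithmetic. The stall durations and the buffer size are hypothetical inputs, since no measured values are given in this thread.

```c
#include <stdint.h>

/* Bytes the source produces while back pressure is applied.
 * rate_bps: data rate in bits per second (~3e9 in this thread).
 * stall_us: how long writes are held off, in microseconds
 *           (an assumed input, not a measured value). */
uint64_t bytes_during_stall(uint64_t rate_bps, uint64_t stall_us)
{
    return rate_bps / 8u * stall_us / 1000000u;
}

/* Returns 1 if a buffer of buf_bytes would overrun during the stall. */
int would_overrun(uint64_t buf_bytes, uint64_t rate_bps, uint64_t stall_us)
{
    return bytes_during_stall(rate_bps, stall_us) > buf_bytes;
}
```

For example, with a hypothetical 32 KiB buffer, a 50 µs stall fits (18,750 bytes accumulated) while a 100 µs stall does not (37,500 bytes).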

Hi, I have the same problem. How did you solve it?

Hi. I have since moved on to the TX2, which isn’t limited by the same problem, but I found a solution on the TX1 by making PCIE reads high priority in the SMMU. You can try the following fix, or perform the memcopies in smaller chunks.

Disclaimer: I do not claim that the following steps will “fix” the problem and hold no responsibility for the consequences of trying the steps. Try at your own risk.

Notes I made are below:

  1. Write to the PTSA (arbiter) registers to make PCIE reads and writes high priority on the TX1

// Following commands make PCIE high priority.
devmem 0x700194b4 w 0x000001
devmem 0x700194b0 w 0x000001

// Following commands return PCIE to normal priority.
devmem 0x700194b4 w 0x000000
devmem 0x700194b0 w 0x00003E

// Useful definitions.
#define MC_PCX_PTSA_MAX_REG_X1 0x700194b4
#define MC_PCX_PTSA_MIN_REG_X1 0x700194b0
#define MC_PCX_PTSA_MAX_DEFAULT_X1 0x000000
#define MC_PCX_PTSA_MIN_DEFAULT_X1 0x00003E
#define MC_PCX_PTSA_MAX_PRIORITISE_X1 0x000001
#define MC_PCX_PTSA_MIN_PRIORITISE_X1 0x000001
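
If you would rather set these registers from your own program than from the shell, one possible approach is to map the physical page through /dev/mem and write the 32-bit values directly. The following is only a sketch under that assumption, and it carries the same disclaimer as above. The addresses and values are copied from the notes; `page_base`, `page_offset`, `write_reg32` and `set_pcie_priority` are hypothetical helper names. It needs root and a kernel that allows /dev/mem access to this range.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Register addresses and values copied from the notes above. */
#define MC_PCX_PTSA_MAX_REG_X1        0x700194b4u
#define MC_PCX_PTSA_MIN_REG_X1        0x700194b0u
#define MC_PCX_PTSA_MAX_DEFAULT_X1    0x000000u
#define MC_PCX_PTSA_MIN_DEFAULT_X1    0x00003Eu
#define MC_PCX_PTSA_MAX_PRIORITISE_X1 0x000001u
#define MC_PCX_PTSA_MIN_PRIORITISE_X1 0x000001u

/* Page-aligned base of a physical address (page_size must be a power of two). */
uint64_t page_base(uint64_t phys, uint64_t page_size)
{
    return phys & ~(page_size - 1);
}

/* Offset of a physical address within its page. */
uint64_t page_offset(uint64_t phys, uint64_t page_size)
{
    return phys & (page_size - 1);
}

/* Write one 32-bit value to a physical address through /dev/mem.
 * Returns 0 on success, -1 on failure. */
int write_reg32(uint64_t phys, uint32_t value)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;

    void *map = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, (off_t)page_base(phys, (uint64_t)ps));
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    volatile uint32_t *reg =
        (volatile uint32_t *)((char *)map + page_offset(phys, (uint64_t)ps));
    *reg = value;

    munmap(map, (size_t)ps);
    return 0;
}

/* Apply (prioritise != 0) or revert (prioritise == 0) the PCIE settings,
 * writing MAX first and then MIN, in the same order as the commands above. */
int set_pcie_priority(int prioritise)
{
    uint32_t max = prioritise ? MC_PCX_PTSA_MAX_PRIORITISE_X1
                              : MC_PCX_PTSA_MAX_DEFAULT_X1;
    uint32_t min = prioritise ? MC_PCX_PTSA_MIN_PRIORITISE_X1
                              : MC_PCX_PTSA_MIN_DEFAULT_X1;
    if (write_reg32(MC_PCX_PTSA_MAX_REG_X1, max) != 0)
        return -1;
    return write_reg32(MC_PCX_PTSA_MIN_REG_X1, min);
}
```

Calling `set_pcie_priority(1)` before starting the FPGA stream and `set_pcie_priority(0)` afterwards would mirror the two command pairs above.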