On a TX1 running JetPack 3.2. On our system, an FPGA writes data over PCIe into DMA memory while, at the same time, the GPU is continuously doing work: kernel launches and pinned H <-> D transfers.
At lower data rates the system works fine. However, at higher data rates (~3 Gbps over PCIe), the PCIe link has problems if cudaMemcpy calls are happening at the same time.
Without any cudaMemcpy calls, the system works as expected.
If I split each cudaMemcpy call (of a few MB) into ~20 smaller ones, then the system still works fine.
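For reference, the splitting is done roughly like the sketch below (a minimal illustration only; the helper name, chunk count, and lack of error checking are my own simplifications, and the buffers are pinned host memory / device memory as described above):

// Sketch of the workaround: replace one large cudaMemcpy with ~20 smaller
// synchronous copies so the copy engine gives up DRAM access between chunks.
#include <cuda_runtime.h>
#include <cstddef>

static void chunkedMemcpyHtoD(void* dst, const void* src, size_t totalBytes,
                              int numChunks = 20)
{
    // Round chunk size up so numChunks copies cover the whole buffer.
    size_t chunkBytes = (totalBytes + numChunks - 1) / numChunks;
    for (size_t offset = 0; offset < totalBytes; offset += chunkBytes) {
        size_t n = (totalBytes - offset < chunkBytes) ? totalBytes - offset
                                                      : chunkBytes;
        // Error checking omitted for brevity.
        cudaMemcpy(static_cast<char*>(dst) + offset,
                   static_cast<const char*>(src) + offset,
                   n, cudaMemcpyHostToDevice);
    }
}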
It appears that larger cudaMemcpy calls make the PCIe controller wait too long, causing its buffers to overrun.
This leads me to believe that the GPU's copy engine hogs memory bandwidth while performing a transfer, and the PCIe controller cannot afford to wait that long for access to DRAM.
Is there a reason the GPU and PCIe do not share memory bandwidth more fairly?
Does the GPU transfer take priority over PCIe requests?
Can I give PCIe requests priority over GPU traffic?