Quadro P4000, TCC.
I have allocated 220MB of BAR1 memory for GpuDirect RDMA.
- I have an FPGA pushing data to this memory.
- The FPGA is given a page-table with all the GPU BAR1 page addresses (3520 GPU Pages).
- The data is simply an incrementing uint32bit value (1, 2, 3, 4....).
- The FPGA starts at page-table[0] and goes to page-table[3519] and then starts over at [0].
- Incremented value simply rolls over.
On the Host (System)
- I perform a cudaMemcpy of all 220MB at a time.
- Launch a kernel to set the first 32bits of each page to 0xDEADBEEF.
- In same kernel, loop until 0xDEADBEEF is gone.
Issue:
- I can go through several copy loops, however, at some point the data in the rdma buffer, from the kernel's perspective, stops changing
- Looking at a PCIe bus montior, I can see that the data-rate from the FPGA has not changed.
- This leaves me to believe that there is a problem on the GPU side.
- I speculate a memory access clash but I have no way of knowing?
Question:
- Am I running into a memory access clash?
- Is there a way to signal new data on the GPU side - run a kernel, callback, generate a GPU interrupt?
Host:
...
while( bRunning )
{
// Using Async so I can run this on its own stream.
// -- Other threads will be doing their own GPU work.
// Since the memcpy and kernel are on the same gpu-stream, they are executed serially.
cudaMemcpyAsync( pDstBuff_d, pRdmaBufBaseAddr_d, bytesToCopy, cudaMemcpyDeviceToDevice, cudaStream );
WaitForNewDataKernel <<< 1, numGpuPages, 0, cudaStream >>>( pRdmaBuffBaseAddr_d );
// Host - wait for new data.
cudaStreamSynchronize( cudaStream );
}
...
Kernel:
WaitForNewDataKernel( ... )
{
pPageTableVal += threadId * PAGE_TABLE_OFFSET;
// Set the value as "read"
*pPageTableVal = 0xDEADBEEF;
// Loop until new data is present
while( *pPageTableVal == 0xDEADBEEF )
{
++count;
if( 0 == count % 100000 )
{
printf( "Still waiting" );
}
}
}
Thank you for any help.