GpuDirect RDMA hanging??

Quadro P4000, TCC.
I have allocated 220MB of BAR1 memory for GpuDirect RDMA.

  • I have an FPGA pushing data to this memory.
  • The FPGA is given a page-table with all the GPU BAR1 page addresses (3520 GPU Pages).
  • The data is simply an incrementing uint32bit value (1, 2, 3, 4....).
  • The FPGA starts at page-table[0] and goes to page-table[3519] and then starts over at [0].
  • Incremented value simply rolls over.

On the Host (System)

  • I perform a cudaMemcpy of all 220MB at a time.
  • Launch a kernel to set the first 32bits of each page to 0xDEADBEEF.
  • In same kernel, loop until 0xDEADBEEF is gone.

Issue:

  • I can go through several copy loops, however, at some point the data in the rdma buffer, from the kernel's perspective, stops changing
  • Looking at a PCIe bus montior, I can see that the data-rate from the FPGA has not changed.
  • This leaves me to believe that there is a problem on the GPU side.
  • I speculate a memory access clash but I have no way of knowing?

Question:

  • Am I running into a memory access clash?
  • Is there a way to signal new data on the GPU side - run a kernel, callback, generate a GPU interrupt?

Host:

...
while( bRunning )
{
   // Using Async so I can run this on its own stream.
   // -- Other threads will be doing their own GPU work.

   // Since the memcpy and kernel are on the same gpu-stream, they are executed serially. 
   cudaMemcpyAsync( pDstBuff_d, pRdmaBufBaseAddr_d, bytesToCopy, cudaMemcpyDeviceToDevice, cudaStream );
   WaitForNewDataKernel <<< 1, numGpuPages, 0, cudaStream >>>( pRdmaBuffBaseAddr_d );

   // Host - wait for new data.
   cudaStreamSynchronize( cudaStream );
}
...

Kernel:

WaitForNewDataKernel( ... )
{
   pPageTableVal += threadId * PAGE_TABLE_OFFSET;

   // Set the value as "read"
   *pPageTableVal = 0xDEADBEEF;
   
   // Loop until new data is present
   while( *pPageTableVal == 0xDEADBEEF )
   {
     ++count;
     if( 0 == count % 100000 )
     {
        printf( "Still waiting" );
     }
   }
}

Thank you for any help.

is the pPageTableVal pointer marked with volatile?

Yes. The Kernel argument is prefaced with volatile.

Thanks.

I simplified my code and now all I do is call the Kernel.

I get through one or two kernel runs, but the kernel eventually “hangs” in the waiting loop.
– Again, PCIe bus shows traffic going to the GPU, in my case at 12GB/s

Host:

  • numPages = 3520
  • numBlocks = 4
  • numThreads = 1024
while( bRunning )
{
   WaitForNewDataKernel <<< numBlocks, numThreads, 0, cudaStream >>>( pRdmaBuffBaseAddr_d, numPages );
   cudaStreamSynchronize( cudaStream );

   threadSleep( 5000 );

}
WaitForNewDataKernel( volatile uint32_t* pPageTableVal, int maxThreadId )
{
   if( threadId >= maxThreadId )
      return;

   pPageTableVal += threadId * PAGE_TABLE_OFFSET;

   // Set the value as "read"
   *pPageTableVal = 0xDEADBEEF;
   
   // Loop until new data is present
   while( *pPageTableVal == 0xDEADBEEF )
   {
     ++count;
     if( 0 == count % 100000 )
     {
         printf( "WAITING: ThreadId=[%d]", threadId );
     }
   }
   printf( "Done: ThreadId=[%d]", threadId );
}