Gdr_copy_to_bar Causes Random Delay #119

afb2137 · May 5, 2020, 8:56pm

Hello, I use a Tesla P40 for real-time control applications. Our system uses a series of different cpu’s all reading and writing from various shared memory locations. One of these scripts reads from shared memory and then passes this data to the GPU using a gdr_copy_to_bar() call. Recently I have discovered that this copy has resulted in large delays which causes the entire system to fail via a timeout. These timeouts occur at unpredictable times and sometimes don’t occur at all in otherwise identical tests.

I was running the script responsible for the grd_copy_to_bar()'s with an affinity of 8 while my GPU was hosted by affinities:10-19 so after looking at the README:gdrcopy/README.md at master · NVIDIA/gdrcopy · GitHub I was convinced that this delay was due to the copies occurring on different affinities from the GPU. However, after trying all of the affinities from 10-19 I was unable to avoid this issue altogether.

My question is if other processes are occasionally using the affinities in the GPU hosting range (10-19 in my case) could this cause the gdr_copy_to_bar() to freeze and cause delays as I am observing?
Thank you for your help.