Hello, I use a Tesla P40 for real-time control applications. Our system uses a series of different cpu’s all reading and writing from various shared memory locations. One of these scripts reads from shared memory and then passes this data to the GPU using a gdr_copy_to_bar() call. Recently I have discovered that this copy has resulted in large delays which causes the entire system to fail via a timeout. These timeouts occur at unpredictable times and sometimes don’t occur at all in otherwise identical tests.
I was running the script responsible for the grd_copy_to_bar()'s with an affinity of 8 while my GPU was hosted by affinities:10-19 so after looking at the README:https://github.com/NVIDIA/gdrcopy/blob/master/README.md I was convinced that this delay was due to the copies occurring on different affinities from the GPU. However, after trying all of the affinities from 10-19 I was unable to avoid this issue altogether.
My question is if other processes are occasionally using the affinities in the GPU hosting range (10-19 in my case) could this cause the gdr_copy_to_bar() to freeze and cause delays as I am observing?
Thank you for your help.