I am running nvidia-driver 410 with CUDA 10 on a 1080 Ti. I have periodic video transcoding processes (ffmpeg) that run on the GPU. After running for some time, these processes get stuck and never complete. When I then try to spawn a new transcoding session, I get an error saying that no GPU is available.
The only workaround so far has been to kill all the stuck processes and then unload and reload the nvidia kernel module. The logs show an Xid 31 error, which is “a GPU memory page fault”. I am not sure whether this is a driver issue (I have also tried nvidia-396 with CUDA 9.2 and get the same error).
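For anyone wanting to check their own logs for the same thing, here is a minimal sketch of how the Xid entries could be pulled out of kernel log text and counted per code. It assumes the lines look roughly like `NVRM: Xid (PCI:0000:01:00): 31, ...`; the exact layout varies by driver version, the function names are hypothetical, and the sample log lines in the code are made up for illustration, not copied from my machine.

```python
import re
from collections import Counter

# Assumed shape of an Xid line. The "PCI:" prefix and the trailing
# fields differ between driver versions, so the pattern is kept loose.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+),?\s*(.*)")


def parse_xid_lines(log_text):
    """Return a list of (bus_id, xid_code, detail) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            bus_id, code, detail = match.groups()
            events.append((bus_id, int(code), detail.strip()))
    return events


def count_by_code(events):
    """Count how many times each Xid code appears."""
    return Counter(code for _, code, _ in events)


# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, engmask 00000101, intr 10000000
[ 1240.000001] usb 1-1: new high-speed USB device number 4 using xhci_hcd
[ 2345.678901] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000018, engmask 00000101, intr 10000000
"""

if __name__ == "__main__":
    events = parse_xid_lines(SAMPLE_LOG)
    for bus_id, code, detail in events:
        print(f"GPU {bus_id}: Xid {code} ({detail})")
    print(dict(count_by_code(events)))
```

On a real machine the input would be the output of `dmesg` or `journalctl -k` instead of `SAMPLE_LOG`. Lining up the Xid timestamps with the ffmpeg job start times might show whether a particular input or job triggers the fault.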
Any ideas on how to proceed with debugging?