Error running TensorFlow Xid 13 Graphics Exception

I am writing to report an error that I encounter occasionally while running a TensorFlow-based deep learning model. The error message reported by the user application is:
Could not synchronize CUDA stream: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure

Looking at the nvidia-bug-report log, the following is logged at exactly the same time as the application error above:

Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: SKEDCHECK22_INVALIDATE_ACTIVE_QMD failed
Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: ESR 0x407020=0x20000000 0x407028=0x23f 0x40702c=0x20540e8 0x407030=0x0
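
To correlate kernel messages like these with application errors, it can help to extract the Xid number, PCI address, and pid from each line. The following is a minimal sketch that assumes the syslog-style format shown above; `XidEvent` and `parse_xid_line` are hypothetical names, not part of any NVIDIA tool, and other driver versions may format these lines differently.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    timestamp: str       # syslog timestamp, e.g. "Feb 24 11:49:36"
    pci: str             # PCI address as printed by the driver
    xid: int             # Xid error number
    pid: Optional[int]   # process id, if the driver reported a numeric one
    message: str         # remainder of the line


# Assumed layout: "<Mon DD HH:MM:SS> <host> kernel: NVRM: Xid (PCI:<addr>): <n>, pid=<pid>, <msg>"
_XID_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+NVRM:\s+Xid\s+"
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\):\s+(?P<xid>\d+),"
    r"(?:\s+pid=(?P<pid>\d+),)?\s*(?P<msg>.*)$"
)


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an NVRM Xid log line, or None if the line does not match."""
    m = _XID_RE.match(line.strip())
    if m is None:
        return None
    pid = m.group("pid")
    return XidEvent(
        timestamp=m.group("ts"),
        pci=m.group("pci"),
        xid=int(m.group("xid")),
        pid=int(pid) if pid is not None else None,
        message=m.group("msg"),
    )
```

Running this over the kernel log and grouping by `xid` gives a quick view of which Xid codes occur and how often.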

I am using an NVIDIA Quadro RTX 8000 GPU for serving my model, and this error has been occurring repeatedly during inference. I looked up Xid 13 on NVIDIA’s website, and it seems it can have several causes. Our initial guess was that it could be related to high GPU temperature, but we monitored the temperature logs for a few days and everything seems to be fine.
Could you please help me identify the root cause of this error and provide me with any steps that I can take to resolve it?
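
For reference, temperature logging of this kind can be done by polling `nvidia-smi`. The sketch below is an illustration under assumptions, not the exact monitoring we used: it relies on the `--query-gpu` fields `timestamp`, `index`, and `temperature.gpu` with `--format=csv,noheader,nounits`, and the helper names `parse_temperature_rows` and `read_temperatures` are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU: timestamp, GPU index, temperature in C
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperature_rows(csv_text):
    """Parse nvidia-smi CSV output into (timestamp, gpu_index, temperature_or_None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        timestamp, index, temp = [field.strip() for field in line.split(",")]
        # A non-numeric value (for example "[N/A]") is stored as None
        rows.append((timestamp, int(index), int(temp) if temp.isdigit() else None))
    return rows


def read_temperatures():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperature_rows(result.stdout)
```

Calling `read_temperatures()` in a loop with a sleep and appending the rows to a file yields a log whose timestamps can be compared against the time of each Xid event.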

Thank you for your prompt attention to this matter. I look forward to hearing back from you soon.

Sincerely,

Usman
nvidia-bug-report.log (5.3 MB)

You were also getting Xids 31 and 61 at times. Please use gpu-burn and cuda-gpumemtest to check for defective hardware.

@generix
Thanks for the reply. I’ve attached the gpu-burn and cuda-gpumemtest logs. Both tests seem to have passed. What do you think?
gpu_burn.txt (16.2 KB)

cuda-gpumemtest (1).txt (2.6 MB)

Looks good, the GPU should be fine. Since you’re running an X server on the NVIDIA GPU, please create
/etc/X11/xorg.conf.d/nvidia-interactive.conf

Section "OutputClass"
    Identifier "nvidia-interactive"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "Interactive" "false"
EndSection

to avoid interference from Xorg.
It may also be worth checking with the CUDA forums to rule out an error in your application's stream synchronization.