Error running TensorFlow Xid 13 Graphics Exception

I am writing to report an error that I encounter occasionally while running a TensorFlow-based deep learning model. The error message reported by the user application is:
Could not synchronize CUDA stream: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure

The NVIDIA bug report shows the following logged at exactly the same time as the application error above:

Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: SKEDCHECK22_INVALIDATE_ACTIVE_QMD failed
Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: ESR 0x407020=0x20000000 0x407028=0x23f 0x40702c=0x20540e8 0x407030=0x0
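To line these kernel messages up with application errors, the Xid lines can be pulled out of the system log programmatically. The following is a minimal Python sketch, assuming the log lines have the same shape as the two shown above; the pattern and the `parse_xid_line` helper are illustrative names, not part of any NVIDIA tool, and other driver versions may format the line differently.

```python
import re

# Pattern matching NVRM Xid lines shaped like the two log lines above.
# Assumption: "NVRM: Xid (PCI:<bus id>): <xid>, pid=<pid>, <message>".
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<msg>.*)$"
)


def parse_xid_line(line):
    """Return a dict with pci, xid, pid and msg, or None if the line is not an Xid report."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": int(m.group("pid")),
        "msg": m.group("msg").strip(),
    }
```

Filtering a saved kernel log with this helper and keeping only entries whose `xid` is 13 gives a list of timestamps that can be compared against the application's own error log.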

I am using an NVIDIA Quadro RTX 8000 GPU for serving my model, and this error has been occurring repeatedly during inference. I looked up Xid 13 on NVIDIA’s website, and it seems that it can have several causes. Our initial guess was that it could be related to high GPU temperature, but we monitored the temperature logs for a few days and everything seems to be fine.
Could you please help me identify the root cause of this error and provide me with any steps that I can take to resolve it?
Attached are the NVIDIA bug report, the GPU burn test, and the memory test.
Thank you for your prompt attention to this matter. I look forward to hearing back from you soon.
nvidia-bug-report.log (5.3 MB)

I also posted this with NVIDIA, but they asked me to check with CUDA.

cuda-gpumemtest (1).txt (2.6 MB)

An unspecified launch failure usually means that something illegal took place in kernel code. If you do something illegal in kernel code, you will almost always get an Xid report in the system log.

The problem may be related to some way that you are using TF incorrectly, or to a defect in TF itself.

There is no way to identify the root cause without having an expert TF user who can identify the specific kernel launch that caused the failure and analyze that kernel. It can’t be done based on what is reported here. Furthermore, this forum is not focused on TF support and you probably won’t find many expert TF users here.
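As a possible starting point for narrowing down which kernel launch failed, one commonly used approach is to set the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so that the error tends to surface at the offending launch rather than at a later stream synchronization. The following is a minimal Python sketch, not a diagnosis: the script name `serve_model.py` and both helper functions are hypothetical, and running this way will slow inference down.

```python
import os
import subprocess


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 disables asynchronous kernel launches.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(cmd):
    """Run a command (for example the model server) with blocking launches."""
    # The variable must be set before the process initializes CUDA,
    # so it is passed through the child's environment.
    return subprocess.run(cmd, env=blocking_env(), check=False)


if __name__ == "__main__":
    # Hypothetical entry point; replace with the real serving script.
    run_with_blocking_launches(["python", "serve_model.py"])
```

If the failure still reproduces in this mode, the TensorFlow operation being executed when the error is raised is a more reliable hint about which kernel to examine.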