I am writing to report an error that I occasionally encounter while running a TensorFlow-based deep learning model. The error message reported in the user application is:
Could not synchronize CUDA stream: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
The NVIDIA bug report shows the following entries, logged at exactly the same time as the application error above:
Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: SKEDCHECK22_INVALIDATE_ACTIVE_QMD failed
Feb 24 11:49:36 sonic-2-ThinkStation-P620 kernel: NVRM: Xid (PCI:0000:61:00): 13, pid=1335, Graphics Exception: ESR 0x407020=0x20000000 0x407028=0x23f 0x40702c=0x20540e8 0x407030=0x0
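In case it helps anyone match Xid entries against application error timestamps, here is a minimal sketch of a parser for kernel log lines in the format shown above. The function name `parse_xid_line` and the regular expression are my own assumptions, modeled only on these two lines; they are not part of any NVIDIA tooling, and other driver versions may format Xid messages differently.

```python
import re

# Pattern modeled on the two Xid lines quoted above (an assumption, not an
# official format). Lines with a non-numeric pid will simply not match.
XID_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+NVRM:\s+Xid\s+"
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\):\s+"
    r"(?P<xid>\d+),\s+pid=(?P<pid>\d+),\s+(?P<message>.*)$"
)


def parse_xid_line(line):
    """Return a dict of fields for an Xid log line, or None if it does not match."""
    match = XID_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and filtered easily.
    fields["xid"] = int(fields["xid"])
    fields["pid"] = int(fields["pid"])
    return fields
```

Running this over each line of the kernel log and keeping the non-`None` results gives a list of Xid events whose `timestamp` field can be compared with the time of the CUDA error in the application log.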
I am using an NVIDIA Quadro RTX 8000 GPU to serve my model, and this error occurs repeatedly during inference. I looked up Xid 13 on NVIDIA's website, and it seems it can have several causes. Our initial guess was that it could be related to high GPU temperature, but we monitored the temperature logs for a few days and everything seems to be fine.
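For anyone who wants to repeat the temperature check, here is a rough sketch of how a temperature log could be scanned for hot readings. It assumes the log was collected as two-column CSV (timestamp, temperature in Celsius), for example with `nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv,noheader,nounits`. The function name `readings_above` and the threshold are hypothetical; the threshold should be chosen from the card's specifications.

```python
import csv
import io


def readings_above(csv_text, threshold_c):
    """Return (timestamp, temperature) pairs whose temperature exceeds threshold_c.

    Assumes each row is "timestamp, temperature" with the temperature as a
    whole number of degrees Celsius. Rows that do not fit are skipped.
    """
    hot = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue
        timestamp, temp = row[0].strip(), row[1].strip()
        try:
            value = int(temp)
        except ValueError:
            # Header lines or malformed readings end up here.
            continue
        if value > threshold_c:
            hot.append((timestamp, value))
    return hot
```

An empty result for the logged period is consistent with what we observed, namely that temperature does not appear to be the trigger.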
Could you please help me identify the root cause of this error and provide me with any steps that I can take to resolve it?
Attached are the NVIDIA bug report, GPU burn test, and memory test results.
nvidia-bug-report.log (5.3 MB)
Thank you for your prompt attention to this matter. I look forward to hearing back from you soon.
I also posted this with NVIDIA, but they asked me to check with CUDA.