331.67 Driver Calls Abort in Application

Here’s the backtrace. What could cause the driver to call abort? The computer is at a remote site, but if needed, I can try to get a bug report.



What application does this happen with? How do you reproduce the issue? Please also attach an nvidia-bug-report.

This happens when running our application. Our application runs on multiple computers in a rack, but the crash always occurs on the computer in the same position in two separate racks. We thought it could be a power or temperature problem, but I don’t know how to parse the bug report. Here’s the bug report.

This report was collected after the crash.

nvidia-bug-report3102016.log.gz (340 KB)

Snuffalufagus, I see this in the log:

Mar 9 18:29:13 pcig_render3 kernel: NVRM: os_pci_init_handle: invalid context!
Mar 9 18:29:13 pcig_render3 kernel: NVRM: GPU at 0000:04:00: GPU-bf9cb650-8887-5d73-a78e-6dbec29a8f5d
Mar 9 18:29:13 pcig_render3 kernel: NVRM: Xid (0000:04:00): 31, Ch 00000003, engmask 00000101, intr 10000000

You can look up Xid error codes here: http://docs.nvidia.com/deploy/xid-errors/index.html
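
For anyone else who is unsure how to pull these lines out of a bug report, here is a minimal Python sketch. It assumes the Xid lines follow the `NVRM: Xid (<pci address>): <number>, <details>` layout shown in the log excerpt above; other driver versions may format them differently. The function name `find_xids` and the file name in the usage comment are only illustrative.

```python
import gzip
import re

# Matches kernel log lines of the form seen in the excerpt above:
#   NVRM: Xid (0000:04:00): 31, Ch 00000003, engmask 00000101, intr 10000000
# Assumption: the text in parentheses identifies the GPU and the first
# number after the colon is the Xid code.
XID_RE = re.compile(r"NVRM: Xid \((?P<gpu>[^)]+)\): (?P<xid>\d+), (?P<detail>.*)")

def find_xids(lines):
    """Return a list of (gpu, xid_number, detail) tuples found in the lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("gpu"), int(match.group("xid")), match.group("detail").strip())
            )
    return events

# The two NVRM lines quoted above, used as sample input.
sample = [
    "Mar 9 18:29:13 pcig_render3 kernel: NVRM: os_pci_init_handle: invalid context!",
    "Mar 9 18:29:13 pcig_render3 kernel: NVRM: Xid (0000:04:00): 31, "
    "Ch 00000003, engmask 00000101, intr 10000000",
]

for gpu, xid, detail in find_xids(sample):
    print(f"GPU {gpu}: Xid {xid} ({detail})")

# To scan a real report (hypothetical file name), something like:
#   with gzip.open("nvidia-bug-report.log.gz", "rt", errors="replace") as f:
#       events = find_xids(f)
```

The Xid number it prints can then be looked up in the Xid documentation linked above.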

What application are you running? To debug this issue, we need to reproduce it internally.