Yes, it is a null pointer dereference, written primarily to test the core dump functionality. It is called through a wrapper written in C. This is the code that is called to cause the exception:
I’ve tried again with the synchronise call in the host code, but to no avail. I’ve also tried with the debug flags you compiled with. Are those strictly necessary for generating the .nvcudmp file?
6.4. Xid 43: Reset Channel Verif Error
This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state.
In most cases, this is not indicative of a driver bug but rather a user application error.
Obviously we can agree on the last statement (user error). Is the driver getting in the way of the coredump being produced?
We will need to investigate the driver part of the coredump generation. Could you provide us with the following information:
The output of nvidia-smi -q
An NVIDIA bug report. You can generate it by running the nvidia-bug-report.sh script as root. The script should be available as part of your CUDA installation and should be present in your PATH.
It will take us some time to investigate the issue. We will update this post as soon as we have something.
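For anyone gathering the same information, the two items requested above could be collected with something like the sketch below. It assumes both tools are on the PATH; the `out_dir` variable and the output file name `nvidia-smi-q.txt` are illustrative choices, not something specified in this thread.

```shell
#!/bin/sh
# Sketch: collect the diagnostics requested above.
# out_dir and nvidia-smi-q.txt are hypothetical names chosen for this example.
out_dir="."

# Full GPU and driver query, saved to a text file.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q > "$out_dir/nvidia-smi-q.txt"
else
    echo "nvidia-smi not found in PATH" >&2
fi

# The bug report script has to run as root, so only call it when we are root.
if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
    if [ "$(id -u)" -eq 0 ]; then
        # By default this writes nvidia-bug-report.log.gz to the current directory.
        nvidia-bug-report.sh
    else
        echo "re-run as root to generate the NVIDIA bug report" >&2
    fi
else
    echo "nvidia-bug-report.sh not found in PATH" >&2
fi
```

Both resulting files can then be attached to the forum post.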
Hello!
We believe that it might be an issue in our stack. The fix should be available in one of the upcoming CUDA versions. I will update this post as soon as the fixed CUDA version is released.
Hi @AKravets. A colleague asked me to add libcudadebugger to the target system as part of a separate bit of work. I found that this helped the system generate a CUDA coredump.
For the record, and for anyone who finds this in the future: the system I am using is embedded, so I have been picking the smallest number of libraries needed and loading them individually, instead of installing the entire CUDA toolkit, which would use a lot of space.
So, adding libcudadebugger1_575.51.03-1_amd64.deb to the system helped, and the environment variables I exported were:
Yes, the libcudadebugger package is required for GPU coredump generation. Since GPU coredumps can be generated with this library present on the system, can I mark the topic as resolved?
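For readers setting up a similarly minimal system, here is a hedged sketch of the checks and exports involved. The exact variables exported by the original poster are not shown in this thread; `CUDA_ENABLE_COREDUMP_ON_EXCEPTION` and `CUDA_COREDUMP_FILE` are the variables documented in the CUDA-GDB manual, while the dump path and the application name `./my_cuda_app` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: prepare a minimal system for GPU coredump generation.

# Check that a libcudadebugger shared object is visible to the dynamic linker.
# Per this thread, GPU coredumps were not generated without it.
if ldconfig -p 2>/dev/null | grep -q 'libcudadebugger'; then
    echo "libcudadebugger found"
else
    echo "libcudadebugger not found; install the libcudadebugger package" >&2
fi

# Ask the CUDA driver to write a GPU coredump when a GPU exception occurs.
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1

# Optional: choose where the dump is written (hypothetical path for this example).
export CUDA_COREDUMP_FILE="/tmp/cuda_test.nvcudmp"

# Then run the faulting application from the same shell, for example:
# ./my_cuda_app    # hypothetical binary name
```

If the dump is produced, it can be opened in cuda-gdb with `target cudacore /tmp/cuda_test.nvcudmp`.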