After upgrading to CUDA toolkit 10.1 (with the bundled 418 driver), we are having issues profiling. Originally we had the “no permissions” issue, which we fixed using the modprobe.d configuration fix.
Now we get a signal 139 (i.e. a segmentation fault; 139 = 128 + SIGSEGV 11) if we profile any application that uses unified memory.
Full disclosure: I did have to patch driver version 418 to get it to work with our 5.1.5 kernel, and now with 5.1.8. I did not have to patch 430. The patches were not functional changes, only changes to some of the function interfaces (e.g. changing int to unsigned int).
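For readers hitting the same permissions error: the modprobe.d fix is typically a single module option. The fragment below is a sketch, assuming a 418-series or later driver that exposes the `NVreg_RestrictProfilingToAdminUsers` parameter. The file name is an assumption for illustration; any `.conf` file under `/etc/modprobe.d/` should do.

```
# /etc/modprobe.d/nvidia-profiling.conf  (hypothetical file name)
# Allow non-root users to access the GPU performance counters used by the profilers.
options nvidia NVreg_RestrictProfilingToAdminUsers=0
```

The option only takes effect once the nvidia kernel module is reloaded (a reboot is the simplest way). On some distributions the initramfs may also need to be regenerated.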
You can find the contents of the patch here:
I guess it’s possible I won’t be able to get help since the kernel does not match the driver. However, I don’t think the changes matter much. It looks like a deeper issue, or the return of an old bug from around CUDA 7/8, but I’m not sure. The last time it worked was on CUDA 9.2.
If anyone has any suggestions, that would be great. Maybe this would better serve as a bug report.
Thanks.
Could you give the CUDA 10.2 toolkit a try? If you can wait, it would be better to use CUDA 11, which will be available soon.
If the issue still occurs, more details would help us investigate it on our end. We need details about the GPU used and a minimal reproducer.
This still happens for me on 10.2.
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2
Installed from cuda_10.2.89_440.33.01_linux.run for Fedora
5.3.11-100.fc29.x86_64 #1 SMP Tue Nov 12 20:41:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
4x Titan V GPUs
AMD EPYC 7551
When it happens, I get the following in dmesg:
[82999.427686] break[75026]: segfault at 0 ip 00007f3df4b29a79 sp 00007f3dedfcdad8 error 4 in libc-2.28.so[7f3df4aa6000+14d000]
[82999.427700] Code: c3 0f b7 4c 16 fe 0f b7 36 66 89 4c 17 fe 66 89 37 c3 48 81 fa 00 08 00 00 77 8a 48 81 fa 80 00 00 00 77 70 48 83 fa 40 72 47 <0f> 10 06 0f 10 4e 10 0f 10 56 20 0f 10 5e 30 0f 10 64 16 f0 0f 10
Initializing the memory on the CPU still produces the error; however, the crash itself seems to happen at the kernel launch. Without the kernel call, and with CPU-side initialization only, nvprof works and reports one page fault (as expected).
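For anyone who wants to try this on their own setup, here is a minimal sketch of the kind of program described above: managed memory initialized on the CPU, followed by a kernel that touches it. This is not the actual program from this thread. The kernel name `increment`, the buffer size, and the file name `repro.cu` are all assumptions for illustration.

```cuda
// repro.cu: hypothetical minimal unified-memory test (illustrative names only)
#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments one element of the managed buffer.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Print the CUDA error string and return true if the call failed.
static bool failed(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return false;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return true;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for the sketch
    int *data = nullptr;

    // Allocate unified (managed) memory visible to both CPU and GPU.
    if (failed(cudaMallocManaged(&data, n * sizeof(int)), "cudaMallocManaged")) return 1;

    // CPU-side initialization: pages are first touched on the host.
    for (int i = 0; i < n; ++i) data[i] = i;

    // Kernel launch: the GPU now accesses the same managed pages.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    increment<<<blocks, threads>>>(data, n);
    if (failed(cudaGetLastError(), "kernel launch")) return 1;
    if (failed(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

    // Read the result back on the CPU.
    printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);

    cudaFree(data);
    return 0;
}
```

Build with `nvcc -o repro repro.cu`, then compare a plain run (`./repro`) against a profiled run (`nvprof ./repro`). Commenting out the kernel launch should give the CPU-only case described above.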
I am about to try CUDA 11.0 because we need this functionality again. I resolved it before by rolling back to CUDA 9.0, but we have since done a clean reinstall and were using 10.2 for a while before needing this again. It seems that Fedora has been dropped from the supported x86_64 distributions, so I’m not sure what to expect, but I will report back if the install is successful…