Unified Memory Profiling: Signal 139 with CUDA 10.1

Hi all,

After upgrading to CUDA toolkit 10.1 (which includes driver 418), we are having issues profiling. Originally we hit the “no permissions” issue, which we fixed using the modprobe.d configuration fix.

Now we get a signal 139 if we profile any application that uses unified memory.
https://imgur.com/a/Xjdf7Ok
As you can see, it happens with or without root access.
It works if we disable unified memory profiling:
https://imgur.com/a/nvdAju3
It also works if we do not use unified memory at all:
https://imgur.com/a/csTIS5X

You can see that our runtime/configuration/driver all matches up:
https://imgur.com/a/papoRnR

We also tried the 430 driver, without success.

Full disclosure: I did have to patch driver version 418 to get it to work with our 5.1.5, and now 5.1.8, kernel. I did not have to patch 430. The patches were not functional changes, just changes to some of the function interfaces (i.e. changing int to unsigned int).

You can find the contents of the patch here:
https://gist.github.com/tallendev/bdd3965313f01df2f48b2ade709e4931

I guess it’s possible I won’t be able to get help since the kernel does not match the driver. However, I don’t think the changes matter much. It seems like a deeper issue, or the return of an old bug from CUDA ~7/8, but I’m not sure. The last time it worked was on CUDA 9.2.

If anyone has any suggestions, that would be great. Maybe this would better serve as a bug report.
Thanks.

Hi… I am having the exact same issue with the same CUDA 10.1 toolkit. Were you able to get this resolved? Thank you for the reply.

Hi StereoGraphics,

May I ask you to try the CUDA 10.2 toolkit? If you can wait, it would be better to use CUDA 11, which will be available soon.

If this issue still occurs, more details would help us investigate it on our end. We need details about the GPU used, and a minimal reproducer.

This still happens for me on 10.2.
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2
Installed from cuda_10.2.89_440.33.01_linux.run for Fedora
5.3.11-100.fc29.x86_64 #1 SMP Tue Nov 12 20:41:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
4x Titan V GPUs
AMD EPYC 7551

When it happens, I get the following in dmesg:

[82999.427686] break[75026]: segfault at 0 ip 00007f3df4b29a79 sp 00007f3dedfcdad8 error 4 in libc-2.28.so[7f3df4aa6000+14d000]
[82999.427700] Code: c3 0f b7 4c 16 fe 0f b7 36 66 89 4c 17 fe 66 89 37 c3 48 81 fa 00 08 00 00 77 8a 48 81 fa 80 00 00 00 77 70 48 83 fa 40 72 47 <0f> 10 06 0f 10 4e 10 0f 10 56 20 0f 10 5e 30 0f 10 64 16 f0 0f 10

ldd `which nvprof`
linux-vdso.so.1 (0x00007ffc29ff4000)
libcupti.so.10.2 => /usr/local/cuda-10.2/bin/…/extras/CUPTI/lib64/libcupti.so.10.2 (0x00007fa7e7b92000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fa7e7b7c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa7e7b5b000)
librt.so.1 => /lib64/librt.so.1 (0x00007fa7e7b51000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fa7e79b9000)
libm.so.6 => /lib64/libm.so.6 (0x00007fa7e7835000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa7e7818000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa7e7652000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fa7e764d000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa7e8348000)

Minimal Reproducer:


Initializing the memory on the CPU still causes the error; however, the crash seems to happen at the kernel call. Without the kernel call and with CPU-side initialization, nvprof works and reports one page fault (as expected).
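
For reference, a program of the shape described here might look like the sketch below. It is not the original reproducer: the kernel name `touch`, the array size, and the launch configuration are illustrative assumptions. It allocates managed memory, initializes it on the CPU, and then launches a kernel that touches the same pages.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments each element so the GPU touches the managed pages.
__global__ void touch(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

int main() {
    const int n = 1 << 20;  // assumed size, chosen only for illustration
    int *data = nullptr;

    // Allocate unified (managed) memory visible to both CPU and GPU.
    cudaError_t err = cudaMallocManaged(&data, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // CPU-side initialization of the managed buffer.
    for (int i = 0; i < n; ++i) {
        data[i] = i;
    }

    // Kernel call on the managed buffer (the step where the crash is reported).
    touch<<<(n + 255) / 256, 256>>>(data, n);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(data);
        return 1;
    }

    // Read back on the CPU to confirm the kernel ran.
    printf("data[0] = %d\n", data[0]);

    cudaFree(data);
    return 0;
}
```

Assuming the file is saved as repro.cu (a hypothetical name), it can be built with `nvcc repro.cu -o repro` and profiled with `nvprof ./repro`. Running `nvprof --unified-memory-profiling off ./repro` gives the comparison case with unified memory profiling disabled.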

I am about to try CUDA 11.0 because we need this functionality again. I resolved it before by rolling back to CUDA 9.0, but we have since done a clean reinstall and were using 10.2 for a while before needing this again. It seems that Fedora has been dropped from the supported x86 distributions, so I’m not sure what to expect, but I will report back if the install is successful…