Unified Memory Signal 139 Cuda 10.1

tnallen · June 17, 2019, 11:05pm

Hi all,

After upgrading to the CUDA 10.1 toolkit (and the 418 driver included with it), we are having issues with profiling. Originally we had the “no permissions” issue, which we fixed using the modprobe.d configuration fix.

Now we get a signal 139 if we profile any application that uses unified memory.

[Album] imgur.com

As you can see, it happens with or without root access.
It works if we disable unified memory profiling:

[Album] imgur.com

It also works if we do not use unified memory at all:

[Album] imgur.com

You can see that our runtime/configuration/driver all matches up:

[Album] imgur.com

We also tried the 430 driver, without success.

Full disclosure: I did have to patch driver version 418 to get it to work with our 5.1.5 (and now 5.1.8) kernel. I did not have to patch 430. The patches are not functional changes; they only adjust some function interfaces (e.g. changing int to unsigned int).

You can find the contents of the patch here:

https://gist.github.com/tallendev/bdd3965313f01df2f48b2ade709e4931

nv.patch

diff -uNr NVIDIA-Linux-x86_64-418.67.old/kernel/nv_compiler.h NVIDIA-Linux-x86_64-418.67/kernel/nv_compiler.h
--- NVIDIA-Linux-x86_64-418.67.old/kernel/nv_compiler.h	1969-12-31 19:00:00.000000000 -0500
+++ NVIDIA-Linux-x86_64-418.67/kernel/nv_compiler.h	2019-06-17 14:24:00.256556017 -0400
@@ -0,0 +1 @@
+#define NV_COMPILER "gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC) "
diff -uNr NVIDIA-Linux-x86_64-418.67.old/kernel/nvidia-drm/nvidia-drm-connector.c NVIDIA-Linux-x86_64-418.67/kernel/nvidia-drm/nvidia-drm-connector.c
--- NVIDIA-Linux-x86_64-418.67.old/kernel/nvidia-drm/nvidia-drm-connector.c	2019-04-06 04:30:19.000000000 -0400
+++ NVIDIA-Linux-x86_64-418.67/kernel/nvidia-drm/nvidia-drm-connector.c	2019-06-17 15:14:28.254981745 -0400
@@ -31,6 +31,7 @@
 #include "nvidia-drm-encoder.h"

This file has been truncated.

I guess it’s possible I won’t be able to get help since the kernel does not match the driver. However, I don’t think the changes matter much. It looks like a deeper issue, or the return of an old bug from around CUDA 7/8, but I’m not sure. The last time it worked was on CUDA 9.2.

If anyone has any suggestions, that would be great. Maybe this would better serve as a bug report.
Thanks.

StereoGraphics · May 22, 2020, 6:35pm

Hi… I am having the exact same issue with the same CUDA 10.1 toolkit. Were you able to get this resolved? Thank you for the reply.

mjain · May 27, 2020, 2:21pm

Hi StereoGraphics,

Could you try the CUDA 10.2 toolkit? If you can wait, it would be better to use CUDA 11, which will be available soon.

If this issue still occurs, having more details would help us to inspect the issue at our end. We need details about the GPU used, and a minimal reproducer.

tnallen · August 6, 2020, 4:03am

This still happens for me on 10.2.
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2
Installed from cuda_10.2.89_440.33.01_linux.run for Fedora
5.3.11-100.fc29.x86_64 #1 SMP Tue Nov 12 20:41:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
4x Titan V GPUs
AMD EPYC 7551

When it happens, I get the following in dmesg:

[82999.427686] break[75026]: segfault at 0 ip 00007f3df4b29a79 sp 00007f3dedfcdad8 error 4 in libc-2.28.so[7f3df4aa6000+14d000]
[82999.427700] Code: c3 0f b7 4c 16 fe 0f b7 36 66 89 4c 17 fe 66 89 37 c3 48 81 fa 00 08 00 00 77 8a 48 81 fa 80 00 00 00 77 70 48 83 fa 40 72 47 <0f> 10 06 0f 10 4e 10 0f 10 56 20 0f 10 5e 30 0f 10 64 16 f0 0f 10
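
For anyone reading the dmesg line above, here is a small sketch of how its fields can be decoded. The helper names (fault_offset, ip_in_mapping, is_user_read_of_unmapped_page) are hypothetical and written only for illustration; the bit meanings are the standard x86 page-fault error-code bits, and the numbers come straight from the log line.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error-code bits, as printed after "error" in dmesg. */
#define PF_PROT  0x1u  /* 1 = protection violation, 0 = page not present */
#define PF_WRITE 0x2u  /* 1 = write access, 0 = read access */
#define PF_USER  0x4u  /* 1 = fault raised while in user mode */

/* Offset of the faulting instruction inside the mapped library:
 * the "ip" value minus the base printed in "libc-2.28.so[base+size]". */
static uint64_t fault_offset(uint64_t ip, uint64_t map_base)
{
    return ip - map_base;
}

/* True if ip lies inside the mapping [map_base, map_base + map_size). */
static bool ip_in_mapping(uint64_t ip, uint64_t map_base, uint64_t map_size)
{
    return ip >= map_base && ip - map_base < map_size;
}

/* True if the error code describes a user-mode read of an unmapped page. */
static bool is_user_read_of_unmapped_page(unsigned error)
{
    return (error & PF_USER) && !(error & PF_WRITE) && !(error & PF_PROT);
}
```

With the values from the log, fault_offset(0x7f3df4b29a79, 0x7f3df4aa6000) is 0x83a79, which falls inside the 0x14d000-byte libc mapping, and error 4 is a user-mode read of an unmapped page at address 0. The bytes marked `<0f> 10 06` decode to a movups load through %rsi, which would be consistent with a libc copy routine being handed a NULL source pointer, although that is an inference and not something confirmed in this thread.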

ldd `which nvprof`
linux-vdso.so.1 (0x00007ffc29ff4000)
libcupti.so.10.2 => /usr/local/cuda-10.2/bin/…/extras/CUPTI/lib64/libcupti.so.10.2 (0x00007fa7e7b92000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fa7e7b7c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa7e7b5b000)
librt.so.1 => /lib64/librt.so.1 (0x00007fa7e7b51000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fa7e79b9000)
libm.so.6 => /lib64/libm.so.6 (0x00007fa7e7835000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa7e7818000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa7e7652000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fa7e764d000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa7e8348000)

Minimal Reproducer:

Initializing the memory on the CPU still causes the error; however, the crash seems to happen at the kernel call. Without the kernel call, and with CPU-side initialization only, nvprof works and reports one page fault (as expected).

I am about to try CUDA-11.0 because we need this functionality again. I resolved it before by rolling back to CUDA 9.0, but we have since done a clean reinstall and were using 10.2 for a while before needing this again. It seems that Fedora has been dropped from the supported x86 architectures, so I’m not sure what to expect but will report back if install is successful…
