Nv-nsight-cu-cli segfault

I get a segfault whenever nv-nsight-cu-cli tries to run an application. This also happens when the GUI version invokes it. It is application-independent; see the following output from a CUDA samples program.

nv-nsight-cu-cli ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 1484 (/home/[user]/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

dmesg shows this information:
[ 502.772847] show_signal_msg: 51 callbacks suppressed
[ 502.772851] matrixMul[2360]: segfault at 38 ip 00007fc370e3ecc1 sp 00007fc36fd47970 error 4 in libcuda-injection.so[7fc370c9e000+1206000]
[ 502.772864] Code: c3 0f 1f 84 00 00 00 00 00 55 53 ba 07 00 00 00 89 fb 40 0f b6 ff 48 83 ec 08 48 8d 05 18 f7 8f 02 8b 35 12 b3 83 02 48 8b 00 50 38 85 c0 74 11 48 8d 2d 01 79 2c 01 0f b7 45 08 66 83 f8 01

Package installed via runfile.
System settings:
Fedora 29, Linux 5.3.11-100.fc29.x86_64
gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2
Titan V GPU
I get the same issue with the 2019.5 release distributed here (https://developer.nvidia.com/nsight-compute).

This also happens on another system with a slightly different hardware configuration and a Titan Xp, although the Xp does not appear to be supported (https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#gpu-support), so that’s not really surprising.

Any ideas? Happy to provide any other info, etc.

We are not aware of any related issue. You could try the following to narrow down the potential cause:

  • Ensure that the application runs fine without the tool. You likely already did that, so just double-checking.
  • Make sure that the nv-nsight-cu-cli executable really is the one you intend to run. It looks like you are starting it from $PATH, so you could try using the absolute path to nv-nsight-cu-cli, or alternatively check the output of nv-nsight-cu-cli --version.
  • Try running the tool with elevated privileges (sudo), if not done already.
  • Let us know the output of the “locale” command in your shell
  • Try collecting only a single metric, i.e. “nv-nsight-cu-cli --metrics device__attribute_display_name ./matrixMul”

Please let us know the results for these tests.
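The "which executable is actually being run" check above can be sketched as a small shell helper. This is only an illustration: the function name resolve_tool is made up here, and it assumes a Linux system where readlink -f is available.

```shell
# Hypothetical helper (not part of Nsight Compute): print the absolute,
# symlink-resolved path of a command found on $PATH, or return 1 if the
# shell cannot find it.
resolve_tool() {
    local found
    found=$(command -v "$1") || return 1
    readlink -f "$found"
}

# Example, left commented out because it needs Nsight Compute installed:
#   resolve_tool nv-nsight-cu-cli
#   nv-nsight-cu-cli --version
#   locale
```

If the printed path is not inside the toolkit you expect, running the tool by its absolute path removes $PATH ordering as a variable.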

  • Ensure that the application runs fine without the tool. You likely already did that, so just double-checking.
    To demonstrate, here are two applications. For the second, longer-running application, the crash seems to happen around the time of the first CUDA runtime activity.

./matrixMul
[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “TITAN V” with compute capability 7.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 2539.18 GFlop/s, Time= 0.052 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

nv-nsight-cu-cli ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 72453 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

./w2v -read-vocab vocab1/text8.vocab -read-corpus-cache corpi/cache/text8.corpus_cache -debug 1 -output grad2.emb -epoch 20 -threads 4 -streams 4
number of epochs: 20
max-sentence-length: 1000
sample: 0.001000
window size: 5
initial learning rate: 0.02500
number of negative samples: 5
layer size: 100
kernel batch size: 200
streams per thread: 4
number of threads: 4
train using file:
Loading vocabulary from file: vocab1/text8.vocab
Vocab size: 71291
Words in train file: 16718844
Loading corpus cache from file: corpi/cache/text8.corpus_cache
Load 99.98%
Loaded
Initializing network
Network ready
Kernel Selection: neg_w2v_kernel_small – Block Structure == x <negatives+1>
Kernel Structure: <<<(6,200,1), (100,1,1)>>>
— cut for brevity —
Saving model to: grad2.emb

nv-nsight-cu-cli ./w2v -read-vocab vocab1/text8.vocab -read-corpus-cache corpi/cache/text8.corpus_cache -debug 1 -output grad2.emb -epoch 20 -threads 4 -streams 4
number of epochs: 20
max-sentence-length: 1000
sample: 0.001000
window size: 5
initial learning rate: 0.02500
number of negative samples: 5
layer size: 100
kernel batch size: 200
streams per thread: 4
number of threads: 4
train using file:
Loading vocabulary from file: vocab1/text8.vocab
Vocab size: 71291
Words in train file: 16718844
Loading corpus cache from file: corpi/cache/text8.corpus_cache
Load 99.98%
Loaded
Initializing network
==PROF== Connected to process 72615 (/home/tnallen/dev/word2vec_2/w2v)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

  • **Make sure that the nv-nsight-cu-cli executable really is the one you intend to run. It looks like you are starting it from $PATH, so you could try using the absolute path to nv-nsight-cu-cli, or alternatively check the output of nv-nsight-cu-cli --version.**
    [tnallen@voltron matrixMul] nv-nsight-cu-cli --version
    NVIDIA ® Nsight Compute Command Line Profiler
    Copyright © 2012-2019 NVIDIA Corporation
    Version 2019.5.0 (Build 27346997)

/usr/local/cuda-10.2/bin/nv-nsight-cu-cli ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 72728 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

  • Try running the tool with elevated privileges (sudo), if not done already.
    I get the same results as below if I log in with a proper root environment as well:

sudo /usr/local/cuda-10.2/bin/nv-nsight-cu-cli ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 72781 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

  • Let us know the output of the “locale” command in your shell
    locale
    LANG=en_US.UTF-8
    LC_CTYPE=“en_US.UTF-8”
    LC_NUMERIC=“en_US.UTF-8”
    LC_TIME=“en_US.UTF-8”
    LC_COLLATE=“en_US.UTF-8”
    LC_MONETARY=“en_US.UTF-8”
    LC_MESSAGES=“en_US.UTF-8”
    LC_PAPER=“en_US.UTF-8”
    LC_NAME=“en_US.UTF-8”
    LC_ADDRESS=“en_US.UTF-8”
    LC_TELEPHONE=“en_US.UTF-8”
    LC_MEASUREMENT=“en_US.UTF-8”
    LC_IDENTIFICATION=“en_US.UTF-8”
    LC_ALL=

  • Try collecting only a single metric, i.e. “nv-nsight-cu-cli --metrics device__attribute_display_name ./matrixMul”
    nv-nsight-cu-cli --metrics device__attribute_display_name ./matrixMul
    [Matrix Multiply Using CUDA] - Starting…
    ==PROF== Connected to process 73160 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
    ==ERROR== The application returned an error code (11)
    ==WARNING== No kernels were profiled
    ==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

For all of these runs, dmesg gives me essentially the same message as before:
[238884.026445] matrixMul[73163]: segfault at 38 ip 00007f52dadf5cc1 sp 00007f52d9cfd970 error 4 in libcuda-injection.so[7f52dac55000+1206000]
[238884.026455] Code: c3 0f 1f 84 00 00 00 00 00 55 53 ba 07 00 00 00 89 fb 40 0f b6 ff 48 83 ec 08 48 8d 05 18 f7 8f 02 8b 35 12 b3 83 02 48 8b 00 50 38 85 c0 74 11 48 8d 2d 01 79 2c 01 0f b7 45 08 66 83 f8 01
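One way to compare these kernel messages is to turn each one into an offset inside the library. As a sketch, assuming the usual layout of the line (fault address, then ip, sp, error code, and the mapping as object[base+size]), subtracting the mapping base from ip gives the location of the faulting instruction relative to the start of libcuda-injection.so. The helper name fault_offset is made up for this example.

```shell
# Hypothetical helper: offset of the faulting instruction inside the mapped
# library, given the "ip" value and the mapping base from a dmesg segfault line.
# Both arguments are hexadecimal without the 0x prefix, as dmesg prints them.
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

fault_offset 00007fc370e3ecc1 7fc370c9e000   # first report: 0x1a0cc1
fault_offset 00007f52dadf5cc1 7f52dac55000   # this report:  0x1a0cc1
```

Both runs fault at the same offset even though the load addresses differ, which suggests the same instruction is failing each time. The small fault address (38) would be consistent with reading a field through a null pointer, though that is only an inference from the log.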

Here is some valgrind output I generated as a long shot. Could the glibc version mismatch be responsible?
nv-nsight-cu-cli ./matrixMul
==73999== Memcheck, a memory error detector
==73999== Copyright © 2002-2017, and GNU GPL’d, by Julian Seward et al.
==73999== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==73999== Command: /usr/local/cuda-10.2/bin/…/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli ./matrixMul
==73999==
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 74018 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)
==ERROR== The application returned an error code (11)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option
==73999== Invalid read of size 1
==73999== at 0x572D99: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x5CA253: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x5CB459: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x49F2E0: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x433E45: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x42B0CB: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x41E48F: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x4A4E412: (below main) (in /usr/lib64/libc-2.28.so)
==73999== Address 0x4f5c268 is 8 bytes inside a block of size 88 free’d
==73999== at 0x4838A0C: free (vg_replace_malloc.c:540)
==73999== by 0x553DAD: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x554288: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x53212E: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x5321D7: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x530F9A: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x53A8A1: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x53EFF3: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x52E8D1: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x52EBD7: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x52ECF8: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x4DBEB2: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== Block was alloc’d at
==73999== at 0x483780B: malloc (vg_replace_malloc.c:309)
==73999== by 0xBE0787: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x57144E: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x5CDFC8: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x5CF546: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x48C173: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x428F0F: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x42AF95: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x41E48F: ??? (in /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli)
==73999== by 0x4A4E412: (below main) (in /usr/lib64/libc-2.28.so)
---------- snip ----------

So far we have been unable to reproduce the issue on our end using a Fedora 29 system with the same compiler (gcc (GCC) 8.3.1 20190223), CUDA sample (matrixMul), CUDA toolkit (10.2), display driver (440.33.01) and Nsight Compute version (2019.5.0 (Build 27346997)). The only noticeable difference is the Linux kernel (4.18.16-300.fc29.x86_64 vs 5.3.11-100.fc29.x86_64), which I would assume is not the issue here.

As for the valgrind error you pasted, I am not certain it is strictly related. Valgrind reports the issue in the host process of the Nsight Compute command line (nv-nsight-cu-cli), while the segmentation fault seems to occur within the target application process (The application returned an error code (11)).

So far, I don’t believe the gcc version is an issue. The glibc version noted in the Nsight Compute target directory indicates the minimum requirement, and the tool is compatible with systems using newer versions of glibc, too.
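The "minimum requirement" reading of the directory name can be checked with a short script. This is a sketch under a few assumptions: the helper name version_ge is made up, sort -V needs GNU coreutils, and the minimum 2.11.3 is simply read off the linux-desktop-glibc_2_11_3-x64 directory name.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum taken from the directory name linux-desktop-glibc_2_11_3-x64.
required=2.11.3
# On glibc systems getconf prints something like "glibc 2.28"; keep field 2.
installed=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "glibc $installed satisfies the minimum $required"
else
    echo "could not confirm glibc >= $required (found: ${installed:-unknown})"
fi
```

On the system in this thread, which loads libc-2.28.so, the first branch would be expected to run.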

Would it be possible for you to run the tool and application under gdb, using its “set follow-fork-mode child” command, and provide us the output of “bt” as well as “info sharedlibrary” at the point of the segfault in the application process?
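For reference, the same gdb steps can also be run non-interactively. The wrapper below is a sketch only: the function name collect_child_backtrace and the log file name are made up, while -batch, -ex and --args are standard gdb options.

```shell
# Hypothetical wrapper: run a command under gdb in batch mode, follow forked
# children, and print the backtrace and loaded libraries once the debugged
# process stops (for example on SIGSEGV).
collect_child_backtrace() {
    gdb -batch \
        -ex "set follow-fork-mode child" \
        -ex "run" \
        -ex "bt" \
        -ex "info sharedlibrary" \
        --args "$@"
}

# Example (requires gdb and Nsight Compute, so it is left commented out):
#   collect_child_backtrace nv-nsight-cu-cli ./matrixMul 2>&1 | tee gdb-child.log
```

Batch mode makes the output easy to attach to a post, at the cost of not being able to poke around further at the crash site.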

nv-nsight-cu-cli ./matrixMul
GNU gdb (GDB) Fedora 8.2-7.fc29
Copyright © 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from /usr/local/cuda-10.2/bin/…/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli…(no debugging symbols found)…done.
(gdb) set follow-fork-mode child
(gdb) r
Starting program: /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli ./matrixMul
warning: Loadable section “.note.gnu.property” outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
warning: Loadable section “.note.gnu.property” outside of ELF segments
[New Thread 0x7fffeac75700 (LWP 20816)]
[New Thread 0x7fffea474700 (LWP 20817)]
[Attaching after Thread 0x7ffff7c20780 (LWP 20805) fork to child process 20818]
[New inferior 2 (process 20818)]
[Detaching after fork from parent process 20805]
[Inferior 1 (process 20805) detached]
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
process 20818 is executing new program: /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/TreeLauncherSubreaper
warning: Loadable section “.note.gnu.property” outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
warning: Loadable section “.note.gnu.property” outside of ELF segments
[Attaching after Thread 0x7ffff710b900 (LWP 20818) fork to child process 20822]
[New inferior 3 (process 20822)]
[Detaching after fork from parent process 20818]
[Inferior 2 (process 20818) detached]
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
process 20822 is executing new program: /home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul
warning: Loadable section “.note.gnu.property” outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
warning: Loadable section “.note.gnu.property” outside of ELF segments
warning: Loadable section “.note.gnu.property” outside of ELF segments
[Matrix Multiply Using CUDA] - Starting…
[New Thread 0x7ffff4a5b700 (LWP 20824)]
[New Thread 0x7ffff425a700 (LWP 20825)]
[New Thread 0x7fffe6aaf700 (LWP 20826)]
==PROF== Connected to process 20822 (/home/tnallen/build/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/matrixMul)

Thread 3.3 “matrixMul” received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff425a700 (LWP 20825)]
0x00007ffff5351cc1 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0 0x00007ffff5351cc1 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1 0x00007ffff5369cb4 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#2 0x00007ffff549b49b in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#3 0x00007ffff5bc9202 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#4 0x00007ffff5bc9961 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#5 0x00007ffff5d6cb32 in ?? () from /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#6 0x00007ffff51814aa in start_thread (arg=) at pthread_create.c:479
#7 0x00007ffff4d723f3 in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x00007ffff7fd4110 0x00007ffff7ff31f4 Yes /lib64/ld-linux-x86-64.so.2
0x00007ffff7c70b00 0x00007ffff7d7c63c Yes (*) /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libTreeLauncherTargetUpdatePreloadInjection.so
0x00007ffff52f34d0 0x00007ffff5e04a4c Yes (*) /usr/local/cuda-10.2/nsight-compute-2019.5.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
0x00007ffff519c710 0x00007ffff519fa30 Yes /lib64/librt.so.1
0x00007ffff517fb50 0x00007ffff518de65 Yes /lib64/libpthread.so.0
0x00007ffff5174270 0x00007ffff5175069 Yes /lib64/libdl.so.2
0x00007ffff5069a10 0x00007ffff511fbe2 Yes /lib64/libstdc++.so.6
0x00007ffff4e62490 0x00007ffff4f0165a Yes /lib64/libm.so.6
0x00007ffff4e3d590 0x00007ffff4e4e1f5 Yes /lib64/libgcc_s.so.1
0x00007ffff4c96670 0x00007ffff4de189f Yes /lib64/libc.so.6
0x00007ffff4c703f0 0x00007ffff4c70d28 Yes /lib64/libutil.so.1
0x00007ffff4a61fc0 0x00007ffff4a63be2 Yes (*) /usr/local/cuda-10.2/lib64/stubs/libcuda.so
0x00007ffff51a5130 0x00007ffff51a607d Yes /usr/lib64/gconv/UTF-32.so
(*): Shared library is missing debugging information.
(gdb)

This seems consistent with the dmesg output:
[326337.496629] matrixMul[20748]: segfault at 38 ip 00007ffff5351cc1 sp 00007ffff4259970 error 4 in libcuda-injection.so[7ffff51b1000+1206000]
[326337.496642] Code: c3 0f 1f 84 00 00 00 00 00 55 53 ba 07 00 00 00 89 fb 40 0f b6 ff 48 83 ec 08 48 8d 05 18 f7 8f 02 8b 35 12 b3 83 02 48 8b 00 50 38 85 c0 74 11 48 8d 2d 01 79 2c 01 0f b7 45 08 66 83 f8 01

If there’s a way for me to provide those missing symbols in the backtrace, I’m happy to do that, but I’m not aware of them being provided.

You can’t provide these symbols, but we can use the backtrace internally to resolve them.

0x00007ffff4a61fc0 0x00007ffff4a63be2 Yes (*) /usr/local/cuda-10.2/lib64/stubs/libcuda.so

Looking at this line, the application is loading the stub libcuda.so from /usr/local/cuda-10.2/lib64/stubs. Can you try removing that stubs directory from LD_LIBRARY_PATH and see if that solves the issue? Another temporary workaround could be to rename or move the stub. We have seen stub CUDA libraries on the library path cause similar problems. If that does solve the problem, it should already be addressed in the next version of Nsight Compute.
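A minimal sketch of the LD_LIBRARY_PATH cleanup suggested above, assuming the stub directories can be recognized by a path ending in /stubs. The helper name strip_stub_dirs is made up, and the second directory in the demonstration is illustrative rather than taken from the thread.

```shell
# Hypothetical helper: print a colon-separated path list with every entry that
# ends in "/stubs" (optionally followed by slashes) removed. Empty entries are
# dropped as well.
strip_stub_dirs() {
    printf '%s\n' "$1" | tr ':' '\n' \
        | grep -v -e '/stubs/*$' -e '^$' \
        | paste -s -d ':' -
}

# Demonstration on a sample value (the second path is illustrative):
strip_stub_dirs "/usr/local/cuda-10.2/lib64/stubs:/usr/local/cuda-10.2/lib64"
# prints /usr/local/cuda-10.2/lib64

# To apply it to the current shell before profiling:
#   export LD_LIBRARY_PATH="$(strip_stub_dirs "$LD_LIBRARY_PATH")"
```

Afterwards, something like ldd ./matrixMul | grep libcuda should show whether libcuda.so still resolves to the stubs directory or to the driver's library.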

That was it. Looks like it is working now. Thanks a lot for your time and help!