Nsight Profiler Hangs on OpenMP Initialization

I’m trying to profile a hybrid CUDA + OpenMP program using the Nsight profiler to verify concurrent host and device execution in the timeline. However, when trying to profile OpenMP, the program just hangs in the profiler injection stage. As a minimal example, this is also reproducible with a simple OpenMP Hello World program:

// Compile with clang -fopenmp=libomp <filename>
// OMP_PROC_BIND=spread, OMP_PLACES=threads exported beforehand
// Profile with
//    OMP_NUM_THREADS=4 nsys profile --trace=openmp,osrt --output=minimal ./a.out
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
  { printf("Hello World from thread = %d\n", omp_get_thread_num()); }

  return 0;
}

While the program is frozen, attaching GDB to the process reveals where it’s stuck (here I copied the stack trace from the build with Clang 12 but the same issue occurs with Clang 15):

#0  0x00007f84181ce7e0 in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#1  0x00007f8417efb76d in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#2  0x00007f8417efad0f in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#3  0x00007f8417efc186 in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#4  0x00007f8417efa6e2 in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#5  0x00007f8417ef8c36 in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#6  0x00007f8417ef987c in ?? () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#7  0x00007f8417c4866c in NSYS_DL_dlsym () from <cuda installation>/12.1.1/nsight-systems-2023.1.2/target-linux-x64/libToolsInjection64.so
#8  0x00007f841971b00f in ompt_start_tool () from <clang installation>/12.0.0_rhel7/lib64/libomp.so
#9  0x00007f841971b14d in ompt_pre_init () from <clang installation>/12.0.0_rhel7/lib64/libomp.so
#10 0x00007f84196b36ff in __kmp_do_serial_initialize() () from <clang installation>/12.0.0_rhel7/lib64/libomp.so
#11 0x00007f84196b3e3d in __kmp_get_global_thread_id_reg () from <clang installation>/12.0.0_rhel7/lib64/libomp.so
#12 0x00007f84196a5d34 in __kmpc_fork_call () from <clang installation>/12.0.0_rhel7/lib64/libomp.so
#13 0x0000000000400683 in main ()

Versions used (info for Clang 15):

$ nsys --version
NVIDIA Nsight Systems version 2023.1.2.43-32377213v0
$ clang --version
clang version 15.0.6
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: <clang installation>/15.0.6_rhel7/bin

This issue does not occur with the Nsight profiler included with CUDA 11.2.2. However, the project must be built with CUDA 12 since it uses C++20 features.

If I compile with CUDA 12 and then profile with CUDA 11, the profiler doesn’t always show all the CUDA kernels in the program. Based on other posts and the warnings in Nsight, I assume that a version mismatch between the profiler and the device drivers is likely the reason for this.

Is this an issue with the profiler or is there something I need to configure to resolve this? Thanks.

Hi @alessandro.vinciguerra - thank you for reporting the problem. Could you try nsys version 2023.2 from the Nsight Systems page on the NVIDIA Developer website? Does the OpenMP app still hang with the new version of nsys? If so, @liuyis will be able to help.

The nsys included in the CUDA 11 toolkit is too old to support CUDA 12 applications, so the missing CUDA kernels are expected.

Hi @skottapalli, thanks for the reply. I installed version 2023.2.1 via the Linux Host .run installer but the program still hangs, leaving the same stack trace while frozen.

Thanks Sneha, I’ll take a look.

@alessandro.vinciguerra Somehow I am not able to reproduce this issue locally. Nsys worked with both Clang 12 and 15 on my system using the simple program you shared.

liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ cat omp.c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
  { printf("Hello World from thread = %d\n", omp_get_thread_num()); }

  return 0;
}

liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ ./clang+llvm-15.0.0-x86_64-linux-gnu-rhel-8.4/bin/clang -fopenmp=libomp omp.c
liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ ldd ./a.out
        linux-vdso.so.1 (0x00007fffd3728000)
        libomp.so => /home/liuyis/LS/WS/Forum-256490/clang+llvm-15.0.0-x86_64-linux-gnu-rhel-8.4/lib/libomp.so (0x00007f9f1fe27000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9f1fdea000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9f1fbf8000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f9f1fbee000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9f1fbe8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9f1ff22000)
liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ export OMP_PROC_BIND=spread
liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ export OMP_PLACES=threads
liuyis@liuyis-ws-cn:~/LS/WS/Forum-256490$ OMP_NUM_THREADS=4 nsys profile --trace=openmp,osrt --output=minimal ./a.out
Hello World from thread = 0
Hello World from thread = 1
Hello World from thread = 2
Hello World from thread = 3
Generating '/tmp/nsys-report-39c4.qdstrm'
[1/1] [========================100%] minimal.nsys-rep

Was there anything wrong with the commands I used? If not, could you help collect logs? Follow the steps:

  1. Save the following content to /tmp/nvlog.config:

+ 75iwef global
$ /tmp/nsight-sys.log
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text

  2. Add --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config to your Nsys command line. E.g. OMP_NUM_THREADS=4 nsys profile --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config --trace=openmp,osrt --output=minimal ./a.out
  3. Run a collection. There should be a log at /tmp/nsight-sys.log. Share the log with us.
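
For anyone repeating these steps, the first one can be scripted. The sketch below only copies the three configuration lines exactly as quoted above into a file. `NVLOG_CONFIG_PATH` and `write_nvlog_config` are hypothetical helper names introduced for illustration, not nsys settings. It prints the profiling command from the steps above rather than running it.

```shell
# Hypothetical helper variable (not an nsys setting); defaults to the path
# named in the steps above.
NVLOG_CONFIG_PATH="${NVLOG_CONFIG_PATH:-/tmp/nvlog.config}"

# Write the three config lines as quoted in the post.
# Single quotes keep $sevc, ${name:0}, etc. literal instead of expanding them.
write_nvlog_config() {
  printf '%s\n' \
    '+ 75iwef global' \
    '$ /tmp/nsight-sys.log' \
    'Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text' \
    > "$1"
}

write_nvlog_config "$NVLOG_CONFIG_PATH"

# Print (do not run) the profiling command from the steps above.
echo "OMP_NUM_THREADS=4 nsys profile --env-var=NVLOG_CONFIG_FILE=$NVLOG_CONFIG_PATH --trace=openmp,osrt --output=minimal ./a.out"
```

The strings are single-quoted because an unquoted or double-quoted `$sevc` or `${name:0}` would be expanded by the shell and the config would be written with those fields missing.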

Also, if we still cannot figure out the root cause from the log, we might need access to a system that can reproduce the issue to debug further. Is it possible for you to share one?


Thanks for checking @liuyis; there doesn’t seem to be anything wrong with the compilation process in your post. Aside from installation paths, the ldd output is the same for me as well. I suppose this means it might be an issue with the local installation.

The log seems to primarily be these two lines repeated over and over at the end:

$ sed -Ee 's/^[^|]+|//' /tmp/nsight-sys.log | sort | uniq -c
 233321 |Injection|52881|InjectionDL.cpp:399[NSYS_DL_dlsym]:Handling dlsym(0xffffffffffffffff, ompt_start_tool) = 0x7f74059b8df0
 233322 |quadd_common_runtime_elf|52881|DLSymHook.cpp:261[DLSymNext]:RTLD_NEXT for 'ompt_start_tool' from 0x7f74059b8e0e

I noticed that nsys itself was linking against libgcc_s from the GCC installation as well, but removing this from the linker path so that it would fall back to /lib64/libgcc_s.so.1 did not change anything. I’m guessing this detail is not relevant to the issue at hand.

Here is the “full” logfile, or at least just the first 269 lines. As you can guess from the above output, the rest of the logfile doesn’t contain any additional useful information. It seems to be stuck in some kind of loop?
nsight-sys.log (28.0 KB)
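
One way to read the two repeated log lines is to compare the addresses they contain. The sketch below uses only the two addresses copied from the log output above; the variable names are illustrative. It computes how far the RTLD_NEXT lookup site lies from the address that the dlsym hook reported for ompt_start_tool.

```shell
# Addresses copied from the two repeated log lines above.
returned=0x7f74059b8df0   # value logged for dlsym(..., ompt_start_tool)
caller=0x7f74059b8e0e     # address the RTLD_NEXT lookup was issued from

# Distance from the returned symbol address to the lookup site, in bytes.
offset=$(( $caller - $returned ))
printf 'offset: %d bytes\n' "$offset"   # prints "offset: 30 bytes"
```

A lookup site only 30 bytes past the returned address is consistent with the lookup being issued from inside the same function that the hook hands back. In that case each "next" lookup would resolve to itself again, which would match the looping log. This is an inference from the addresses alone, not a confirmed diagnosis.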

Regarding sharing access, if this is needed I will have to ask if it is possible.

The log seems to indicate there’s something wrong with the interposer module, @afroger do you have any insights?

@alessandro.vinciguerra Could you also check whether the hang goes away when OSRT trace is disabled, i.e. with osrt removed from the --trace= option?

It no longer hangs with the OSRT trace removed, thanks! OSRT does provide useful information, though, so it would be nice to be able to keep OSRT as well.

I have attached the log from the OpenMP-only trace for reference: nsight-sys.log (13.7 KB)

In the meantime, removing the OSRT trace does allow simultaneous CUDA and OpenMP profiling, which was the original goal, so I will use this workaround for now.
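
As a minimal sketch of this workaround, the snippet below strips osrt from a comma-separated trace list and prints the resulting command. `drop_trace` is a hypothetical helper written for this example, not part of nsys, and the starting list `cuda,openmp,osrt` is an assumption about which traces a hybrid CUDA + OpenMP run would otherwise request.

```shell
# drop_trace LIST NAME: remove NAME from a comma-separated trace list.
# Hypothetical helper for illustration; not an nsys feature.
drop_trace() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd, -
}

# Assumed starting list for a hybrid CUDA + OpenMP run.
trace=$(drop_trace "cuda,openmp,osrt" osrt)

# Print (do not run) the profiling command with OSRT tracing left out.
echo "OMP_NUM_THREADS=4 nsys profile --trace=$trace --output=minimal ./a.out"
# prints: OMP_NUM_THREADS=4 nsys profile --trace=cuda,openmp --output=minimal ./a.out
```

Passing the trace list explicitly matters here because the hang reported in this thread only went away once osrt was no longer among the traced APIs.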


I encountered the same problem on our Karolina supercomputer. Nsight 2023.2.3.1001-32894139v0 Linux, CUDA 12.2, driver 535.104.05, A100-40GB, CentOS-7.

I observed that this only happens when the code is compiled with clang or a clang-based compiler (clang 13.0.1, icpx 2023.1.0.20230320). When the program is compiled with g++, the profiling works fine.

Disabling the OSRT trace resolves the issue for me too. But have you found any other workaround? Or more importantly, can we expect an update with a fix in the future?


Hi, I have the same problem with the 2023 and 2024 versions of Nsight. I just introduced OpenMP into my program, and it now hangs on the function call that gets the number of OpenMP threads. I can run nsys with cuda and nvtx tracing, but I’d like to be able to not specify the trace option. Any ideas?