Nsight Compute does not detect kernel launches for OpenMP offloaded code

I am currently working with OpenMP offloading using LLVM/clang-16 (built from GitHub). Using the built-in profiling tools in clang (environment variables such as LIBOMPTARGET_PROFILE=profile.json and LIBOMPTARGET_INFO), I was able to confirm that my code is executed on my GPU (GTX 1080, CC 6.1). However, when I try to profile the code using nvprof or ncu (from the NVIDIA Nsight tool suite), I get a warning stating that the profiler did not detect any kernel launches:

> ncu ./saxpy
Time of kernel: 0.000004
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

This is my test code:

#include <cstdio>
#include <omp.h>
#include <cstdlib>

void saxpy(float a, float* x, float* y, int sz) {
	double t = 0.0;
	double tb, te;
	tb = omp_get_wtime();
	// Offload the loop: x is copied to the device, y is copied in both directions.
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
	for (int i = 0; i < sz; i++) {
		y[i] = a * x[i] + y[i];
	}
	te = omp_get_wtime();
	t = te - tb;
	printf("Time of kernel: %lf\n", t);
}

int main() {
	auto x = (float*) malloc(1000 * sizeof(float));
	auto y = (float*) calloc(1000, sizeof(float));
	for (int i = 0; i < 1000; i++) {
		x[i] = i;
	}
	saxpy(42, x, y, 1000);
	free(x);
	free(y);
	return 0;
}

Compiled using the following command:

> clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.cpp -o saxpy --cuda-path=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/10.2 --offload-arch=sm_61 -fopenmp-offload-mandatory

What do I need to do to enable profiling? I have seen others using ncu for clang compiled OpenMP offloading code without additional steps but maybe I am completely missing something.

In general, Nsight Compute (ncu) does not support Pascal-class GPUs. In your case, it seems ncu does not even detect a kernel launch on that device, or the kernel is not executed on the device at all; otherwise, the corresponding error message would have been shown. nvprof does still support profiling on Pascal GPUs, so the fact that it reports no kernels either suggests that the kernel is not actually running or is not being recognized. You may need to pass the option "--openmp-profiling on" to nvprof, see the documentation here. It may also help to post the output you get when tracing the application with nvprof here for reference.

Nsight Compute 2019.5.0, available here, is the last Linux version that supports Pascal (CC 6.1).

Alright, I was not aware of that. I have switched to my actual testing GPU (A100); the GTX 1080 was just what I had locally. With "--openmp-profiling on" I was able to see nvprof output locally on my GTX 1080. What can I do to enable profiling with ncu or nsys on the A100? After a quick look at the documentation, I did not see an option similar to the nvprof one.

Edit: I used another small test code to check if the device is correctly used:

#include <omp.h>

int main() {
  int isDevice = 0;
#pragma omp target map(from : isDevice)
  { isDevice = omp_is_initial_device(); }
  return isDevice;
}

Compiled with clang++ test.cpp -fopenmp --offload-arch=<your-sm> -o test.
This code returns 1 whenever I start it with ncu ./test, meaning the target region is run on the host rather than on the target device. When run “standalone” (just ./test) it returns 0…
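
The exit-status comparison above can be scripted. The following is only a minimal sketch: describe_offload is a hypothetical helper name, and it assumes that ncu passes the application's exit status through, as the observation above suggests.

```shell
# Hypothetical helper: run a command and interpret its exit status
# using the convention of the ./test probe above
# (0 = target region ran on the device, 1 = it ran on the host).
describe_offload() {
  "$@" >/dev/null 2>&1
  case $? in
    0) echo "device" ;;
    1) echo "host" ;;
    *) echo "error" ;;   # e.g. command not found or a crash
  esac
}

# Usage (assumes ./test and ncu exist on this machine):
#   describe_offload ./test
#   describe_offload ncu ./test
```

If the two calls print different results, the difference comes from the environment the profiler sets up for the process rather than from the program itself.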

I tried compiling this app with NVIDIA’s nvc compiler from the HPC SDK with OpenMP enabled and ncu can profile it fine.

root@83d6452ff365:~# nvc -mp=gpu omp_test.c
root@83d6452ff365:~# ./a.out 
root@83d6452ff365:~# echo $?
root@83d6452ff365:~# ncu ./a.out 
==PROF== Connected to process 105 (/root/a.out)
==PROF== Profiling "nvkernel_main_F1L6_2" - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 105
[105] a.out@
  nvkernel_main_F1L6_2, 2023-Jan-10 14:26:51, Context 1, Stream 14
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.71
    SM Frequency                                                             cycle/usecond                         946.81
    Elapsed Cycles                                                                   cycle                           7142
    Memory [%]                                                                           %                          10.65
    DRAM Throughput                                                                      %                           3.26
    Duration                                                                       usecond                           7.52
    L1/TEX Cache Throughput                                                              %                          12.95
    L2 Cache Throughput                                                                  %                          10.65
    SM Active Cycles                                                                 cycle                        3927.10
    Compute (SM) [%]                                                                     %                           7.24
    ---------------------------------------------------------------------- --------------- ------------------------------

This indicates that the issue may be specific to clang’s implementation of OpenMP offload. Can you use the NVIDIA HPC compiler, or otherwise provide more details on how to get or build the version of clang you are using?

I am aware that it works with the nvc compiler but I sadly cannot use it for my project.
The clang that I use is built from source, specifically the source at this commit (so the version is clang-16).

The exact procedure to build this offloading-capable version of LLVM/clang is as follows:

  1. Clone the repository
git clone https://github.com/llvm/llvm-project.git -b main --single-branch
cd llvm-project
git reset --hard 847fa84b3d346313bbad31d4c76b0f70d73827aa
  2. Build the compiler
mkdir build
cd build
make -j20
  3. Bootstrap to create some device-specific libraries:
cd ..
mkdir build2
cd build2
CC=../build/bin/clang CXX=../build/bin/clang++ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70,80 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;openmp;compiler-rt" -DCMAKE_INSTALL_PREFIX=/opt
make -j20
make install
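
After make install, it can be worth confirming that the offloading runtime actually landed under the install prefix. The following is a minimal sketch under two assumptions: that the libraries end up in /opt/lib (derived from -DCMAKE_INSTALL_PREFIX=/opt; the subdirectory may differ by configuration) and that the runtime library is named libomptarget.so. check_offload_libs is a hypothetical helper name.

```shell
# Hypothetical helper: report whether the OpenMP offloading runtime
# is present in the given directory.
check_offload_libs() {
  if [ -e "$1/libomptarget.so" ]; then
    echo "found"
  else
    echo "missing"
  fi
}

# Assumed library directory for the install prefix used above.
check_offload_libs /opt/lib
```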

So overall a very complicated process to get to the exact compiler version I am using…

I tried to follow your steps, but building the compiler failed with this error:

/llvm-project/llvm/lib/Support/Compression.cpp: In function ‘void llvm::compression::zlib::compress(llvm::ArrayRef<unsigned char>, llvm::SmallVectorImpl<unsigned char>&, int)’:
/llvm-project/llvm/lib/Support/Compression.cpp:109:36: error: ‘::compressBound’ has not been declared; did you mean ‘compress2’?
  109 |   unsigned long CompressedSize = ::compressBound(Input.size());

That’s sadly an error I have not yet encountered, so I can only guess why it happens…
Which compiler are you using to build LLVM (for reference: I am using gcc-11.3.0)?
Do you perhaps have access to Singularity (the container platform)? I could provide you with a working definition file for it.
And thanks for putting so much effort into solving this niche problem!

I was able to reproduce the behavior you are seeing when compiling with gcc 11. The problem is that, when the application is run under ncu, the OpenMP CUDA runtime library used by clang is not found by the dynamic linker. We will be looking into why that is the case. In the meantime, passing the path to the clang OpenMP libraries manually via LD_LIBRARY_PATH should work around the problem. This worked for me:

$ LD_LIBRARY_PATH=/opt/lib ncu ./a.out
==PROF== Connected to process 115098 (/var/llvm/a.out)
==PROF== Profiling "__omp_offloading_802_a1546_ma..." - 0: 0%....50%....100% - 10 passes
==PROF== Disconnected from process 115098
[115098] a.out@
  __omp_offloading_802_a1546_main_l7 (1, 1, 1)x(33, 1, 1), Context 1, Stream 13, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.29
    SM Frequency            cycle/usecond       730.72
    Elapsed Cycles                  cycle       29,043
    Memory Throughput                   %         0.09
    DRAM Throughput                     %         0.00
    Duration                      usecond        39.74
    L1/TEX Cache Throughput             %         3.03
    L2 Cache Throughput                 %         0.09
    SM Active Cycles                cycle       218.71
    Compute (SM) Throughput             %         0.02
    ----------------------- ------------- ------------
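
To avoid retyping the variable on every run, the workaround can be wrapped in a small shell snippet. This is a minimal sketch, assuming the clang OpenMP libraries are in /opt/lib as in the command above; prepend_path is a hypothetical helper, not part of ncu or clang.

```shell
# Hypothetical helper: prepend a directory to a colon-separated
# search path unless it is already listed there.
prepend_path() {
  dir=$1
  current=$2
  case ":$current:" in
    *":$dir:"*) printf '%s\n' "$current" ;;                 # already present
    *) printf '%s\n' "${dir}${current:+:$current}" ;;       # add in front
  esac
}

# Assumed location of the clang OpenMP libraries (taken from the command above).
OMP_LIB_DIR=/opt/lib
LD_LIBRARY_PATH=$(prepend_path "$OMP_LIB_DIR" "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH

# Afterwards, run the profiler as usual, for example:
#   ncu ./a.out
```

Prepending rather than overwriting keeps any directories that are already in LD_LIBRARY_PATH, and the check makes the snippet safe to source more than once.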

This workaround (using LD_LIBRARY_PATH) also works for me, thanks again for helping me.

We identified the issue and have a fix available. It is planned to be released with a future version of the tool.