Nsight Compute does not detect kernel launches for OpenMP offloaded code

vincent-adam · January 6, 2023, 2:06pm

I am currently working with OpenMP offloading using LLVM/clang-16 (built from GitHub). Using clang's built-in profiling tools (environment variables such as LIBOMPTARGET_PROFILE=profile.json and LIBOMPTARGET_INFO), I was able to confirm that my code is executed on my GPU (GTX 1080, CC 6.1). However, when I try to profile the code using nvprof or ncu (from the NVIDIA Nsight tool suite), I get a warning stating that the profiler did not detect any kernel launches:

> ncu ./saxpy
Time of kernel: 0.000004
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

This is my test code:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
	double tb = omp_get_wtime();
	// A combined loop construct must be followed directly by the for loop.
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
	for (int i = 0; i < sz; i++) {
		y[i] = a * x[i] + y[i];
	}
	double te = omp_get_wtime();
	// The measured time includes mapping x and y to and from the device.
	printf("Time of kernel: %lf\n", te - tb);
}

int main() {
	auto x = (float*) malloc(1000 * sizeof(float));
	auto y = (float*) calloc(1000, sizeof(float));

	for (int i = 0; i < 1000; i++) {
		x[i] = i;
	}

	saxpy(42, x, y, 1000);

	free(x);
	free(y);
	return 0;
}

Compiled using the following command:

> clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.cpp -o saxpy --cuda-path=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/10.2 --offload-arch=sm_61 -fopenmp-offload-mandatory

What do I need to do to enable profiling? I have seen others using ncu for clang compiled OpenMP offloading code without additional steps but maybe I am completely missing something.

felix_dt · January 6, 2023, 2:14pm

In general, current versions of Nsight Compute (ncu) do not support Pascal-class GPUs. In your case, however, it seems that ncu does not even detect a kernel being executed on that device, or the kernel is not executed on the device at all; otherwise the corresponding error message would have been shown. nvprof still supports profiling on Pascal GPUs, so its result also suggests that the kernel is either not actually running or not recognized. You may need to pass --openmp-profiling on to nvprof; see the documentation here. It may also help to post the output of tracing the application with nvprof here for reference.

rs277 · January 6, 2023, 6:22pm

Nsight Compute 2019.5.0, available here, is the last Linux version that supports Pascal (CC 6.1).

vincent-adam · January 10, 2023, 1:05pm

Alright, I was not aware of that. I switched to my actual testing GPU (A100); the GTX 1080 was just what I had locally. With --openmp-profiling on, I was able to see nvprof output locally on my GTX 1080. What can I do to enable profiling with ncu or nsys on the A100? After a quick look at the documentation, I did not see an option similar to nvprof's.

Edit: I used another small test code to check if the device is correctly used:

#include <omp.h>

int main() {
  int isDevice = 0;
#pragma omp target map(from : isDevice)
  { isDevice = omp_is_initial_device(); }
  return isDevice;
}

Compiled with clang++ test.cpp -fopenmp --offload-arch=<your-sm> -o test.
This code returns 1 whenever I start it with ncu ./test, meaning the target region runs on the host instead of the target device. When run “standalone” (just ./test), it returns 0…
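The check above can be extended to also report how many devices the OpenMP runtime sees, which helps tell "no device visible" apart from "device visible, but the region still fell back to the host". The following is only a minimal sketch: the names OffloadProbe, probe_offload, and describe are hypothetical helpers introduced for illustration, and the OpenMP calls are guarded so the file also builds without -fopenmp (in which case it simply reports zero devices).

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Result of one offload probe (hypothetical helper type).
struct OffloadProbe {
    int num_devices;   // non-host devices visible to the OpenMP runtime
    bool ran_on_host;  // true if the target region executed on the host
};

// Runs a tiny target region and records where it executed.
OffloadProbe probe_offload() {
    OffloadProbe p{0, true};
#ifdef _OPENMP
    p.num_devices = omp_get_num_devices();
    int on_host = 1;
    #pragma omp target map(from : on_host)
    { on_host = omp_is_initial_device(); }
    p.ran_on_host = (on_host != 0);
#endif
    return p;
}

// Turns the probe result into a human-readable verdict.
std::string describe(const OffloadProbe& p) {
    if (p.num_devices == 0)
        return "no offload devices visible";
    if (p.ran_on_host)
        return "devices visible, but the target region ran on the host";
    return "target region ran on an offload device";
}
```

Printing describe(probe_offload()) from main once as ./test and once as ncu ./test should show whether the profiler run changes what the runtime can see. Note that with -fopenmp-offload-mandatory the host fallback is disabled, so this probe is only meaningful without that flag.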

felix_dt · January 10, 2023, 3:17pm

I tried compiling this app with NVIDIA’s nvc compiler from the HPC SDK with OpenMP enabled and ncu can profile it fine.

root@83d6452ff365:~# nvc -mp=gpu omp_test.c
root@83d6452ff365:~# ./a.out 
root@83d6452ff365:~# echo $?
0
root@83d6452ff365:~# ncu ./a.out 
==PROF== Connected to process 105 (/root/a.out)
==PROF== Profiling "nvkernel_main_F1L6_2" - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 105
[105] a.out@127.0.0.1
  nvkernel_main_F1L6_2, 2023-Jan-10 14:26:51, Context 1, Stream 14
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.71
    SM Frequency                                                             cycle/usecond                         946.81
    Elapsed Cycles                                                                   cycle                           7142
    Memory [%]                                                                           %                          10.65
    DRAM Throughput                                                                      %                           3.26
    Duration                                                                       usecond                           7.52
    L1/TEX Cache Throughput                                                              %                          12.95
    L2 Cache Throughput                                                                  %                          10.65
    SM Active Cycles                                                                 cycle                        3927.10
    Compute (SM) [%]                                                                     %                           7.24
    ---------------------------------------------------------------------- --------------- ------------------------------

This indicates that the issue may be specific to clang’s implementation of OpenMP offload. Can you use the NVIDIA HPC compiler, or otherwise provide more details on how to get or build the version of clang you are using?

vincent-adam · January 10, 2023, 3:29pm

I am aware that it works with the nvc compiler but I sadly cannot use it for my project.
The clang that I use is built from source, specifically the source at this commit (so the version is clang-16).

The exact procedure to build this offloading-capable version of LLVM/clang is as follows:

Clone the repository

git clone https://github.com/llvm/llvm-project.git -b main --single-branch
cd llvm-project
git reset --hard 847fa84b3d346313bbad31d4c76b0f70d73827aa

Build the compiler

mkdir build
cd build
cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70,80 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;openmp;compiler-rt"
make -j20

Bootstrap to create some device-specific libraries:

cd ..
mkdir build2
cd build2
CC=../build/bin/clang CXX=../build/bin/clang++ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70,80 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;openmp;compiler-rt" -DCMAKE_INSTALL_PREFIX=/opt
make -j20
make install

So overall a very complicated process to get to the exact compiler version I am using…

felix_dt · January 10, 2023, 3:50pm

I tried to follow your steps, but building the compiler failed with this error:

/llvm-project/llvm/lib/Support/Compression.cpp: In function ‘void llvm::compression::zlib::compress(llvm::ArrayRef<unsigned char>, llvm::SmallVectorImpl<unsigned char>&, int)’:
/llvm-project/llvm/lib/Support/Compression.cpp:109:36: error: ‘::compressBound’ has not been declared; did you mean ‘compress2’?
  109 |   unsigned long CompressedSize = ::compressBound(Input.size());

vincent-adam · January 10, 2023, 4:04pm

That’s sadly an error I have not yet encountered, so I can only guess why it happens…
Which compiler are you using to compile LLVM (for reference: I am using gcc-11.3.0)?
Do you perhaps have access to Singularity (container platform)? I could provide you with a working definition file for it.
And thanks for putting in so much effort into solving this niche problem!

felix_dt · January 10, 2023, 8:47pm

I was able to reproduce the behavior you are seeing when compiling with gcc 11. The problem is that, when the application is run under ncu, the dynamic linker does not find the OpenMP CUDA runtime library used by clang. We will be looking into why that is the case. In the meantime, passing the path to the clang OpenMP libraries manually via LD_LIBRARY_PATH should work around the problem. This worked for me:

$ LD_LIBRARY_PATH=/opt/lib ncu ./a.out
==PROF== Connected to process 115098 (/var/llvm/a.out)
==PROF== Profiling "__omp_offloading_802_a1546_ma..." - 0: 0%....50%....100% - 10 passes
target
==PROF== Disconnected from process 115098
[115098] a.out@127.0.0.1
  __omp_offloading_802_a1546_main_l7 (1, 1, 1)x(33, 1, 1), Context 1, Stream 13, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.29
    SM Frequency            cycle/usecond       730.72
    Elapsed Cycles                  cycle       29,043
    Memory Throughput                   %         0.09
    DRAM Throughput                     %         0.00
    Duration                      usecond        39.74
    L1/TEX Cache Throughput             %         3.03
    L2 Cache Throughput                 %         0.09
    SM Active Cycles                cycle       218.71
    Compute (SM) Throughput             %         0.02
    ----------------------- ------------- ------------
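To see whether a given library directory is actually present in the environment the profiled process inherits, the application itself can inspect LD_LIBRARY_PATH. The sketch below is a rough approximation only: it checks just the LD_LIBRARY_PATH directories, whereas the real dynamic linker also consults RPATH/RUNPATH entries and the ld.so cache. The function names find_in_dirs and find_in_ld_library_path are hypothetical, and the library name used in the note after the code is an assumption, since the thread does not say which file failed to load.

```cpp
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Returns the first directory in a colon-separated list that contains a
// readable file named `lib`, or an empty string if none does.
std::string find_in_dirs(const std::string& dir_list, const std::string& lib) {
    std::istringstream dirs(dir_list);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        // ld.so treats an empty entry as the current directory;
        // this sketch skips such entries for simplicity.
        if (dir.empty()) continue;
        std::ifstream probe(dir + "/" + lib);
        if (probe.good()) return dir;
    }
    return "";
}

// Looks the library up in the LD_LIBRARY_PATH this process inherited.
std::string find_in_ld_library_path(const std::string& lib) {
    const char* env = std::getenv("LD_LIBRARY_PATH");
    return find_in_dirs(env ? env : "", lib);
}
```

Calling find_in_ld_library_path("libomptarget.so") early in main and printing the result, once standalone and once under ncu, would show whether /opt/lib (the install prefix used in this thread) is on the search path in both cases.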

vincent-adam · January 11, 2023, 10:46am

This workaround (using LD_LIBRARY_PATH) also works for me, thanks again for helping me.

felix_dt · February 28, 2023, 2:53pm

We identified the issue and have a fix available. It is planned to be released with a future version of the tool.
