Nsight Compute not detecting kernel launch


I am having a hard time profiling my instruction scheduling kernel using Nvidia Nsight Compute. I recently updated to an RTX 3080 in my environment and can no longer use nvprof as I had before. I am trying to profile a plugin for Clang-7 that performs instruction scheduling by launching a kernel to perform ACO scheduling. I had no issues with using nvprof to profile on my GTX 1080, but when I use the same launch parameters in Nsight Compute, the kernel launches and executes as expected, but Nsight reports ==WARNING== No kernels were profiled.

I did make some changes to the kernel after switching to the RTX 3080. I changed my thread sync method to use this_grid().sync() to sync all threads and switched the kernel launch to cudaLaunchCooperativeKernel() from the usual kernel launch in order to use cooperative groups. I also updated the nvcc flags to the new GPU architecture -gencode arch=compute_86,code=sm_86. Other than these changes, nothing else changed between using nvprof on my GTX 1080 and attempting to use Nsight Compute on my RTX 3080.

I am unable to make a minimal reproduction of this issue; the NVIDIA toolkit sample kernels profile as expected.

For reference, here is the command I tried in terminal that resulted in the ==WARNING== No kernels were profiled. message even though I can confirm multiple kernels launched and executed successfully:

sudo /usr/local/cuda-11.2/bin/ncu --target-processes all /home/vlad/CompilerProject/v7flang/flang-install/bin/clang -m64 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -O3 -fplugin=/home/vlad/CompilerProject/v7flang/llvm/dev_aco_release/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched-cfg=/home/vlad/CompilerProject/optsched-cfgs/GPU_ACO -DSPEC_LP64 /home/vlad/CompilerProject/CPU2017/benchspec/CPU/519.lbm_r/src/lbm.c

This command launches clang-7 with my scheduling plugin OptSched.so (which contains the scheduling CUDA kernel) to build the lbm benchmark in SPEC CPU2017. The scheduling kernel is launched successfully 12 times during building of lbm.c and reports no errors, but no kernels are profiled.

In case it is important, here are the flags that I use to compile my CUDA C++ code with nvcc:

-x cu -Xcompiler "-fPIC -fvisibility-inlines-hidden -Werror=date-time -std=c++14 -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -ffunction-sections -fdata-sections -O3 -DNDEBUG -fno-exceptions -fno-rtti" -gencode arch=compute_86,code=sm_86 -dlink --ptxas-options=-v -rdc true -lineinfo

This is on Ubuntu 18.04 with the latest CUDA 11.2 toolkit. I should note Nsight Systems profiles the host code with no issues using the same config and detects the kernel launches but does not provide any useful information about them.

What am I doing wrong here? Can cudaLaunchCooperativeKernel() kernels not be profiled? Am I missing a compilation flag or is one of my flags preventing profiling?

I appreciate any input anyone has on this issue, as it is preventing me from optimizing kernel performance on my research project.

Thank you for your time, please inquire if you have any question about my environment.

Small update.
I confirmed kernels launched using cudaLaunchCooperativeKernel() are able to be profiled by replacing the kernel launch in vectorAdd.cu sample with cudaLaunchCooperativeKernel(). Nsight does not seem to be able to detect any CUDA API calls that happen in the clang-7 process.

Just for our understanding, can you please confirm the exact ncu version you are using (i.e. by posting the output of ncu --version)?

Can you share more instructions on how to reproduce this setup (i.e. CUDA kernels being launched by a clang plugin) for debugging purposes?

Hello Felix,

Here is the printout from ncu --version:

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2020.3.1.0 (build 29567428) (public-release)

I would like to note that no CUDA API calls are detected by ncu, not just kernels. I would also like to stress that nvprof had no issues profiling this project and all of its CUDA API calls on my GTX 1080.

For reproducing this setup, you would need to build flang-llvm-7 with my OptSched branch in the projects folder.
Here is a link to my fork of the project: https://github.com/VladM1076/OptSched.git
You will need to switch to the GPU_ACO branch for the version that schedules using CUDA.

Unfortunately the build process is fairly long and I have not been able to figure out how to set cmake flags for the CUDA C++ files correctly so they must be set manually. Hopefully this explanation is adequate and helps with resolving whatever issue is causing CUDA API calls not to be detected.

After build is completed, you will be able to attempt to profile the compile of any C/C++ file using the command:
sudo /usr/local/cuda-11.2/bin/ncu --target-processes all /<path to v7flang location>/v7flang/flang-install/bin/clang -m64 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -O3 -fplugin=/<path to v7flang location>/v7flang/llvm/build/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched-cfg=/<path to v7flang location>/v7flang/llvm/projects/OptSched/example/optsched-cfg/ -DSPEC_LP64 <path to any C++ file>
Note the 3 <path to v7flang location> tags that must be replaced with where v7flang directory is located and the <path to any C++ file> must be replaced with a file to compile.
This will compile any C++ file using our scheduler running on CUDA.
You can also add the --mode=launch flag to ncu to try to attach from ncu gui, but that did not work for me either.

Here are the steps I took to build flang/clang and my plugin scheduler:

#step 1 setup folder for build files and install dir:
mkdir -p v7flang/flang-install
cd v7flang/flang-install
cd ..

#step 2 Get the llvm flang, OptSched and apply patches
#run these commands in the directory v7flang

#step 2.1 get flang llvm
git clone https://github.com/flang-compiler/llvm.git
cd llvm
git checkout release_70

#step 2.2 get OptSched
cd projects
git clone https://github.com/VladM1076/OptSched.git

#step 2.2.1 checkout the branch of OptSched that uses CUDA
cd OptSched
git checkout GPU_ACO
cd ..

#step 2.3 get back to the directory v7flang/llvm
cd ..

#step 2.4 apply the print spilling patch for llvm 7
git apply projects/OptSched/patches/llvm7.0/flang-llvm7-print-spilling-info.patch

#step 2.5 build llvm
mkdir build
cd build
#fix build.ninja file to allow building of CUDA files
vim build.ninja
# here you must modify the files' build commands. 
# Search for aco.cu (/aco.cu) to find the start of the files' build instructions. 
# Then make the following changes: 
# 1) Make sure each file is set to use CUDA_COMPILER  instead of CXX_COMPILER
# 2) Change the flags to the ones used in my first post, 
# with the architecture set to your Nvidia GPU's arch
# The files that must have their build instructions changed are:
# aco.cu, bb_spill.cpp, data_dep.cpp, gen_sched.cpp, graph.cpp, list_sched.cpp, 
# machine_model.cpp, random.cpp, ready_list.cpp, register.cpp, sched_basic_data.cpp, 
# sched_region.cpp, and OptimizingScheduler.cpp
# after build.ninja has been modified save a copy of it, 
# sometimes cmake reruns and deletes the changes
cp build.ninja cuda_build.ninja
# if cmake reruns, copy cuda_build.ninja back to build.ninja with cp cuda_build.ninja build.ninja

#step 2.6 go back to the directory v7flang
cd ../..

# I am not sure if the rest of the build instructions are necessary to reproduce this issue, but I have always built
# flang-driver, openmp, and flang when running the project, however these might only be necessary for FORTRAN file compilation

#step 3 install flang-driver
#run these commands from v7flang

#step 3.1 download/configure flang-driver
git clone https://github.com/flang-compiler/flang-driver.git
cd flang-driver
git checkout release_70

#step 3.2 build flang-driver
mkdir build
cd build
make -j2 install

#step 3.3 get back to v7flang
cd ../../

#step 4 install flang openmp
#run these commands from v7flang

#step 4.1 download/configure openmp
git clone https://github.com/llvm-mirror/openmp.git
cd openmp
git checkout release_70

#step 4.2 build flang openmp
mkdir build
cd build
make -j2 install

#step 4.3 get back to v7flang
cd ../..

#step 5 build FLANG

#step 5.1 clone from GitHub
git clone https://github.com/flang-compiler/flang.git

#step 5.1.1 (only necessary if you are building with an old version of CMake like the one on optimizer/optimizer2)
sed -i 's/NATIVE_COMMAND/UNIX_COMMAND/g' flang/runtime/flang/CMakeLists.txt

#step 5.2 build libpgmath
cd flang/runtime/libpgmath
mkdir build
cd build
make -j2 install

#step 5.3 go back to v7flang/flang
cd ../../..

#step 5.4 build flang
mkdir build
cd build
make -j2 install

#step 5.5 back out of the directory structure we just created
cd ../../..

#step 6 (optional) add $FLANG_INSTALL to your .bashrc and add the fortran runtime to your library path
echo "export FLANG_INSTALL=$FLANG_INSTALL" >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$FLANG_INSTALL/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

Another small update.
I just updated to CUDA 11.3 and rebuilt the project using nvcc from CUDA 11.3. Nsight Compute still does not profile clang-7 properly, still does not detect any CUDA API calls, and prints ==WARNING== No kernels were profiled. Using the --mode=launch flag never pauses the process, and the process never appears in the Nsight Compute GUI as available to be attached to.

Versions for both:

vlad@vlad-MS-7C80:~/CompilerProject/CPU2017$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
vlad@vlad-MS-7C80:~/CompilerProject/CPU2017$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2021.1.0.0 (build 29693910) (public-release)

Thanks for the details. I will have a try with your instructions and get back to you once I have more info.

After some minor modifications, I was able to build the plugin, but I am seeing the following error when trying to run it with clang-7

clang-7 -m64 -c -o test.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -O3 -fplugin=llvm/build/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched=cfg=llvm/projects/OptSched/example/optsched-cfg/ app.cpp 

/usr/lib/llvm-7/bin/clang: symbol lookup error: llvm/build/lib/OptSched.so: undefined symbol: __cudaRegisterLinkedBinary_38_tmpxft_0000a480_00000000_7_aco_cpp1_ii_e1135e14

The ninja link instructions for aco.cu.o look like this:

build projects/OptSched/lib/CMakeFiles/obj.OptSched.dir/Scheduler/aco.cu.o: CUDA_COMPILER__obj.2eOptSched ../projects/OptSched/lib/Scheduler/aco.cu || cmake_object_order_depends_target_obj.OptSched
  DEP_FILE = projects/OptSched/lib/CMakeFiles/obj.OptSched.dir/Scheduler/aco.cu.o.d
  FLAGS = -x cu -Xcompiler "-fPIC -fvisibility-inlines-hidden  -std=c++14 -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -ffunction-sections -fdata-sections -O3 -DNDEBUG -fno-exceptions -fno-rtti" -gencode arch=compute_75,code=sm_75 -dlink --ptxas-options=-v -rdc true -lineinfo
  INCLUDES = -Iprojects/OptSched/lib -I../projects/OptSched/lib -Iinclude -I../include -I../projects/OptSched/include -I/usr/local/cuda-11.3/include
  OBJECT_DIR = projects/OptSched/lib/CMakeFiles/obj.OptSched.dir
  OBJECT_FILE_DIR = projects/OptSched/lib/CMakeFiles/obj.OptSched.dir/Scheduler
  TARGET_COMPILE_PDB = projects/OptSched/lib/CMakeFiles/obj.OptSched.dir/

Hello Felix,

Your build/link instructions are correct and match mine. This error is caused by cmake not linking device code properly.

I actually just ran into this problem myself when building the project on a new machine. For some reason newer versions of cmake (3.16 in my case) do not link the device code properly (cmake_device_link.o is not present in build.ninja). Reverting to cmake 3.10.2 (the version on my personal machine) fixed this issue and allowed device code to be linked properly using the CMakeLists.txt present in the project.

What were the minor modifications you made? I would like to have as much documentation as possible for building this project as until yesterday I had only built it on my own machine.

I also found that on the new server I was setting up, I needed to add -I/usr/local/cuda/include to the includes of files in build.ninja compiled by CXX_COMPILER, which I did not need to do on my machine. This is probably because the CUDA includes were not present in CPATH when configuring the build with cmake.

On the new machine running Ubuntu 20 and a Tesla T4, I still get the ==WARNING== No kernels were profiled. message when trying to profile clang-7.

Thank you for taking the time to investigate this issue, I really appreciate it.

Let me know if cmake_device_link.o is missing from your build.ninja and if changing cmake version fixes the issue.

Adding the include directories, and changing the protected members to public for CallSiteBase.

I rebuilt it using cmake 3.10.2, and I can now execute the clang-7 command, but I see no CUDA work being launched, neither through nvprof, nor through Nsight Systems, and not in nvidia-smi.

Note also that this step obviously can’t work, as the file does not exist:

git apply projects/OptSched/patches/llvm7.0/flang-llvm7-print-spilling-info.patch

Here is the output of the clang-7 execution:

clang-7 -v -m64 -c -o test.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -O3 -fplugin=llvm/build2/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched-cfg=llvm/projects/OptSched/example/optsched-cfg/ /home/user/cpp_quiz/quiz.cpp
clang version 7.0.1-12 (tags/RELEASE_701/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /bin
Found candidate GCC installation: /bin/../lib/gcc/x86_64-linux-gnu/9
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/9
Selected GCC installation: /bin/../lib/gcc/x86_64-linux-gnu/9
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda, version 7.0
"/usr/lib/llvm-7/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -discard-value-names -main-file-name quiz.cpp -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -dwarf-column-info -debugger-tuning=gdb -momit-leaf-frame-pointer -v -coverage-notes-file /home/user/flang/v7flang/test.gcno -resource-dir /usr/lib/llvm-7/lib/clang/7.0.1 -D SPEC -D NDEBUG -D SPEC_AUTO_SUPPRESS_OPENMP -internal-isystem /bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9 -internal-isystem /bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9 -internal-isystem /bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9 -internal-isystem /bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/backward -internal-isystem /usr/include/clang/7.0.1/include/ -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-7/lib/clang/7.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -fdeprecated-macro -fdebug-compilation-dir /home/user/flang/v7flang -ferror-limit 19 -fmessage-length 207 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -load llvm/build2/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched-cfg=llvm/projects/OptSched/example/optsched-cfg/ -o test.o -x c++ /home/user/cpp_quiz/quiz.cpp -faddrsig
clang -cc1 version 7.0.1 based upon LLVM 7.0.1 default target x86_64-pc-linux-gnu
ignoring nonexistent directory "/include"
ignoring duplicate directory "/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9"
ignoring duplicate directory "/usr/include/clang/7.0.1/include"
#include "..." search starts here:
#include <...> search starts here:
End of search list.

Hello Felix,

The print spilling info patch is optional. I forgot to merge it into my branch, but it is not necessary.

I think this time the issue is the settings under llvm/projects/OptSched/example/optsched-cfg/sched.ini. The first option needs to be changed to YES.

# Use optimizing scheduling
# NO : No scheduling is done.
# HOT_ONLY: Only use scheduler with hot functions.

This is my mistake; I pushed this change to the default configuration by accident. Our scheduler only has hot region info for the benchmarks we are testing on.

On a similar note, the GPU scheduler is only set to launch if the scheduling region has 50 or more instructions, so depending on the size of regions in your code it might not be invoked. This can also be fixed by modifying llvm/projects/OptSched/include/opt-sched/Scheduler/aco.h and changing #define REGION_MIN_SIZE 50 to 0 and then invoking ninja again in the build directory.
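As a rough illustration of the size threshold described above, here is a minimal sketch. Only the REGION_MIN_SIZE macro name and its default of 50 come from this thread; the helper function and its name are hypothetical, and the actual check in OptSched may be written differently.

```cpp
// Assumption: mirrors the threshold described in the post above.
// The #ifndef guard is added here only so the value can be overridden
// for experiments (e.g. -DREGION_MIN_SIZE=0); the real aco.h may not have it.
#ifndef REGION_MIN_SIZE
#define REGION_MIN_SIZE 50
#endif

// Hypothetical helper: returns true when a scheduling region contains
// enough instructions to be handed to the GPU scheduler.
inline bool region_uses_gpu_scheduler(int instruction_count) {
    return instruction_count >= REGION_MIN_SIZE;
}
```

With the default of 50, a region of 49 instructions would skip the GPU path, which may be why small test files show no kernel launches at all.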

I apologize for this oversight. Everything else looks correct and the scheduler should work after the above modifications.

I can now reproduce this locally after setting the config in the .ini file. I think this might be because spawning processes with clone() is not fully supported in Nsight Compute, and clang-7 seems to use this mechanism.

I am glad to hear you have been able to reproduce this and it is not just me.
Are there any plans to add support, or any possible workarounds for this? This is quite unfortunate since nvprof is not supported on the RTX 3080, so I am unable to get device profiles any other way as far as I know.

Yes, we will be looking into supporting this, but there is no short-term solution for ncu available beyond adding profiling directly to your target library. Technically, you can do this using CUPTI’s Profiling API [1], but it’s much less straightforward than using the tool, of course.

An alternative could be to use Nsight Systems’ new metric sampling feature [2], which will give you a limited set of metrics over time (rather than directly correlated with your kernel). Last but not least, if you are building clang yourself anyway, maybe you can change it to use fork+exec rather than clone for child process creation?
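For reference, the fork+exec pattern mentioned above could look roughly like the sketch below. This is a generic POSIX illustration, not clang's actual process-spawning code; the function name run_child_fork_exec is hypothetical, and whether this change is enough for ncu to follow the child process is not confirmed in this thread.

```cpp
#include <cerrno>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not taken from clang): run a command in a child
// process created with fork() and replaced via execvp(), then wait for it.
// Returns the child's exit code (127 if the exec itself failed), or -1 if
// the child could not be forked or did not exit normally.
int run_child_fork_exec(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // Build the NULL-terminated argv array before forking.
    std::vector<char*> argv;
    argv.reserve(args.size() + 1);
    for (const std::string& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;  // fork failed

    if (pid == 0) {
        // Child: replace the process image with the requested program.
        execvp(argv[0], argv.data());
        _exit(127);  // only reached if execvp failed
    }

    // Parent: wait for the child, retrying if interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as run_child_fork_exec({"sh", "-c", "exit 3"}) returns the child's exit code, so the calling code can still propagate failures the same way as before.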

[1] CUPTI :: CUPTI Documentation
[2] Latest Nsight Developer Tools Releases: Nsight Systems 2021.2, Nsight Compute 2021.1, Nsight Visual Studio Code Edition | NVIDIA Developer Blog

Ok I will look into these options. Thank you very much for your help!
I know the build process is pretty lengthy so I appreciate you sticking by it and diagnosing my issue.