I am having a hard time profiling my instruction scheduling kernel with NVIDIA Nsight Compute. I recently upgraded to an RTX 3080 in my environment and can no longer use nvprof as I did before. I am trying to profile a plugin for Clang-7 that performs instruction scheduling by launching a CUDA kernel to perform ACO scheduling. I had no issues profiling with nvprof on my GTX 1080, but when I use the same launch parameters with Nsight Compute, the kernel launches and executes as expected, yet Nsight Compute reports:
==WARNING== No kernels were profiled.
I did make some changes to the kernel after switching to the RTX 3080. I changed my thread synchronization to use this_grid().sync() to sync all threads in the grid, and I switched from the usual kernel launch to cudaLaunchCooperativeKernel() so that I could use cooperative groups. I also updated the nvcc flags for the new GPU architecture:
-gencode arch=compute_86,code=sm_86
Other than these changes, nothing differs between using nvprof on my GTX 1080 and attempting to use Nsight Compute on my RTX 3080.
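To illustrate the kind of launch change I mean, here is a simplified sketch. It is not my actual scheduler code: the kernel name twoPhaseKernel, the two buffers, and the block size of 128 are hypothetical placeholders, and the build line in the first comment is an assumption based on the flags I list further down. It only shows a grid-wide this_grid().sync() combined with cudaLaunchCooperativeKernel().

```cuda
// Assumed build line: nvcc -gencode arch=compute_86,code=sm_86 -rdc=true coop_sketch.cu -o coop_sketch
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical stand-in for the scheduling kernel: two phases separated
// by a barrier across every thread in the grid.
__global__ void twoPhaseKernel(int *in, int *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) in[i] = i;               // phase 1: each thread writes its own slot
    grid.sync();                        // wait for all blocks, not just this one
    if (i < n) out[i] = in[n - 1 - i];  // phase 2: read a slot written by another thread
}

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e = (call);                                            \
        if (e != cudaSuccess) {                                            \
            std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                         cudaGetErrorString(e));                           \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main() {
    int dev = 0, supported = 0;
    CHECK(cudaSetDevice(dev));
    CHECK(cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev));
    if (!supported) {
        std::fprintf(stderr, "cooperative launch not supported on this device\n");
        return 1;
    }

    // A cooperative grid must be fully resident on the GPU at once, so the
    // grid size is derived from the occupancy of this kernel.
    const int blockSize = 128;  // hypothetical choice
    int blocksPerSm = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, twoPhaseKernel,
                                                        blockSize, 0));
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, dev));
    int numBlocks = blocksPerSm * prop.multiProcessorCount;
    int n = numBlocks * blockSize;

    int *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaMalloc(&dIn, n * sizeof(int)));
    CHECK(cudaMalloc(&dOut, n * sizeof(int)));

    // Kernel arguments are passed as an array of pointers to each argument.
    void *args[] = {&dIn, &dOut, &n};
    CHECK(cudaLaunchCooperativeKernel((void *)twoPhaseKernel, dim3(numBlocks),
                                      dim3(blockSize), args, 0, 0));
    CHECK(cudaDeviceSynchronize());

    // Verify on the host that phase 2 saw every phase 1 write.
    std::vector<int> hOut(n);
    CHECK(cudaMemcpy(hOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hOut[i] == n - 1 - i);
    std::printf("result %s\n", ok ? "correct" : "incorrect");

    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    return ok ? 0 : 1;
}
```

In my real code the launch happens inside OptSched.so rather than in a standalone main(), which may or may not matter for the profiler.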
I am unable to make a minimal reproduction of this issue; the NVIDIA toolkit sample kernels profile as expected.
For reference, here is the command I ran in the terminal. It produced the message
==WARNING== No kernels were profiled.
even though I can confirm that multiple kernels launched and executed successfully:
sudo /usr/local/cuda-11.2/bin/ncu --target-processes all /home/vlad/CompilerProject/v7flang/flang-install/bin/clang -m64 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -O3 -fplugin=/home/vlad/CompilerProject/v7flang/llvm/dev_aco_release/lib/OptSched.so -mllvm -misched=optsched -mllvm -optsched-cfg=/home/vlad/CompilerProject/optsched-cfgs/GPU_ACO -DSPEC_LP64 /home/vlad/CompilerProject/CPU2017/benchspec/CPU/519.lbm_r/src/lbm.c
This command launches clang-7 with my scheduling plugin OptSched.so (which contains the scheduling CUDA kernel) to build the lbm benchmark in SPEC CPU2017. The scheduling kernel launches successfully 12 times while building lbm.c and reports no errors, but no kernels are profiled.
In case it is important, here are the flags that I use to compile my CUDA C++ code with nvcc:
-x cu -Xcompiler "-fPIC -fvisibility-inlines-hidden -Werror=date-time -std=c++14 -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -ffunction-sections -fdata-sections -O3 -DNDEBUG -fno-exceptions -fno-rtti" -gencode arch=compute_86,code=sm_86 -dlink --ptxas-options=-v -rdc true -lineinfo
This is on Ubuntu 18.04 with the latest CUDA 11.2 toolkit. I should note that Nsight Systems profiles the host code without issue using the same configuration and detects the kernel launches, but it does not provide any useful information about them.
What am I doing wrong here? Can cudaLaunchCooperativeKernel() kernels not be profiled? Am I missing a compilation flag or is one of my flags preventing profiling?
I appreciate any input anyone has on this issue, as it is preventing me from optimizing kernel performance on my research project.
Thank you for your time. Please ask if you have any questions about my environment.