I am currently working with OpenMP offloading using LLVM/clang-16 (built from source from GitHub). Using the profiling facilities built into clang's OpenMP runtime (environment variables such as LIBOMPTARGET_PROFILE=profile.json and LIBOMPTARGET_INFO) I was able to confirm that my code is executed on my GPU (GTX 1080, CC 6.1). However, when I try to profile the code using nvprof or ncu (from the NVIDIA Nsight tool suite), I get a warning stating that the profiler did not detect any kernel launches:
> ncu ./saxpy
Time of kernel: 0.000004
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
This is my test code:
#include <cstdio>
#include <cstdlib>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    // The combined loop construct must be followed directly by the for loop,
    // not by a compound statement.
    #pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    // Note: this interval also covers the host-device data transfers of the map clauses.
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}

int main() {
    auto x = (float*) malloc(1000 * sizeof(float));
    auto y = (float*) calloc(1000, sizeof(float));
    for (int i = 0; i < 1000; i++) {
        x[i] = i;
    }
    saxpy(42, x, y, 1000);
    free(x);
    free(y);
    return 0;
}
What do I need to do to enable profiling? I have seen others use ncu on clang-compiled OpenMP offloading code without additional steps, but maybe I am missing something.
In general, Nsight Compute (ncu) does not support Pascal-class GPUs. In your case, however, it does not even seem to detect a kernel being executed on that device, or the kernel is not executed on the device at all; otherwise ncu would have shown the corresponding unsupported-device error message. nvprof does still support profiling on Pascal GPUs, so the fact that it also reports no kernel launches suggests that the kernel is either not actually running on the device or not being recognized. You may need to pass --openmp-profiling on to nvprof, see the documentation here. It would also help if you posted the output you get when tracing the application with nvprof here for reference.
Alright, I was not aware of that. The GTX 1080 was just what I had locally, so I switched to my actual testing GPU (A100). With --openmp-profiling on I was able to see nvprof output locally on my GTX 1080. What can I do to enable profiling with ncu or nsys on the A100? After a quick look at the documentation I did not see an option similar to the nvprof one.
Edit: I used another small test code to check if the device is correctly used:
Compiled with clang++ test.cpp -fopenmp --offload-arch=<your-sm> -o test.
This code returns 1 whenever I start it with ncu ./test, meaning the code is not correctly run on the target device. When run “standalone” (just ./test) it returns 0…
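A silent fallback to the host like this can be made explicit with the standard OpenMP environment variable OMP_TARGET_OFFLOAD: when it is set to MANDATORY, the runtime is supposed to abort if a target region cannot run on a device instead of quietly executing it on the host. The sketch below wraps that in a small helper and also sets LIBOMPTARGET_INFO=-1, which asks LLVM's libomptarget to print all of its info messages. The function name run_offload_strict is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper (the name is illustrative): run a command with host
# fallback disabled, so a failed offload aborts instead of passing silently.
run_offload_strict() {
  # OMP_TARGET_OFFLOAD=MANDATORY is defined by the OpenMP 5.0 specification.
  # LIBOMPTARGET_INFO=-1 enables all info messages of LLVM's libomptarget.
  OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_INFO=-1 "$@"
}

# Example (needs ncu and the test binary, so it is left commented out):
# run_offload_strict ncu ./test
```

If the program aborts under ncu but not standalone, the runtime could not set up the device inside the profiler's environment, which narrows the problem down to how the offload runtime is loaded there.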
I tried compiling this app with NVIDIA’s nvc compiler from the HPC SDK with OpenMP enabled and ncu can profile it fine.
root@83d6452ff365:~# nvc -mp=gpu omp_test.c
root@83d6452ff365:~# ./a.out
root@83d6452ff365:~# echo $?
0
root@83d6452ff365:~# ncu ./a.out
==PROF== Connected to process 105 (/root/a.out)
==PROF== Profiling "nvkernel_main_F1L6_2" - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 105
[105] a.out@127.0.0.1
nvkernel_main_F1L6_2, 2023-Jan-10 14:26:51, Context 1, Stream 14
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 4.71
SM Frequency cycle/usecond 946.81
Elapsed Cycles cycle 7142
Memory [%] % 10.65
DRAM Throughput % 3.26
Duration usecond 7.52
L1/TEX Cache Throughput % 12.95
L2 Cache Throughput % 10.65
SM Active Cycles cycle 3927.10
Compute (SM) [%] % 7.24
---------------------------------------------------------------------- --------------- ------------------------------
This indicates that the issue may be specific to clang’s implementation of OpenMP offload. Can you use the NVIDIA HPC compiler, or otherwise provide more details on how to get or build the version of clang you are using?
I am aware that it works with the nvc compiler but I sadly cannot use it for my project.
The clang that I use is built from source, specifically the source at this commit (so the version is clang-16).
The exact procedure to build this offloading-capable version of LLVM/clang is as follows:
Clone the repository
git clone https://github.com/llvm/llvm-project.git -b main --single-branch
cd llvm-project
git reset --hard 847fa84b3d346313bbad31d4c76b0f70d73827aa
Build the compiler
mkdir build
cd build
cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70,80 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;openmp;compiler-rt"
make -j20
Bootstrap to create some device-specific libraries:
cd ..
mkdir build2
cd build2
CC=../build/bin/clang CXX=../build/bin/clang++ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_80 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70,80 -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;openmp;compiler-rt" -DCMAKE_INSTALL_PREFIX=/opt
make -j20
make install
So overall a very complicated process to get to the exact compiler version I am using…
I tried to follow your steps, but building the compiler failed with this error:
/llvm-project/llvm/lib/Support/Compression.cpp: In function ‘void llvm::compression::zlib::compress(llvm::ArrayRef<unsigned char>, llvm::SmallVectorImpl<unsigned char>&, int)’:
/llvm-project/llvm/lib/Support/Compression.cpp:109:36: error: ‘::compressBound’ has not been declared; did you mean ‘compress2’?
109 | unsigned long CompressedSize = ::compressBound(Input.size());
That’s sadly an error I have not yet encountered, so I can only guess why it happens…
What compiler are you using to compile LLVM (for reference: I am using gcc-11.3.0)?
Do you perhaps have access to Singularity (the container platform)? I could provide you with a working definition file for it.
And thanks for putting so much effort into solving this niche problem!
I was able to reproduce the behavior you are seeing when compiling with gcc 11. The problem is that when the application is run under ncu, the OpenMP CUDA runtime library used by clang is not found by the dynamic linker. We will be looking into why that is the case. In the meantime, passing the path to the clang OpenMP libraries manually via LD_LIBRARY_PATH should work around the problem. This worked for me:
$ LD_LIBRARY_PATH=/opt/lib ncu ./a.out
==PROF== Connected to process 115098 (/var/llvm/a.out)
==PROF== Profiling "__omp_offloading_802_a1546_ma..." - 0: 0%....50%....100% - 10 passes
target
==PROF== Disconnected from process 115098
[115098] a.out@127.0.0.1
__omp_offloading_802_a1546_main_l7 (1, 1, 1)x(33, 1, 1), Context 1, Stream 13, Device 0, CC 8.0
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 1.29
SM Frequency cycle/usecond 730.72
Elapsed Cycles cycle 29,043
Memory Throughput % 0.09
DRAM Throughput % 0.00
Duration usecond 39.74
L1/TEX Cache Throughput % 3.03
L2 Cache Throughput % 0.09
SM Active Cycles cycle 218.71
Compute (SM) Throughput % 0.02
----------------------- ------------- ------------
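To avoid typing the path on every invocation, the workaround can be wrapped in a small shell function. This is only a sketch: it assumes the clang OpenMP libraries are in /opt/lib, as in the command above, and both the OMP_LIB_DIR variable and the with_omp_libs function name are made up here.

```shell
# Assumption: clang's OpenMP runtime libraries are installed in /opt/lib,
# matching the LD_LIBRARY_PATH used above. Override OMP_LIB_DIR otherwise.
OMP_LIB_DIR="${OMP_LIB_DIR:-/opt/lib}"

# Hypothetical helper (the name is illustrative): run a command with
# OMP_LIB_DIR prepended to LD_LIBRARY_PATH, keeping entries already set.
with_omp_libs() {
  LD_LIBRARY_PATH="${OMP_LIB_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" "$@"
}

# Example (needs ncu and the binary, so it is left commented out):
# with_omp_libs ncu ./a.out
```

Prepending rather than overwriting keeps any library paths that are already configured, and the variable is only changed for the wrapped command, not for the rest of the shell session.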