My test OpenMP offload code profiles with nvprof but not with ncu, which returns error code (11). This case looks similar to "Nsight Compute does not detect kernel launches for OpenMP offloaded code", but the error is different. I also tried the LD_LIBRARY_PATH solution suggested there, and it returned the same error code 11.
> ncu ./saxpy_clang
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
> which clang++
/project/dsi/apps/llvm/17.0.5/bin/clang++
> LD_LIBRARY_PATH=/project/dsi/apps/llvm/17.0.5/lib:$LD_LIBRARY_PATH ncu ./saxpy_clang
==PROF== Target process 2593864 terminated before first instrumented API call.
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
> clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.cpp -o saxpy_clang
> ./saxpy_clang
Time of kernel: 0.140592
> nvprof ./saxpy_clang
==2593442== NVPROF is profiling process 2593442, command: ./saxpy_clang
==2593442== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
Time of kernel: 0.272970
==2593442== Profiling application: ./saxpy_clang
==2593442== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   97.46%  243.30us      1  243.30us  243.30us  243.30us  __omp_offloading_3a_1b618a2__Z5saxpyfPfS_i_l9
                    1.71%  4.2570us      3  1.4190us  1.3120us  1.5050us  [CUDA memcpy HtoD]
                    0.83%  2.0800us      1  2.0800us  2.0800us  2.0800us  [CUDA memcpy DtoH]
      API calls:   75.35%  256.90ms      1  256.90ms  256.90ms  256.90ms  cuDevicePrimaryCtxRetain
                   19.84%  67.633ms      1  67.633ms  67.633ms  67.633ms  cuDevicePrimaryCtxRelease
                    2.41%  8.2156ms      1  8.2156ms  8.2156ms  8.2156ms  cuModuleLoadDataEx
                    1.74%  5.9242ms      1  5.9242ms  5.9242ms  5.9242ms  cuLaunchKernel
                    0.18%  613.33us     32  19.166us  2.6290us  317.01us  cuStreamCreate
                    0.14%  463.21us      1  463.21us  463.21us  463.21us  cuModuleUnload
                    0.13%  439.62us      2  219.81us  11.161us  428.45us  cuMemAlloc
                    0.07%  237.81us      1  237.81us  237.81us  237.81us  cuMemcpyDtoHAsync
                    0.04%  132.47us      2  66.237us  17.837us  114.64us  cuMemFree
                    0.04%  124.43us     32  3.8880us  2.9910us  15.030us  cuStreamDestroy
                    0.03%  113.62us      2  56.810us  7.6280us  105.99us  cuStreamSynchronize
                    0.02%  57.325us      3  19.108us  10.830us  25.945us  cuMemcpyHtoDAsync
                    0.01%  20.870us     32     652ns     559ns  1.8260us  cuEventCreate
                    0.01%  18.234us      1  18.234us  18.234us  18.234us  cuDeviceGetPCIBusId
                    0.00%  16.072us     32     502ns     393ns  2.1330us  cuEventDestroy
                    0.00%  14.364us      3  4.7880us  2.9170us  7.1970us  cuEventRecord
                    0.00%  11.667us     15     777ns     279ns  1.9880us  cuCtxSetCurrent
                    0.00%  6.1770us     10     617ns     267ns  1.2040us  cuDeviceGetAttribute
                    0.00%  4.7130us      1  4.7130us  4.7130us  4.7130us  cuStreamWaitEvent
                    0.00%  3.6460us      3  1.2150us     790ns  1.7620us  cuDeviceGet
                    0.00%  2.9630us      3     987ns     423ns  1.5370us  cuDeviceGetCount
                    0.00%  2.6700us      1  2.6700us  2.6700us  2.6700us  cuModuleGetGlobal
                    0.00%  1.5620us      2     781ns     397ns  1.1650us  cuCtxGetLimit
                    0.00%  1.4900us      1  1.4900us  1.4900us  1.4900us  cuModuleGetFunction
                    0.00%  1.3260us      1  1.3260us  1.3260us  1.3260us  cuFuncGetAttribute
                    0.00%  1.2230us      1  1.2230us  1.2230us  1.2230us  cuDevicePrimaryCtxGetState
                    0.00%     860ns      1     860ns     860ns     860ns  cuDevicePrimaryCtxSetFlags
> ncu -c 1 ./saxpy_clang
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
> clang++ --version
clang version 17.0.5
Target: x86_64-redhat-linux-gnu
Thread model: posix
InstalledDir: /project/dsi/apps/llvm/17.0.5/bin
> llvm-omp-device-info
Device (0):
Device Type Generic-elf-64bit
Device (1):
Device Type Generic-elf-64bit
Device (2):
Device Type Generic-elf-64bit
Device (3):
Device Type Generic-elf-64bit
Device (4):
CUDA Driver Version 11040
CUDA OpenMP Device Number 0
Device Name NVIDIA GeForce GTX 780
Global Memory Size 3168534528 bytes
Number of Multiprocessors 12
Concurrent Copy and Execution Yes
Total Constant Memory 65536 bytes
Max Shared Memory per Block 49152 bytes
Registers per Block 65536
Warp Size 32
Maximum Threads per Block 1024
Maximum Block Dimensions
x 1024
y 1024
z 64
Maximum Grid Dimensions
x 2147483647
y 65535
z 65535
Maximum Memory Pitch 2147483647 bytes
Texture Alignment 512 bytes
Clock Rate 901500 kHz
Execution Timeout Yes
Integrated Device No
Can Map Host Memory Yes
Compute Mode Default
Concurrent Kernels Yes
ECC Enabled No
Memory Clock Rate 3004000 kHz
Memory Bus Width 384 bits
L2 Cache Size 1572864 bytes
Max Threads Per SMP 2048
Async Engines 1
Unified Addressing Yes
Managed Memory Yes
Concurrent Managed Memory No
Preemption Supported No
Cooperative Launch No
Multi-Device Boars No
Compute Capabilities sm_35
This is the test code (saxpy.cpp):

#include <cstdio>   // printf
#include <cstdlib>  // malloc, calloc, free
#include <omp.h>    // omp_get_wtime

void saxpy(float a, float* x, float* y, int sz) {
    double tb = omp_get_wtime();
    // Offload the loop to the GPU: x is copied in, y is copied in and back out.
    // The combined loop directive must be followed directly by the for loop.
    #pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    double te = omp_get_wtime();
    // The measured time includes data transfers, not only the kernel itself.
    printf("Time of kernel: %lf\n", te - tb);
}

int main() {
    auto x = (float*) malloc(1000 * sizeof(float));
    auto y = (float*) calloc(1000, sizeof(float));
    for (int i = 0; i < 1000; i++) {
        x[i] = i;
    }
    saxpy(42, x, y, 1000);
    free(x);
    free(y);
    return 0;
}
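
The test code only prints a timing, so it does not show whether the offloaded loop produced correct values. A host-side check along the following lines could be added for that. This is only a sketch: `saxpy_host` and `count_mismatches` are hypothetical helper names that are not part of the original test, and the relative tolerance of 1e-5 is an assumed value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Host-only reference (hypothetical helper): same arithmetic as the offloaded loop,
// y[i] = a * x[i] + y[i], but executed on the CPU without any OpenMP directive.
void saxpy_host(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Count elements of `got` that differ from `expected` by more than a relative
// tolerance (hypothetical helper; tol = 1e-5 is an assumed default).
std::size_t count_mismatches(const float* got, const float* expected,
                             std::size_t n, float tol = 1e-5f) {
    std::size_t bad = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Scale the tolerance by the magnitude of the expected value,
        // with a floor of 1 so values near zero are compared absolutely.
        float scale = std::fmax(1.0f, std::fabs(expected[i]));
        if (std::fabs(got[i] - expected[i]) > tol * scale) {
            ++bad;
        }
    }
    return bad;
}
```

In the test program, one would keep a second zero-initialized array, fill it with `saxpy_host(42, x, ref, 1000)`, and compare it against `y` after the offloaded `saxpy` call; a nonzero count from `count_mismatches(y, ref, 1000)` would suggest the target region did not compute what was expected.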