Ncu does not detect kernels, ==ERROR== The application returned an error code (11)

jebalunode · November 25, 2023, 6:07pm

test opennmp code profiles with nvprof but does not profile with ncu, it returns error code (11). This case looks similar to Nsight Compute does not detect kernel launches for OpenMP offloaded code, but the error is different. I tried the LD_LIBRARY_PATH solution as well, and it did return same error code 11.

> ncu  ./saxpy_clang 
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

 which clang++
/project/dsi/apps/llvm/17.0.5/bin/clang++

> LD_LIBRARY_PATH=/project/dsi/apps/llvm/17.0.5/lib:$LD_LIBRARY_PATH ncu  ./saxpy_clang 
==PROF== Target process 2593864 terminated before first instrumented API call.
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.


>  clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.cpp -o saxpy_clang

> ./saxpy_clang 
Time of kernel: 0.140592

> nvprof ./saxpy_clang 
==2593442== NVPROF is profiling process 2593442, command: ./saxpy_clang
==2593442== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version 
Time of kernel: 0.272970
==2593442== Profiling application: ./saxpy_clang
==2593442== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.46%  243.30us         1  243.30us  243.30us  243.30us  __omp_offloading_3a_1b618a2__Z5saxpyfPfS_i_l9
                    1.71%  4.2570us         3  1.4190us  1.3120us  1.5050us  [CUDA memcpy HtoD]
                    0.83%  2.0800us         1  2.0800us  2.0800us  2.0800us  [CUDA memcpy DtoH]
      API calls:   75.35%  256.90ms         1  256.90ms  256.90ms  256.90ms  cuDevicePrimaryCtxRetain
                   19.84%  67.633ms         1  67.633ms  67.633ms  67.633ms  cuDevicePrimaryCtxRelease
                    2.41%  8.2156ms         1  8.2156ms  8.2156ms  8.2156ms  cuModuleLoadDataEx
                    1.74%  5.9242ms         1  5.9242ms  5.9242ms  5.9242ms  cuLaunchKernel
                    0.18%  613.33us        32  19.166us  2.6290us  317.01us  cuStreamCreate
                    0.14%  463.21us         1  463.21us  463.21us  463.21us  cuModuleUnload
                    0.13%  439.62us         2  219.81us  11.161us  428.45us  cuMemAlloc
                    0.07%  237.81us         1  237.81us  237.81us  237.81us  cuMemcpyDtoHAsync
                    0.04%  132.47us         2  66.237us  17.837us  114.64us  cuMemFree
                    0.04%  124.43us        32  3.8880us  2.9910us  15.030us  cuStreamDestroy
                    0.03%  113.62us         2  56.810us  7.6280us  105.99us  cuStreamSynchronize
                    0.02%  57.325us         3  19.108us  10.830us  25.945us  cuMemcpyHtoDAsync
                    0.01%  20.870us        32     652ns     559ns  1.8260us  cuEventCreate
                    0.01%  18.234us         1  18.234us  18.234us  18.234us  cuDeviceGetPCIBusId
                    0.00%  16.072us        32     502ns     393ns  2.1330us  cuEventDestroy
                    0.00%  14.364us         3  4.7880us  2.9170us  7.1970us  cuEventRecord
                    0.00%  11.667us        15     777ns     279ns  1.9880us  cuCtxSetCurrent
                    0.00%  6.1770us        10     617ns     267ns  1.2040us  cuDeviceGetAttribute
                    0.00%  4.7130us         1  4.7130us  4.7130us  4.7130us  cuStreamWaitEvent
                    0.00%  3.6460us         3  1.2150us     790ns  1.7620us  cuDeviceGet
                    0.00%  2.9630us         3     987ns     423ns  1.5370us  cuDeviceGetCount
                    0.00%  2.6700us         1  2.6700us  2.6700us  2.6700us  cuModuleGetGlobal
                    0.00%  1.5620us         2     781ns     397ns  1.1650us  cuCtxGetLimit
                    0.00%  1.4900us         1  1.4900us  1.4900us  1.4900us  cuModuleGetFunction
                    0.00%  1.3260us         1  1.3260us  1.3260us  1.3260us  cuFuncGetAttribute
                    0.00%  1.2230us         1  1.2230us  1.2230us  1.2230us  cuDevicePrimaryCtxGetState
                    0.00%     860ns         1     860ns     860ns     860ns  cuDevicePrimaryCtxSetFlags

> ncu -c 1 ./saxpy_clang 
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.


 > clang++ --version
clang version 17.0.5
Target: x86_64-redhat-linux-gnu
Thread model: posix
InstalledDir: /project/dsi/apps/llvm/17.0.5/bin

> llvm-omp-device-info 
Device (0):
    Device Type    Generic-elf-64bit

Device (1):
    Device Type    Generic-elf-64bit

Device (2):
    Device Type    Generic-elf-64bit

Device (3):
    Device Type    Generic-elf-64bit

Device (4):
    CUDA Driver Version              11040
    CUDA OpenMP Device Number        0
    Device Name                      NVIDIA GeForce GTX 780
    Global Memory Size               3168534528 bytes
    Number of Multiprocessors        12
    Concurrent Copy and Execution    Yes
    Total Constant Memory            65536 bytes
    Max Shared Memory per Block      49152 bytes
    Registers per Block              65536
    Warp Size                        32
    Maximum Threads per Block        1024
    Maximum Block Dimensions         
        x                            1024
        y                            1024
        z                            64
    Maximum Grid Dimensions          
        x                            2147483647
        y                            65535
        z                            65535
    Maximum Memory Pitch             2147483647 bytes
    Texture Alignment                512 bytes
    Clock Rate                       901500 kHz
    Execution Timeout                Yes
    Integrated Device                No
    Can Map Host Memory              Yes
    Compute Mode                     Default
    Concurrent Kernels               Yes
    ECC Enabled                      No
    Memory Clock Rate                3004000 kHz
    Memory Bus Width                 384 bits
    L2 Cache Size                    1572864 bytes
    Max Threads Per SMP              2048
    Async Engines                    1
    Unified Addressing               Yes
    Managed Memory                   Yes
    Concurrent Managed Memory        No
    Preemption Supported             No
    Cooperative Launch               No
    Multi-Device Boars               No
    Compute Capabilities             sm_35

This is the test code

#include <iostream>
#include <omp.h>
#include <cstdlib>

void saxpy(float a, float* x, float* y, int sz) {
	double t = 0.0;
	double tb, te;
	tb = omp_get_wtime();
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
{
	for (int i = 0; i < sz; i++) {
		y[i] = a * x[i] + y[i];
	}
}
	te = omp_get_wtime();
	t = te - tb;
	printf("Time of kernel: %lf\n", t);
}

int main() {
	auto x = (float*) malloc(1000 * sizeof(float));
	auto y = (float*) calloc(1000, sizeof(float));
	
	for (int i = 0; i < 1000; i++) {
		x[i] = i;
	}
	
	saxpy(42, x, y, 1000);
	
    return 0;
}

veraj · November 28, 2023, 6:37am

Hi, @jebalunode

NVIDIA GeForce GTX 780 is based on Kepler architecture which is not supported by NCU actually. See Release Notes :: Nsight Compute Documentation

jebalunode · November 28, 2023, 7:32pm

The issue is still the same on a volta card.
ncu ./zaxpy_clang

==ERROR== The application returned an error code (11).

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

See log below for more detail.

> llvm-omp-device-info 
Device (0):
    Device Type    Generic-elf-64bit

Device (1):
    Device Type    Generic-elf-64bit

Device (2):
    Device Type    Generic-elf-64bit

Device (3):
    Device Type    Generic-elf-64bit

Device (4):
    CUDA Driver Version              12010
    CUDA OpenMP Device Number        0
    Device Name                      Tesla V100-PCIE-32GB
    Global Memory Size               34079899648 bytes
    Number of Multiprocessors        80
    Concurrent Copy and Execution    Yes
    Total Constant Memory            65536 bytes
    Max Shared Memory per Block      49152 bytes
    Registers per Block              65536
    Warp Size                        32
    Maximum Threads per Block        1024
    Maximum Block Dimensions         
        x                            1024
        y                            1024
        z                            64
    Maximum Grid Dimensions          
        x                            2147483647
        y                            65535
        z                            65535
    Maximum Memory Pitch             2147483647 bytes
    Texture Alignment                512 bytes
    Clock Rate                       1380000 kHz
    Execution Timeout                No
    Integrated Device                No
    Can Map Host Memory              Yes
    Compute Mode                     Default
    Concurrent Kernels               Yes
    ECC Enabled                      Yes
    Memory Clock Rate                877000 kHz
    Memory Bus Width                 4096 bits
    L2 Cache Size                    6291456 bytes
    Max Threads Per SMP              2048
    Async Engines                    7
    Unified Addressing               Yes
    Managed Memory                   Yes
    Concurrent Managed Memory        Yes
    Preemption Supported             Yes
    Cooperative Launch               Yes
    Multi-Device Boars               No
    Compute Capabilities             sm_70

> clang++  -v
clang version 17.0.5
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /project/dsi/apps/easybuild/software/Clang/17.0.5-GCCcore-11.2.0-CUDA-12.0.0/bin
Found candidate GCC installation: /project/dsi/apps/easybuild/software/GCCcore/11.2.0/lib/gcc/x86_64-pc-linux-gnu/11.2.0
Selected GCC installation: /project/dsi/apps/easybuild/software/GCCcore/11.2.0/lib/gcc/x86_64-pc-linux-gnu/11.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /project/dsi/apps/easybuild/software/CUDA/12.0.0, version 12.0

>clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 zaxpy.cpp  -o zaxpy_clang --offload-arch=sm_70 -fopenmp-offload-mandatory

>  ./zaxpy_clang 
Total compute time:	0.082392 seconds

>  nvprof ./zaxpy_clang 
==158324== NVPROF is profiling process 158324, command: ./zaxpy_clang
Total compute time:	0.132908 seconds
==158324== Profiling application: ./zaxpy_clang
==158324== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   49.09%  6.0470us         3  2.0150us  1.7600us  2.1440us  [CUDA memcpy HtoD]
                   33.25%  4.0960us         1  4.0960us  4.0960us  4.0960us  __omp_offloading_3f_40381f__Z5zaxpyPdS_dm_l7
                   17.66%  2.1760us         1  2.1760us  2.1760us  2.1760us  [CUDA memcpy DtoH]
      API calls:   79.26%  132.08ms         1  132.08ms  132.08ms  132.08ms  cuDevicePrimaryCtxRetain
                   20.17%  33.602ms         1  33.602ms  33.602ms  33.602ms  cuDevicePrimaryCtxRelease
                    0.14%  235.92us        32  7.3720us  1.2690us  102.50us  cuStreamCreate
                    0.10%  174.41us         1  174.41us  174.41us  174.41us  cuModuleLoadDataEx
                    0.07%  111.48us         2  55.741us  8.5870us  102.90us  cuMemFree
                    0.05%  89.017us         2  44.508us  2.4220us  86.595us  cuMemAlloc
                    0.05%  84.798us         1  84.798us  84.798us  84.798us  cuModuleUnload
                    0.04%  65.202us        32  2.0370us  1.4770us  6.4970us  cuStreamDestroy
                    0.03%  54.487us         4  13.621us  12.196us  14.890us  cuStreamSynchronize
                    0.02%  40.014us         1  40.014us  40.014us  40.014us  cuMemcpyDtoHAsync
                    0.01%  22.920us         3  7.6400us  5.3560us  9.2740us  cuMemcpyHtoDAsync
                    0.01%  14.947us         1  14.947us  14.947us  14.947us  cuLaunchKernel
                    0.01%  14.328us        32     447ns     348ns  1.4970us  cuEventCreate
                    0.01%  9.1330us        32     285ns     209ns     831ns  cuEventDestroy
                    0.00%  5.6950us         3  1.8980us  1.3240us  2.6940us  cuEventRecord
                    0.00%  5.1230us         2  2.5610us     226ns  4.8970us  cuCtxGetLimit
                    0.00%  5.0900us         1  5.0900us  5.0900us  5.0900us  cuDeviceGetPCIBusId
                    0.00%  4.7990us        15     319ns     143ns  1.1790us  cuCtxSetCurrent
                    0.00%  3.5480us         1  3.5480us  3.5480us  3.5480us  cuDevicePrimaryCtxGetState
                    0.00%  3.2360us        10     323ns     166ns     735ns  cuDeviceGetAttribute
                    0.00%  2.4820us         3     827ns     485ns  1.3200us  cuStreamWaitEvent
                    0.00%  1.7560us         3     585ns     203ns  1.0490us  cuDeviceGet
                    0.00%  1.7300us         3     576ns     194ns  1.2140us  cuDeviceGetCount
                    0.00%  1.0320us         1  1.0320us  1.0320us  1.0320us  cuModuleGetGlobal
                    0.00%     850ns         1     850ns     850ns     850ns  cuModuleGetFunction
                    0.00%     793ns         1     793ns     793ns     793ns  cuFuncGetAttribute
                    0.00%     474ns         1     474ns     474ns     474ns  cuDevicePrimaryCtxSetFlags

>ncu  ./zaxpy_clang 
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

veraj · November 30, 2023, 11:16am

Hi, @jebalunode

Thanks for reporting this to us ! We can reproduce this internally. Our dev will have a check. Will let you know if there is any progress. Thanks !

felix_dt · December 4, 2023, 11:39am

Running this application under a debugger, we can see that LLVM’s OpenMP implementation calls the CUDA API from a static library initializer. According to the CUDA documentation, this is not allowed.

The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.

Nsight Computes relies on application’s using this interface correctly. We will try to detect this specific scenario and issue a proper error message, but the underlying problem ultimately needs to be addressed by LLVM itself.

#0  0x00007ffff28d9310 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff42f45f0 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() () from /opt/lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42ff82f in llvm::omp::target::plugin::GenericPluginTy::init() () from /opt/lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42f6529 in llvm::omp::target::plugin::Plugin::Plugin() () from /opt/lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42ffa8e in __tgt_rtl_init_plugin () from /opt/lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7ae374d in RTLsTy::attemptLoadRTL(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RTLInfoTy&) () from /opt/lib/libomptarget.so.18git
#6  0x00007ffff7ae3435 in RTLsTy::loadRTLs() () from /opt/lib/libomptarget.so.18git
#7  0x00007ffff7ae31a8 in init() () from /opt/lib/libomptarget.so.18git
#8  0x00007ffff7fe0b9a in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe2b8, env=env@entry=0x7fffffffe2c8) at dl-init.c:72
#9  0x00007ffff7fe0ca1 in call_init (env=0x7fffffffe2c8, argv=0x7fffffffe2b8, argc=1, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x7ffff7ffe190, argc=1, argv=0x7fffffffe2b8, env=0x7fffffffe2c8) at dl-init.c:119
#11 0x00007ffff7fd013a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffe574 in ?? ()
#14 0x0000000000000000 in ?? ()

felix_dt · December 13, 2023, 9:31am

We raised the issue with LLVM: openmp cuda offload incorrectly calls cuda from a static initializer · Issue #74507 · llvm/llvm-project · GitHub

Future versions of Nsight Compute will show a better error message in this case, but the fix needs to be done in LLVM OpenMP offload itself.

system · December 27, 2023, 9:31am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Nsight Compute does not detect kernel launches for OpenMP offloaded code Nsight Compute profiling	11	1547	February 28, 2023
==ERROR== Failed to prepare kernel for profiling (0xc00000fd) but CUDA sample works Nsight Compute kernel , nvbugs	13	2051	November 6, 2021
Run ncu command in ubuntu 20.04 Nsight Compute	7	5319	August 8, 2022
Nv-nsight-cu-cli segfault Nsight Compute cuda	6	1753	May 5, 2020
Nsight Compute not detecting kernel launch Nsight Compute profiling	13	3071	May 6, 2021
Nsight-Compute returns “No kernels were profiled” warning Nsight Compute	9	1420	July 27, 2023
No kernels were profiled warning/problem Nsight Compute	17	10341	December 28, 2021
NSight Compute not finding kernels Nsight Compute	24	704	October 24, 2024
Can't Get NCU GUI To Import Properly Nsight Compute	8	1338	October 5, 2020
About using ncu to profile the python code, which further called cu kernels Nsight Compute	13	969	June 15, 2024

Ncu does not detect kernels, ==ERROR== The application returned an error code (11)

Related topics