Ncu profiling failed to profile specific kernels

I am trying to profile an ML workload using:

ncu --target-processes all -k regex:"xmma" -o profile_ncu python3 application.py

However, I am getting the following error.

==ERROR== Unknown Error on device 4.
==ERROR== Failed to profile "sm90_xmma_fprop_implicit_gemm..." in process 12601
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
make: *** [Makefile:46: run_harness] Error 9

May I know what is causing this issue?

Hi, @skps23

Please first check whether “python3 application.py” runs to completion successfully.

The code runs fine when ncu is not invoked. Even a plain ncu python3 application.py causes the profiling error.

1. Please try profiling a simple CUDA sample, rather than a Python script, to see whether the issue reproduces.
2. Please provide the exact Python version on your machine.

This is exactly NVIDIA’s submission of the DLRM workload in MLPerf Inference 4.1. The entire run uses the Docker image provided by NVIDIA. The run above is the DLRM inference workload in Offline mode. This is the repo

Regarding 1), I tried profiling a hello-world program with ncu. The code is as follows:

#include <stdio.h>

__global__ void helloWorld() {
    printf("Hello, World from GPU!\n");
}

int main() {
    printf("Hello, World from CPU!\n");
    helloWorld<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

The code is compiled with nvcc helloworld.cu -o helloworld

$ ./helloworld

Hello, World from CPU!
Hello, World from GPU!

However, when running it under ncu:

$ ncu ./helloworld
Hello, World from CPU!
==PROF== Connected to process 21957 (/work/helloworld)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "helloWorld()" in process 21957
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

Is the above error related to my earlier error?

Thanks.

Are you also running ncu ./helloworld inside Docker?
Can you tell us which driver, GPU, OS, and ncu version you are using?

I am running the helloworld program inside Docker. The machine is a DGX H200x8 running Ubuntu 22.04.4 LTS (Jammy Jellyfish).

Inside the Docker container, ncu is version 2024.1.1.0 (build 33998838) (public-release).

I tried ncu ./helloworld outside the Docker container and it worked fine. The ncu version on the host (outside the container) is 2024.3.1.0 (build 34702747) (public-release).

Thanks for prompting this discussion. I also installed ncu version 2024.3 inside the Docker image. Now I am able to profile the kernels with ncu inside the container.
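
For anyone hitting the same "Unknown Error on device" message, a quick way to spot this kind of mismatch is to compare the ncu version inside the container with the one on the host. The snippet below is only a rough sketch: the helper names parse_ncu_version and installed_ncu_version are made up for illustration, and the version strings are the ones quoted in this thread. It does not encode any official minimum version; it only shows that the container had an older ncu than the host where profiling worked.

```python
import re
import shutil
import subprocess


def parse_ncu_version(text):
    """Return the first ncu-style version (e.g. 2024.1.1.0) in `text` as a tuple of ints.

    Returns None when no such version string is present.
    """
    match = re.search(r"(\d{4}(?:\.\d+){3})", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_ncu_version():
    """Run `ncu --version` if ncu is on PATH and parse its output (None otherwise)."""
    if shutil.which("ncu") is None:
        return None
    result = subprocess.run(["ncu", "--version"], capture_output=True, text=True)
    return parse_ncu_version(result.stdout)


# Version strings quoted in this thread (container vs. host).
container = parse_ncu_version("Version 2024.1.1.0 (build 33998838) (public-release)")
host = parse_ncu_version("2024.3.1.0 (build 34702747) (public-release)")

# Tuples compare element by element, so this tells us which install is older.
print("container older than host:", container < host)
```

Running installed_ncu_version() once inside the container and once on the host gives the two tuples to compare in a real setup.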

The purpose of this task is to find out the input matrix sizes passed to the kernel. I have posted a question here.

Can you tell me whether we can get the input matrix size information with ncu?

The interactive profiling activity shows the API parameters for each CUDA function call and kernel launch, but there is no option to capture and export these from the non-interactive activity or the command line.

Unfortunately, I do not have GUI access, so this needs to be done from the terminal. I am looking for terminal-based solutions.
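
One terminal-only avenue, offered as a partial workaround rather than a confirmed answer: a saved report can be printed as CSV with ncu --import profile_ncu.ncu-rep --page raw --csv, and the launch configuration (grid and block sizes) can be filtered out of that text. This does not give the matrix dimensions themselves, only the launch shape, which may or may not be enough to infer the problem size. The sketch below makes several assumptions: the function name launch_columns is hypothetical, the CSV is assumed to have a "Kernel Name" column and a units row with an empty kernel name, and the SAMPLE text is a hand-written stand-in (based on the helloWorld<<<1, 1>>> launch above), not real ncu output.

```python
import csv
import io


def launch_columns(csv_text, keywords=("grid", "block")):
    """Return [(kernel name, {column: value})] for each kernel row in ncu CSV text.

    Only columns whose header contains one of `keywords` are kept.
    Assumes a "Kernel Name" column exists; raises ValueError otherwise.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header = rows[0]
    name_idx = header.index("Kernel Name")  # assumed column name
    keep = [i for i, col in enumerate(header)
            if any(key in col.lower() for key in keywords)]
    kernels = []
    for row in rows[1:]:
        # Skip malformed rows and the units row (assumed to have no kernel name).
        if len(row) != len(header) or not row[name_idx]:
            continue
        kernels.append((row[name_idx], {header[i]: row[i] for i in keep}))
    return kernels


# Hypothetical stand-in for the CSV text; real reports have many more columns.
SAMPLE = (
    '"ID","Kernel Name","launch__grid_size","launch__block_size"\n'
    '"","","",""\n'
    '"0","helloWorld()","1","1"\n'
)

for name, columns in launch_columns(SAMPLE):
    print(name, columns)
```

In practice the CSV text would come from the ncu command above, for example by piping its output into a script that calls launch_columns(sys.stdin.read()).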