Profiling a Python script that uses CuPy with the Nsight Compute CLI

I am trying to profile a deep learning script that uses CuPy to compute on an NVIDIA GPU, using the Nsight Compute CLI (ncu).

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu python3.9
Background: running ncu without any parameters on the complete script works fine. My problem is with profiling specific parts of the code.

Profiling the whole script is expected to take 651 hours, so I want to profile only a specific part.

The aim is to profile the “” or "attention" part of the code. The outcome would be a .rep file that can be opened with Nsight Compute.

The script returns attention.
When I run it, it calls a subroutine known as and I want to profile the specific part where this function is being called.

Initially I load the modules and allocate a node on Alex.

The working directory on the node is iwia050h@a0705:~/numpy-transformer-master/transformer$

CLI commands I have used are:

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu -k regex:attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu -k attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu --kernel-name attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu --kernel-name python3.9

It starts profiling with:

==PROF== Connected to process 356412 (/apps/python/3.9-anaconda/bin/python3.9)

After the code has run, the message I receive is:

==PROF== Disconnected from process 356412
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

It is not profiling self_attention or attention. Could you please guide me?

What am I doing wrong?

Thanks for reaching out. For reference, what GPU is this running on and what version of CUDA is installed?

Do you have access to the GUI for an interactive profile? One thing to try would be stepping through the API calls until you reach the kernel of interest, and then profiling it. You could also verify the compiled name of the kernel to use in the CLI filters.

I’m not sure how the kernel names appear to Nsight Compute from a numpy python-based kernel. Perhaps they are mangled somehow and don’t match the regex you provide.
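To illustrate why a Python function name may never match: the `-k` filter is applied to the names of the CUDA kernels that get launched, not to the Python functions that launch them. The sketch below uses Python's `re` module as a stand-in for the matching `ncu` performs (its exact regex semantics are an assumption here). The kernel names are hypothetical examples of the kind a CuPy program might launch, not output from this script.

```python
import re

# Hypothetical kernel names, for illustration only. The real names should be
# read from an unfiltered ncu run or from an Nsight Systems trace.
launched_kernels = [
    "cupy_multiply",
    "cupy_exp",
    "cupy_sum",
    "ampere_sgemm_128x64_nn",
]


def matching_kernels(pattern, names):
    """Return the kernel names that contain a match for the regex pattern."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


# The Python function is called "attention", but no kernel carries that name.
print(matching_kernels("attention", launched_kernels))

# A filter built from names that actually occur does select kernels.
print(matching_kernels("gemm|cupy_exp", launched_kernels))
```

If the real kernel list looks like this, the filter has to be written against those kernel names rather than against `attention` or `self_attention`.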

It may also be that you need the flag “--target-processes all”, since Python may launch child processes. Try using that flag without any kernel regex to see if you at least encounter some kernels.
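A minimal sketch of that first check, assuming the script is called `train.py` (a hypothetical name, since the original command does not show it). It only assembles the `ncu` argument list, so it can be printed and inspected or handed to `subprocess.run`. `--target-processes all` comes from the warning above; `--launch-count` and `-o` are standard `ncu` options added here as a suggestion to keep the run short and to write a report file.

```python
import shlex


def build_ncu_command(script, launch_count=None, output=None, kernel_regex=None):
    """Assemble an ncu command line as a list of arguments.

    script       -- Python script to profile (hypothetical name in this sketch)
    launch_count -- stop after this many profiled kernel launches
    output       -- base name of the report file to write
    kernel_regex -- optional regex applied to CUDA kernel names
    """
    cmd = ["ncu", "--target-processes", "all"]
    if kernel_regex is not None:
        cmd += ["--kernel-name", f"regex:{kernel_regex}"]
    if launch_count is not None:
        cmd += ["--launch-count", str(launch_count)]
    if output is not None:
        cmd += ["-o", output]
    cmd += ["python3.9", script]
    return cmd


# No kernel filter at first: the goal is only to see whether any kernel shows up.
cmd = build_ncu_command("train.py", launch_count=20, output="first_kernels")
print(shlex.join(cmd))
```

Once some kernels appear in the report, their names can be passed back in through `kernel_regex` to narrow the run.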

You could also try an Nsight Systems profile, which has much lower overhead and should report the names of the CUDA kernels that ran, to see if that behaves as expected.