Profiling a Python script that uses CuPy with the Nsight Compute CLI

I am trying to profile a deep learning script that uses CuPy to compute on an NVIDIA GPU, using the Nsight Compute CLI (ncu).

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu python3.9
Background: running ncu without any parameters on the complete script works fine. The problem I am facing is with profiling specific parts of the code.

Profiling the complete script is expected to take 651 hours, so I want to profile only a specific part.

The aim is to profile the “” or "attention" part of the code. The outcome would be a .ncu-rep report file that can be opened in Nsight Compute.

The script returns attention.
When I run it, it calls a subroutine known as and I want to profile the specific part where this function is called.

I first load the modules and allocate a node on Alex.

The working directory on the node is iwia050h@a0705:~/numpy-transformer-master/transformer$

CLI commands I have used are:

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu -k regex:attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu -k attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu --kernel-name attention python3.9

iwia050h@a0705:~/numpy-transformer-master/transformer$ ncu --kernel-name python3.9

Profiling starts with:

==PROF== Connected to process 356412 (/apps/python/3.9-anaconda/bin/python3.9)

After the script finishes, the output I receive is:

==PROF== Disconnected from process 356412
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

It is not profiling self_attention or attention. Could you please guide me?

What am I doing wrong?

Thanks for reaching out. For reference, what GPU is this running on and what version of CUDA is installed?

Do you have access to the GUI for an interactive profile? One thing to try would be stepping through the API calls until you reach the kernel of interest, and then profiling it. You could also verify the compiled name of the kernel to use in the CLI filters.

I’m not sure how kernel names from a NumPy-style, Python-based application appear to Nsight Compute. They may be mangled somehow and not match the filter you provide.
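To illustrate why a filter such as `attention` can end up selecting nothing, here is a minimal Python sketch of name filtering. It assumes that `-k NAME` requires the whole kernel name to equal NAME, while `-k regex:EXPR` searches for EXPR anywhere in the name; that is my reading of the CLI help, so please check it against your ncu version. All kernel names in the list are hypothetical placeholders, not names taken from your application.

```python
import re

# Hypothetical kernel names, standing in for whatever the GPU libraries
# actually launch. A Python function name such as "attention" need not
# appear in any of them.
launched_kernels = [
    "hypothetical_elementwise_add",
    "hypothetical_reduce_sum",
    "hypothetical_gemm_128x64",
]


def select_kernels(names, kernel_filter):
    """Return the names a filter would select under the assumed semantics."""
    prefix = "regex:"
    if kernel_filter.startswith(prefix):
        # Assumed: the expression may match anywhere in the kernel name.
        pattern = re.compile(kernel_filter[len(prefix):])
        return [name for name in names if pattern.search(name)]
    # Assumed: without the prefix, the whole name has to be equal.
    return [name for name in names if name == kernel_filter]


print(select_kernels(launched_kernels, "attention"))        # []
print(select_kernels(launched_kernels, "regex:attention"))  # []
print(select_kernels(launched_kernels, "regex:reduce"))     # ['hypothetical_reduce_sum']
```

If none of the real kernel names contains the string you filter on, the result is the same "No kernels were profiled" warning, regardless of which Python function launched the work.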

It may also be that you need the flag “--target-processes all”, since Python may launch child processes. Try using that flag without any kernel filter to see if you at least encounter some kernels.

You could also try an Nsight Systems profile, which has much lower overhead and should report the names of the CUDA kernels that were launched, to see if that behaves as expected.
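Since the goal is to profile only part of the run, another option may be to mark that part in the script and let ncu collect only inside it. Below is a rough sketch, assuming CuPy's `cupy.cuda.profiler.start()` and `stop()` wrappers are available in your install and that ncu is launched with `--profile-from-start off`. The names `profile_region`, `cupy_profile_region` and `run_attention_step` are placeholders I made up, not names from your script.

```python
from contextlib import contextmanager


@contextmanager
def profile_region(start, stop):
    """Call start() before the wrapped block and stop() after it, even on error."""
    start()
    try:
        yield
    finally:
        stop()


def cupy_profile_region():
    """Region bounded by the CUDA profiler start/stop calls exposed by CuPy."""
    # Imported lazily so the generic helper above stays usable without a GPU.
    from cupy.cuda import profiler
    return profile_region(profiler.start, profiler.stop)


# Hypothetical usage inside the training loop:
#
#     with cupy_profile_region():
#         out = run_attention_step(batch)   # placeholder for the attention call
```

With something like this in place, adding `--profile-from-start off` to the ncu command should limit collection to kernels launched inside the `with` block, which may avoid profiling the whole run.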

Thank you for the reply.
I am using an NVIDIA A100-SXM4-40GB, and the CUDA version is build cuda_12.1.
I have profiled the code using Nsight Systems and could see the names of the various kernels.
I am now trying to profile the kernel “cupy_sum” using the following Nsight Compute command:

ncu --target-processes all -k cupy_sum -o cupy_sum_1 python3.9

but the error received is:

CuPy is available. Using CuPy for all computations.
train data sequences num = 29000
EN vocab length = 5959; DE vocab length = 7801
batch num = 907
==PROF== Connected to process 305653 (/apps/python/3.9-anaconda/bin/python3.9)
Traceback (most recent call last):
File “/home/hpc/iwia/iwia050h/numpy-transformer-master/transformer/”, line 297, in
File “/home/hpc/iwia/iwia050h/numpy-transformer-master/transformer/modules/”, line 19, in init
self.token_embedding = Embedding(src_vocab_size, d_model, data_type)
File “/home/hpc/iwia/iwia050h/numpy-transformer-master/transformer/layers/base/”, line 30, in init
File “/home/hpc/iwia/iwia050h/numpy-transformer-master/transformer/layers/base/”, line 37, in build
self.w = np.random.normal(0, pow(self.input_dim, -0.5), (self.input_dim, self.output_dim)).astype(self.data_type)
File “/home/hpc/iwia/iwia050h/.local/lib/python3.9/site-packages/cupy/random/”, line 501, in normal
return rs.normal(loc, scale, size, dtype)
File “/home/hpc/iwia/iwia050h/.local/lib/python3.9/site-packages/cupy/random/”, line 462, in normal
x = self._generate_normal(func, size, dtype, loc, scale)
File “/home/hpc/iwia/iwia050h/.local/lib/python3.9/site-packages/cupy/random/”, line 81, in _generate_normal
func(self._generator,, out.size, *args)
File “cupy_backends/cuda/libs/curand.pyx”, line 191, in cupy_backends.cuda.libs.curand.generateNormalDouble
File “cupy_backends/cuda/libs/curand.pyx”, line 200, in cupy_backends.cuda.libs.curand.generateNormalDouble
File “cupy_backends/cuda/libs/curand.pyx”, line 88, in cupy_backends.cuda.libs.curand.check_status
cupy_backends.cuda.libs.curand.CURANDError: CURAND_STATUS_LAUNCH_FAILURE
==PROF== Disconnected from process 305653
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.

Since I know the kernel name, how should I solve this error now?

It’s tough to say. From that output, it looks like the Python application itself is failing. As a sanity check, can you share the output of running “python3.9” and then running “ncu --target-processes all -k cupy_sum -o cupy_sum_1 python3.9” immediately after?
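If it helps, the two runs can be scripted so the exit codes are easy to compare. This is only a sketch: `run_and_report` is a helper name I made up, and the demo command at the end is a harmless stand-in rather than your actual script.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command, print its exit code, and return that code."""
    completed = subprocess.run(command)
    print(f"{' '.join(command)} -> exit code {completed.returncode}")
    return completed.returncode


# Harmless demo: run the current interpreter with a no-op program.
run_and_report([sys.executable, "-c", "pass"])

# Hypothetical usage, with your own script path in place of the placeholder:
#
#     run_and_report(["python3.9", "your_script.py"])
#     run_and_report(["ncu", "--target-processes", "all", "-k", "cupy_sum",
#                     "-o", "cupy_sum_1", "python3.9", "your_script.py"])
```

If the plain run exits with 0 and only the ncu run exits with 1, that would point at the interaction with the profiler rather than at the script on its own.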