Nsight Compute returns "No kernels were profiled" warning

Hi,

I’m running my application with

ncu --kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1 bash examples/pretrain_gpt_distributed_with_mp.sh

The return message shows:

==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I am using a V100 and running the script inside a Docker container.
The CUDA version is 11.4 and the ncu version is 2021.2.2.0.

Could you help me identify the issue?
Thanks!

When I profile a simple vector-addition code in the same environment, it works fine.
Some information is as follows:

ncu ./add_demo
==PROF== Connected to process 40693 (/home/ubt/cuda_test/add_demo)
==PROF== Profiling "matrix_add_2D" - 1: 0%....50%....100% - 20 passes
Success!
==PROF== Disconnected from process 40693
[40693] add_demo@127.0.0.1
matrix_add_2D(const unsigned int (*)[1024], const unsigned int (*)[1024], unsigned int (*)[1024], unsigned long, unsigned long), 2023-Jul-14 06:44:00, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 870.02
SM Frequency cycle/nsecond 1.29
Elapsed Cycles cycle 72319
Memory [%] % 56.86
DRAM Throughput % 24.15
Duration usecond 56.26
L1/TEX Cache Throughput % 81.58
L2 Cache Throughput % 53.48
SM Active Cycles cycle 67114.98
Compute (SM) [%] % 4.53
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Can you try adding the "--target-processes all" flag to your initial command to see if that helps?
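
For reference, the full command would then look roughly like this (a sketch; the kernel name and script path are copied from your original command):

ncu --target-processes all --kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1 bash examples/pretrain_gpt_distributed_with_mp.sh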

Yes, when I add the "--target-processes all" flag and run the script, the log output is as follows:

==PROF== Target process 1266 terminated before first instrumented API call.
==PROF== Target process 1267 terminated before first instrumented API call.
==PROF== Target process 1269 terminated before first instrumented API call.
==PROF== Target process 1270 terminated before first instrumented API call.
==PROF== Target process 1271 terminated before first instrumented API call.
==PROF== Target process 1272 terminated before first instrumented API call.
==PROF== Target process 1274 terminated before first instrumented API call.
==PROF== Target process 1275 terminated before first instrumented API call.

After that, it seems to hang for a long time with no further log output.

Any ideas?

If I change the configuration and set the number of GPUs in use to 1, ncu works fine. Is there any way to use ncu to profile when running on multiple GPUs?

There is a "--devices" flag that allows you to explicitly state which devices (via device IDs) you want to enable profiling on. You can try some experiments with this to see if it helps. What type of system are you on? Can you share the output of "nvidia-smi"?
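
For example, something along these lines would restrict profiling to GPU 0 (the device ID is only an illustration; adjust it and keep whichever other flags you need):

ncu --devices 0 --target-processes all bash examples/pretrain_gpt_distributed_with_mp.sh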

Ubuntu 18.04.6 LTS
The CUDA version is 11.4 and the ncu version is 2021.2.2.0 inside the Docker container.
Docker image is nvcr.io/nvidia/pytorch.
nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    58W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   39C    P0    45W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0    69W / 300W |  20629MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |  13421MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |  27901MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   41C    P0    56W / 300W |  14065MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |   9507MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The "--devices" flag doesn't help; it still blocks as before.
When I terminate the program, it outputs the following. I hope this information helps with diagnosing the problem.

==PROF== Target process 53535 terminated before first instrumented API call.
==PROF== Target process 53536 terminated before first instrumented API call.
==PROF== Target process 53537 terminated before first instrumented API call.
==PROF== Target process 53538 terminated before first instrumented API call.
==PROF== Target process 53539 terminated before first instrumented API call.
==PROF== Target process 53540 terminated before first instrumented API call.
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53520 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53521 closing signal SIGINT
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/subprocess.py", line 1028, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1868, in _communicate
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pretrain_gpt.py", line 7, in <module>
    from megatron import get_args
  File "/home/ubt/Megatron-LM/megatron/__init__.py", line 14, in <module>
    from .initialize import initialize_megatron
  File "/home/ubt/Megatron-LM/megatron/initialize.py", line 13, in <module>
    from megatron import fused_kernels
  File "/home/ubt/Megatron-LM/megatron/fused_kernels/__init__.py", line 7, in <module>
    from torch.utils import cpp_extension
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 167, in <module>
    ROCM_HOME = _find_rocm_home()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 93, in _find_rocm_home
    hipcc, _ = pipe_hipcc.communicate()
  File "/opt/conda/lib/python3.8/subprocess.py", line 1039, in communicate
    self._wait(timeout=sigint_timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1800, in _wait
    time.sleep(delay)
KeyboardInterrupt
==WARNING== No kernels were profiled.

Can you try running without any of these flags: "--kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1"? It could be that one of these filters isn't matching properly. Let's rule that out first.
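
In other words, keep only the process-following option, roughly like this (a sketch):

ncu --target-processes all bash examples/pretrain_gpt_distributed_with_mp.sh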