Nsight Compute returns "No kernels were profiled" warning

Hi,

I’m running my application with

ncu --kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1 bash examples/pretrain_gpt_distributed_with_mp.sh

The return message shows:

==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I am using a V100 and running the script inside a Docker container.
The CUDA version is 11.4 and the ncu version is 2021.2.2.0.

Could you help me identify the issue?
Thanks!

When I profile a simple vector-addition code in the same environment, it works fine.
Some information is as follows:

ncu ./add_demo
==PROF== Connected to process 40693 (/home/ubt/cuda_test/add_demo)
==PROF== Profiling "matrix_add_2D" - 1: 0%....50%....100% - 20 passes
Success!
==PROF== Disconnected from process 40693
[40693] add_demo@127.0.0.1
matrix_add_2D(const unsigned int (*)[1024], const unsigned int (*)[1024], unsigned int (*)[1024], unsigned long, unsigned long), 2023-Jul-14 06:44:00, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 870.02
SM Frequency cycle/nsecond 1.29
Elapsed Cycles cycle 72319
Memory [%] % 56.86
DRAM Throughput % 24.15
Duration usecond 56.26
L1/TEX Cache Throughput % 81.58
L2 Cache Throughput % 53.48
SM Active Cycles cycle 67114.98
Compute (SM) [%] % 4.53
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Can you try adding the "--target-processes all" flag to your initial command to see if that helps?
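
For reference, the full command would then look roughly like this (a sketch; the kernel name and script path are copied from your original command):

ncu --target-processes all --kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1 bash examples/pretrain_gpt_distributed_with_mp.sh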

Yes, when I add the "--target-processes all" flag and run the script, the log output is as follows:

==PROF== Target process 1266 terminated before first instrumented API call.
==PROF== Target process 1267 terminated before first instrumented API call.
==PROF== Target process 1269 terminated before first instrumented API call.
==PROF== Target process 1270 terminated before first instrumented API call.
==PROF== Target process 1271 terminated before first instrumented API call.
==PROF== Target process 1272 terminated before first instrumented API call.
==PROF== Target process 1274 terminated before first instrumented API call.
==PROF== Target process 1275 terminated before first instrumented API call.

After that, it seems to hang for a long time with no further log output.

Any ideas?

If I change the configuration and set the number of GPUs in use to 1, ncu works fine. Is there any way to use ncu to profile when running on multiple GPUs?

There is a "--devices" flag that allows you to explicitly state which devices (via device IDs) you want to enable profiling on. You can try some experiments with this to see if it helps. What type of system are you on? Can you share the output of "nvidia-smi"?
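
For example, something along these lines would restrict profiling to GPU 0 (the device ID is only an illustration; adjust it and keep whichever other flags you need):

ncu --devices 0 --target-processes all bash examples/pretrain_gpt_distributed_with_mp.sh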

Ubuntu 18.04.6 LTS
The CUDA version is 11.4 and the ncu version is 2021.2.2.0 inside the Docker container.
Docker image is nvcr.io/nvidia/pytorch.
nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    58W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   39C    P0    45W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0    69W / 300W |  20629MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |  13421MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |  27901MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   41C    P0    56W / 300W |  14065MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |   9507MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The "--devices" flag doesn't help; it still blocks as before.
When I terminate the program, it outputs the following. I hope this information helps with diagnosing the problem.

==PROF== Target process 53535 terminated before first instrumented API call.
==PROF== Target process 53536 terminated before first instrumented API call.
==PROF== Target process 53537 terminated before first instrumented API call.
==PROF== Target process 53538 terminated before first instrumented API call.
==PROF== Target process 53539 terminated before first instrumented API call.
==PROF== Target process 53540 terminated before first instrumented API call.
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53520 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53521 closing signal SIGINT
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/subprocess.py", line 1028, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1868, in _communicate
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pretrain_gpt.py", line 7, in <module>
    from megatron import get_args
  File "/home/ubt/Megatron-LM/megatron/__init__.py", line 14, in <module>
    from .initialize import initialize_megatron
  File "/home/ubt/Megatron-LM/megatron/initialize.py", line 13, in <module>
    from megatron import fused_kernels
  File "/home/ubt/Megatron-LM/megatron/fused_kernels/__init__.py", line 7, in <module>
    from torch.utils import cpp_extension
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 167, in <module>
    ROCM_HOME = _find_rocm_home()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 93, in _find_rocm_home
    hipcc, _ = pipe_hipcc.communicate()
  File "/opt/conda/lib/python3.8/subprocess.py", line 1039, in communicate
    self._wait(timeout=sigint_timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1800, in _wait
    time.sleep(delay)
KeyboardInterrupt
==WARNING== No kernels were profiled.

Can you try running without any of these flags: "--kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1"? It could be that one of these filters isn't matching properly. Let's rule that out first.
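
In other words, keep only the process-following option, roughly like this (a sketch):

ncu --target-processes all bash examples/pretrain_gpt_distributed_with_mp.sh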