Yes, when I add the “--target-processes all” flag and run the script, the log output is as follows:
==PROF== Target process 1266 terminated before first instrumented API call.
==PROF== Target process 1267 terminated before first instrumented API call.
==PROF== Target process 1269 terminated before first instrumented API call.
==PROF== Target process 1270 terminated before first instrumented API call.
==PROF== Target process 1271 terminated before first instrumented API call.
==PROF== Target process 1272 terminated before first instrumented API call.
==PROF== Target process 1274 terminated before first instrumented API call.
==PROF== Target process 1275 terminated before first instrumented API call.
After that it seems to hang for a long time, with no new log output.
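For context, here is a rough sketch of the kind of invocation involved; the launcher, process count, and script arguments are placeholders rather than my exact command:

# sketch only: ask ncu to follow every process the launcher spawns
ncu --target-processes all -o report python -m torch.distributed.launch --nproc_per_node=2 pretrain_gpt.py <training args>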
There is a “--devices” flag that lets you explicitly state which devices (by device ID) you want to enable profiling on. You can try some experiments with that to see if it helps. What type of system are you on? Can you share the output of “nvidia-smi”?
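For example, something along these lines; the device ID and the rest of the command are only a sketch:

# sketch only: restrict profiling to GPU 0
ncu --devices 0 --target-processes all -o report python -m torch.distributed.launch --nproc_per_node=2 pretrain_gpt.py <training args>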
Ubuntu 18.04.6 LTS
The CUDA version is 11.4 and the ncu version is 2021.2.2.0, inside a Docker container.
The Docker image is nvcr.io/nvidia/pytorch.
nvidia-smi output:
The “--devices” flag doesn’t help; it still hangs as before.
When I terminate the program, it outputs the following. I hope this information is helpful for solving the problem:
==PROF== Target process 53535 terminated before first instrumented API call.
==PROF== Target process 53536 terminated before first instrumented API call.
==PROF== Target process 53537 terminated before first instrumented API call.
==PROF== Target process 53538 terminated before first instrumented API call.
==PROF== Target process 53539 terminated before first instrumented API call.
==PROF== Target process 53540 terminated before first instrumented API call.
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53520 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53521 closing signal SIGINT
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/subprocess.py", line 1028, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1868, in _communicate
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pretrain_gpt.py", line 7, in <module>
    from megatron import get_args
  File "/home/ubt/Megatron-LM/megatron/__init__.py", line 14, in <module>
    from .initialize import initialize_megatron
  File "/home/ubt/Megatron-LM/megatron/initialize.py", line 13, in <module>
    from megatron import fused_kernels
  File "/home/ubt/Megatron-LM/megatron/fused_kernels/__init__.py", line 7, in <module>
    from torch.utils import cpp_extension
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 167, in <module>
    ROCM_HOME = _find_rocm_home()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 93, in _find_rocm_home
    hipcc, _ = pipe_hipcc.communicate()
  File "/opt/conda/lib/python3.8/subprocess.py", line 1039, in communicate
    self._wait(timeout=sigint_timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1800, in _wait
    time.sleep(delay)
KeyboardInterrupt
==WARNING== No kernels were profiled.
Can you try running without any of these flags: “--kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1”? It could be that one of these filters isn’t matching properly. Let’s rule that out first.
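That is, try the plain form below instead of the filtered one and see whether any kernels get profiled at all; the launcher and script arguments are placeholders:

# filtered run (current)
ncu --target-processes all --kernel-name ncclKernel_AllReduce_RING_LL_Sum_uint8_t --launch-skip 1 --launch-count 1 -o report python -m torch.distributed.launch --nproc_per_node=2 pretrain_gpt.py <training args>

# unfiltered run (suggested, to rule out the filters)
ncu --target-processes all -o report python -m torch.distributed.launch --nproc_per_node=2 pretrain_gpt.py <training args>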