I’d like to measure the waiting time of each kernel launched in a subprocess (for analyzing MPS).
Nsight Systems works fine when I launch the kernel directly in the main process.
However, subprocesses launched through multiprocessing in PyTorch (tried 1.5, 1.13, and 2.0) show no CUDA calls in Nsight Systems, even though nvidia-smi confirms those subprocesses are occupying GPU memory. Similarly, a simple matrix-addition kernel compiled with nvcc and launched from a subprocess created by fork also shows no CUDA calls in Nsight Systems.
The issue is easily reproducible and appears unrelated to whether MPS is enabled.
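For reference, a minimal stdlib-only sketch of the reproduction pattern (the GPU work itself is replaced with a placeholder, and names like `dummy_job` are illustrative, not from the original reproducer):

```python
import multiprocessing as mp

def dummy_job(name, q):
    # Placeholder for the GPU work: in the real reproducer this is where
    # CUDA kernels are launched (e.g. a PyTorch forward pass), and these
    # are the calls that do not appear in the Nsight Systems timeline.
    q.put(f"{name} done")

# Force the fork start method, matching the report (the Linux default).
ctx = mp.get_context("fork")
q = ctx.Queue()
procs = [ctx.Process(target=dummy_job, args=(j, q)) for j in ("resnet50",)]
for p in procs:
    p.start()
results = [q.get() for _ in procs]  # drain the queue before join()
for p in procs:
    p.join()
print(results)
```

Profiling this script with `nsys profile` reproduces the symptom once real CUDA work is placed inside `dummy_job`.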
Version Information:
ncu:
Version 2023.1.1.0 (build 32678585) (public-release)
nvcc:
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
python: Python 3.8.16
Nsight Systems launch:
By UI:
Result:
No CUDA events collected. Does the process use CUDA?
By CLI:
ncu --target-processes all -o profile python3 driver.py --job resnet50
Result:
==WARNING== No kernels were profiled
Launch Code
for idx, job in enumerate(job_list):
    p = Process(target=dummy_launcher,
                args=(job_dict[job], float(wait_dict[job]), barrier))
    p.start()
    assert p is not None  # during testing, p was never None
    childprocess_list.append(p)

for p in childprocess_list:
    p.join()
For the CLI: that’s an Nsight Compute command line. I would expect something like:
nsys profile -o profile python3 driver.py --job resnet50
Process-tree-wide profiling is on by default from the CLI.
After executing that command and opening the resulting report, no CUDA API calls are shown.
use multiprocess.nsys-rep.zip (7.4 MB)
If I launch the job directly (not via multiprocessing), the CUDA API calls are shown.
not-multiprocess.nsys-rep.zip (7.8 MB)
The profile command is the same: nsys profile -t cuda -o profile python3 driver.py --job resnet50
Are you forking w/o exec?
--trace-fork-before-exec
Possible values: true, false
Default: false
If true, trace any child process after fork and before it calls one of the exec functions. Beware: tracing in this interval relies on undefined behavior and might cause your application to crash or deadlock. Note: this option is only available on Linux target platforms.
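To illustrate the interval this flag covers, here is a minimal stdlib sketch of a fork followed by exec; any work the child does before `execvp` falls in the window that `--trace-fork-before-exec=true` attempts to trace:

```python
import os
import sys

pid = os.fork()
if pid == 0:
    # Child: anything executed here, between fork() and exec*(), is the
    # interval that --trace-fork-before-exec=true attempts to trace.
    os.execvp(sys.executable,
              [sys.executable, "-c", "print('child after exec')"])
else:
    # Parent: wait for the child and check that it exited cleanly.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status)
    print("child exit code:", exit_code)
```

Note that the PyTorch reproducer forks without ever calling exec, so the child never leaves this interval, which is why the flag is relevant here.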
That does not work either.
profile3.nsys-rep.zip (7.5 MB)
I’ve also found another odd problem: the forked process cannot use the CUDA backend at all when running under Nsight Systems. Is there something special about how Nsight Systems profiles that could cause this?
@afroger Antoine, can you do a deeper dive on this one?
@afroger @hwilper @grad-first-change-world
I’m also facing a similar issue: Python subprocesses launched via fork perform some memcpy transfers along with some kernel operations, but nsys does not report them.
By any chance, was this ever resolved?
nsys command being used:
nsys profile --trace cuda -o arxiv_gpu_shm --force-overwrite true --trace-fork-before-exec true python node_clas.py --dataset ogbn-arxiv --epoch 1
I have tried the two nsys versions below:
NVIDIA Nsight Systems version 2024.2.1.106-242134037904v0
NVIDIA Nsight Systems version 2022.1.3.3-1c7b5f7
Attached file for reference:
arxiv_gpu_shm.nsys-rep.zip (489.4 KB)