Nsight Systems missing CUDA HW trace in Python multiprocessing subprocesses when using mp.Pool with context manager

john.yk.zhou · March 24, 2025, 4:52pm

【Description】
When profiling a Python multiprocessing program with Nsight Systems, I found that the generated .nsys-rep file shows different data for the main process and the subprocesses in the NSYS GUI:

The main process correctly shows CUDA activity, including CUDA API traces and CUDA HW rows (e.g., kernel execution and memory transfers).
However, the subprocesses (created by Python multiprocessing) do not show any CUDA HW rows, even though CPU threads and function execution are visible. This prevents visibility into actual GPU kernel executions or GPU usage within the child processes.

This issue only occurs when the process pool is managed by a "with mp.Pool(...)" context manager block.
If the pool is created with Pool() and shut down manually with close() and join() (avoiding terminate()), the CUDA activity within the subprocesses is captured and displayed correctly.
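
The manual shutdown described above can be sketched as follows. This is a minimal illustration, not the poster's script: run_pool_close_join and its context parameter are hypothetical names. The Python documentation states that leaving a "with mp.Pool(...)" block calls terminate(), whereas close() followed by join() lets the workers finish and exit normally. That an orderly exit is what allows the profiler to record each worker's CUDA HW data is an assumption consistent with the report, not a confirmed cause.

```python
import multiprocessing as mp


def run_pool_close_join(func, arg_tuples, processes=2, context=None):
    """Run func over arg_tuples in a pool, then shut down with close() + join().

    Hypothetical helper for illustration. `context` is an optional
    multiprocessing context (for example mp.get_context("spawn")).
    """
    ctx = context if context is not None else mp.get_context()
    pool = ctx.Pool(processes=processes)
    try:
        results = pool.starmap(func, arg_tuples)
    finally:
        pool.close()  # no more tasks; workers exit once their work is done
        pool.join()   # wait for every worker process to exit normally
    return results
```

In the script from this post, the helper would be called under the existing main guard, for example run_pool_close_join(gpu_task_with_stream, [(i,) for i in range(num_processes)], processes=num_processes).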

【Environment】

Nsight Systems: 2023.2.3.1004-33186433v0
OS: Ubuntu 20.04
GPU: A100 80GB PCIe
Driver: 535.183.01
CUDA: 12.2
Python: 3.10.16 + CuPy 13.3.0
Launch command: nsys profile -o trace_output python3 myscript.py

import cupy as cp
import numpy as np
import multiprocessing as mp
import os
import time
import functools

def gpu_task_with_stream(pid, size=10000):
    cp.cuda.Device(0).use() 
    stream = cp.cuda.Stream() 
    print(f"Process {pid} started on GPU {cp.cuda.Device(0).id} with stream {stream} time : {time.time()}")
    print(f"Process {pid} stream pointer: {hex(stream.ptr)}")
    print(f"Process ID: {os.getpid()}")

    with stream:  
        A = cp.random.rand(size, size, dtype=cp.float32)
        B = cp.random.rand(size, size, dtype=cp.float32)

        start_time = time.time()
        result = cp.dot(A, B)  
        stream.synchronize() 
        end_time = time.time()
    
    print(f"Process {pid} finished in {end_time - start_time:.5f} sec")

if __name__ == "__main__":
    main_start = time.time()
    mp.set_start_method("spawn")
    num_processes = 5  
    processes = []

    size = 5000
    with mp.Pool(processes=num_processes) as pool:
        # Bind size so each worker uses the matrix size set above instead of the default
        task_with_params = functools.partial(gpu_task_with_stream, size=size)
        results = pool.starmap(task_with_params, [(i,) for i in range(num_processes)])

    print("All processes with streams completed.")
    print(time.time()-main_start)

hwilper · March 25, 2025, 1:23pm

Python is notorious for calling fork without exec. By default, the tool does not trace a fork without exec because the behavior is somewhat nondeterministic.

You can force that trace by using:

or by explicitly calling exec().
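
One way to read "explicitly calling exec()" from Python is to start each worker as a new interpreter through the subprocess module, which launches a new program image instead of continuing in a forked copy of the parent. The sketch below is an illustration under that assumption. launch_workers and worker_code are hypothetical names, and whether this layout fits a given workload is not something this thread confirms.

```python
import subprocess
import sys


def launch_workers(num_workers, worker_code):
    """Start each worker as a fresh Python interpreter and collect its stdout.

    Hypothetical helper: worker_code is a Python source string that receives
    the worker index as sys.argv[1].
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", worker_code, str(i)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for i in range(num_workers)
    ]
    outputs = []
    for proc in procs:
        out, _ = proc.communicate()  # wait for the worker to exit
        if proc.returncode != 0:
            raise RuntimeError(f"worker exited with code {proc.returncode}")
        outputs.append(out.strip())
    return outputs


if __name__ == "__main__":
    code = "import sys; print(f'worker {sys.argv[1]} done')"
    print(launch_workers(3, code))
```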

john.yk.zhou · March 26, 2025, 11:26am

Launch command: nsys profile -o trace_output2 --trace-fork-before-exec=true python3 myscript.py

[Screenshot, 2025-03-26, 7:07 PM]

I have a question: since my code already calls mp.set_start_method("spawn"), it should in theory avoid fork-without-exec behavior. In that case, does the --trace-fork-before-exec option have any actual effect, or is it irrelevant?

hwilper · March 26, 2025, 5:05pm

It should be irrelevant.

@Guy_Sz is our python expert.
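
To confirm which start method is in effect, the standard multiprocessing API can be queried directly. With "spawn", the parent starts a fresh Python interpreter for each worker, which is why the fork-related option should not matter here. The sketch below is only a sanity check. start_method_summary is a hypothetical helper name.

```python
import multiprocessing as mp


def start_method_summary(requested="spawn"):
    """Return the start methods this platform offers and the one a context uses."""
    ctx = mp.get_context(requested)  # raises ValueError if the method is unavailable
    return {
        "available": mp.get_all_start_methods(),
        "selected": ctx.get_start_method(),
    }


if __name__ == "__main__":
    print(start_method_summary())
```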

Guy_Sz · April 3, 2025, 6:30pm

Hi @john.yk.zhou, can you please check whether your issue persists on the latest version of Nsight Systems (2025.2.1)?
I tried to reproduce this issue, and I got 5 Python processes with CUDA API trace data.
