Nsight Systems missing CUDA HW trace in Python multiprocessing subprocesses when using mp.Pool with context manager

【Description】
When analyzing a Python multiprocessing program with Nsight Systems, I found that the generated .nsys-rep file shows different data for the main process and the subprocesses in the Nsight Systems GUI:

  • The main process correctly shows CUDA activity, including CUDA API traces and CUDA HW rows (e.g., kernel execution and memory transfers).
  • However, the subprocesses (created by Python multiprocessing) do not show any CUDA HW rows, even though CPU threads and function execution are visible. This prevents visibility into actual GPU kernel executions or GPU usage within the child processes.

This issue only occurs when the pool is managed through the with mp.Pool(...) context manager, whose exit calls terminate() on the pool.
If the pool is instead created with Pool() and shut down manually with close() and join() (avoiding terminate()), the CUDA activity in the subprocesses is captured and displayed correctly.
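
The workaround above can be sketched roughly as follows. This is only a minimal sketch, not a confirmed fix: cpu_placeholder_task and run_pool_with_explicit_shutdown are hypothetical names, and the placeholder worker does CPU work so the snippet runs without a GPU. Swap in the real CuPy task when profiling. It relies on the documented behavior that close() followed by join() lets the workers finish and exit on their own, whereas leaving a with mp.Pool(...) block calls terminate().

```python
import multiprocessing as mp
import os


def cpu_placeholder_task(pid, size=4):
    # Hypothetical stand-in for the real GPU task; replace with the CuPy work.
    checksum = sum(i * i for i in range(size))
    return pid, os.getpid(), checksum


def run_pool_with_explicit_shutdown(num_processes=5, size=4):
    # Use a spawn context so every worker starts as a fresh interpreter.
    ctx = mp.get_context("spawn")
    pool = ctx.Pool(processes=num_processes)
    try:
        args = [(i, size) for i in range(num_processes)]
        results = pool.starmap(cpu_placeholder_task, args)
    finally:
        # close() stops new submissions; join() waits for workers to exit
        # normally instead of being terminated.
        pool.close()
        pool.join()
    return results


if __name__ == "__main__":
    for pid, os_pid, checksum in run_pool_with_explicit_shutdown():
        print(f"task {pid} ran in process {os_pid}, checksum {checksum}")
```

Whether the missing CUDA HW rows are actually caused by the workers being terminated before their trace data is written out is an assumption here, not something confirmed in this thread.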

【Environment】

  • Nsight Systems: 2023.2.3.1004-33186433v0
  • OS: Ubuntu 20.04
  • GPU: A100 80GB PCIe
  • Driver: 535.183.01
  • CUDA: 12.2
  • Python: 3.10.16 + CuPy 13.3.0
  • Launch command: nsys profile -o trace_output python3 myscript.py

【Reproduction Script】

import cupy as cp
import numpy as np
import multiprocessing as mp
import os
import time
import functools

def gpu_task_with_stream(pid, size=10000):
    cp.cuda.Device(0).use() 
    stream = cp.cuda.Stream() 
    print(f"Process {pid} started on GPU {cp.cuda.Device(0).id} with stream {stream} time : {time.time()}")
    print(f"Process {pid} stream pointer: {hex(stream.ptr)}")
    print(f"Process ID: {os.getpid()}")

    with stream:  
        A = cp.random.rand(size, size, dtype=cp.float32)
        B = cp.random.rand(size, size, dtype=cp.float32)

        start_time = time.time()
        result = cp.dot(A, B)  
        stream.synchronize() 
        end_time = time.time()
    
    print(f"Process {pid} finished in {end_time - start_time:.5f} sec")

if __name__ == "__main__":
    main_start = time.time()
    mp.set_start_method("spawn")
    num_processes = 5

    size = 5000
    with mp.Pool(processes=num_processes) as pool:
        # Bind size so each worker runs with size=5000 rather than the default.
        task_with_params = functools.partial(gpu_task_with_stream, size=size)
        results = pool.starmap(task_with_params, [(i,) for i in range(num_processes)])

    print("All processes with streams completed.")
    print(time.time()-main_start)


Python is notorious for calling fork without exec. By default, the tool does not trace a fork without exec because the behavior is somewhat non-deterministic.

You can force that trace by using:

nsys profile --trace-fork-before-exec=true -o trace_output python3 myscript.py

or by explicitly calling exec().

I have a question: since I already call mp.set_start_method("spawn") in my code, it should theoretically avoid the fork-without-exec behavior. In that case, does the --trace-fork-before-exec option have any actual effect, or is it irrelevant?

It should be irrelevant.
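
If you want to double-check which start method is in effect, a small sketch like the one below may help. describe_start_method is a hypothetical helper, and the second value it returns encodes an assumption: that only "spawn" starts every child as a fresh interpreter, while "fork" and "forkserver" workers come from a fork with no exec afterwards.

```python
import multiprocessing as mp


def describe_start_method(ctx):
    # Ask the context which start method it uses.
    method = ctx.get_start_method()
    # Assumption: only "spawn" avoids fork-without-exec for the workers.
    avoids_bare_fork = method == "spawn"
    return method, avoids_bare_fork


if __name__ == "__main__":
    # An explicit context applies only to pools created from it and does not
    # depend on the global mp.set_start_method() call.
    ctx = mp.get_context("spawn")
    print(describe_start_method(ctx))
```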

@Guy_Sz is our Python expert.

Hi @john.yk.zhou can you please check if your issue persists on the latest version of Nsight Systems (2025.2.1)?
I tried to reproduce this issue, and I got 5 Python processes with CUDA API trace data.