Tracking distributed training with cupti-python and PyTorch

Hi everyone, I’ve been attempting to profile unified memory operations during distributed deep learning training.

However, I’m observing some odd behavior and would like some clarification.

This is my code:

import torch
import torch.distributed as dist

from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um

# Change the CUDA memory allocator used by PyTorch
managed_alloc()

dist.init_process_group(backend='nccl')

# Initialize cupti-python to record unified memory operations
setup_cupti_um()

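# Allocate a managed-memory tensor on the GPU and write to it; the first
# GPU touch should show up as a GPU page fault in the CUPTI records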
a_0 = torch.randn(1, device='cuda:0')
a_0 += 1

# Release cupti-python resources
free_cupti_um()
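
For context, managed_alloc() switches PyTorch over to a cudaMallocManaged-backed allocator via the pluggable-allocator API, so tensors live in unified memory. A minimal sketch of what it does is below; the shared-library path and the exported function names are placeholders rather than my exact helper:

import torch

def managed_alloc():
    # Load a small shared library exposing alloc/free functions built on
    # cudaMallocManaged / cudaFree, e.g. compiled from:
    #   extern "C" void* um_alloc(ssize_t size, int device, cudaStream_t stream) {
    #       void* ptr; cudaMallocManaged(&ptr, size); return ptr; }
    #   extern "C" void um_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    #       cudaFree(ptr); }
    allocator = torch.cuda.memory.CUDAPluggableAllocator(
        './managed_allocator.so', 'um_alloc', 'um_free')
    # Route all subsequent PyTorch CUDA allocations through the managed allocator
    torch.cuda.memory.change_current_allocator(allocator)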

When I run this script, I get two records:

UNIFIED_MEMORY_COUNTER [ 3507779023841907865, 0 ] duration -3507779023841907865, counter_kind GPU_PAGE_FAULT, value 1, address 139980466814976, src_id 0, dst_id 0, process_id 1820, flags WRITE
UNIFIED_MEMORY_COUNTER [ 1753960814830521887, 1753960814830606816 ] duration 84929, counter_kind GPU_PAGE_FAULT, value 1, address 140697726353408, src_id 0, dst_id 0, process_id 1819, flags WRITE

This is unexpected: I thought there would be only one record. The addresses (and process IDs) differ, so the second record doesn't look like a duplicate of the first.

Here are my questions:

  1. For a multi-process application, should cupti-python be initialized only once (in a single process), or once per process?
  • When I initialize cupti-python just once, the above code works as expected. However, during LLM training with multiple GPUs and processes (set up roughly as in the sketch after these questions), I see records only for GPU 0, which seems odd.
  2. How does CUPTI profile and trace in a multi-process environment?
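
For reference, each rank in the LLM training run is set up roughly like this (a sketch only; the launch command, variable names, and tensor work are illustrative, and setup_cupti_um() is currently called in every rank):

# launched with something like: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist

from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um

managed_alloc()                            # switch to the managed allocator first

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)          # each rank drives its own GPU
dist.init_process_group(backend="nccl")

setup_cupti_um()                           # currently initialized in every rank

x = torch.randn(1, device=f"cuda:{local_rank}")
x += 1                                     # write that should fault on that rank's GPU

free_cupti_um()
dist.destroy_process_group()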

If anyone can clarify these behaviors, I would really appreciate it!