Tracking distributed training with cupti-python and PyTorch

Hi everyone, I’ve been attempting to profile unified memory operations during distributed deep learning training.

However, I’m observing some odd behavior and would like some clarification.

This is my code:

import torch
import torch.distributed as dist

from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um

# Change the CUDA memory allocator used by PyTorch
managed_alloc()

dist.init_process_group(backend='nccl')

# Initialize cupti-python to record unified memory operations
setup_cupti_um()

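# Allocate a managed-memory tensor on the GPU and write to it; the first
# GPU touch should show up as a GPU page fault in the CUPTI records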
a_0 = torch.randn(1, device='cuda:0')
a_0 += 1

# Release cupti-python resources
free_cupti_um()
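
For context, managed_alloc() switches PyTorch over to a cudaMallocManaged-backed allocator via the pluggable-allocator API, so tensors live in unified memory. A minimal sketch of what it does is below; the shared-library path and the exported function names are placeholders rather than my exact helper:

import torch

def managed_alloc():
    # Load a small shared library exposing alloc/free functions built on
    # cudaMallocManaged / cudaFree, e.g. compiled from:
    #   extern "C" void* um_alloc(ssize_t size, int device, cudaStream_t stream) {
    #       void* ptr; cudaMallocManaged(&ptr, size); return ptr; }
    #   extern "C" void um_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    #       cudaFree(ptr); }
    allocator = torch.cuda.memory.CUDAPluggableAllocator(
        './managed_allocator.so', 'um_alloc', 'um_free')
    # Route all subsequent PyTorch CUDA allocations through the managed allocator
    torch.cuda.memory.change_current_allocator(allocator)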

When I run this script, I get two records:

UNIFIED_MEMORY_COUNTER [ 3507779023841907865, 0 ] duration -3507779023841907865, counter_kind GPU_PAGE_FAULT, value 1, address 139980466814976, src_id 0, dst_id 0, process_id 1820, flags WRITE
UNIFIED_MEMORY_COUNTER [ 1753960814830521887, 1753960814830606816 ] duration 84929, counter_kind GPU_PAGE_FAULT, value 1, address 140697726353408, src_id 0, dst_id 0, process_id 1819, flags WRITE

This is unexpected: I thought there would be only one record. The addresses (and process IDs) differ, so the second record doesn't look like a duplicate of the first.

Here are my questions:

  1. For a multi-process application, should cupti-python be initialized only once (in a single process), or once per process?
  • When I initialize cupti-python just once, the above code works as expected. However, during LLM training with multiple GPUs and processes (set up roughly as in the sketch after these questions), I see records only for GPU 0, which seems odd.
  2. How does CUPTI profile and trace in a multi-process environment?
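
For reference, each rank in the LLM training run is set up roughly like this (a sketch only; the launch command, variable names, and tensor work are illustrative, and setup_cupti_um() is currently called in every rank):

# launched with something like: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist

from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um

managed_alloc()                            # switch to the managed allocator first

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)          # each rank drives its own GPU
dist.init_process_group(backend="nccl")

setup_cupti_um()                           # currently initialized in every rank

x = torch.randn(1, device=f"cuda:{local_rank}")
x += 1                                     # write that should fault on that rank's GPU

free_cupti_um()
dist.destroy_process_group()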

If anyone can clarify these behaviors, I would really appreciate it!