Hi everyone, I’ve been attempting to profile unified memory operations during distributed deep learning training.
However, I’m observing some odd behavior and would like some clarification.
This is my code:
import torch
import torch.distributed as dist
from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um
# Change the CUDA memory allocator used by PyTorch
managed_alloc()
dist.init_process_group(backend='nccl')
# Initialize cupti-python to record unified memory operations
setup_cupti_um()
a_0 = torch.randn(1, device='cuda:0')
a_0 += 1
# Release cupti-python resources
free_cupti_um()
dist.destroy_process_group()
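For context, managed_alloc swaps PyTorch's CUDA allocator for one backed by cudaMallocManaged, roughly along the lines sketched below (simplified; the shared-library path and the exported symbol names are placeholders for what my helper actually loads):

import torch

def managed_alloc():
    # Load a small compiled library whose exported functions wrap
    # cudaMallocManaged / cudaFree with the signature PyTorch expects.
    # 'alloc_managed.so', 'managed_malloc' and 'managed_free' are placeholders.
    allocator = torch.cuda.memory.CUDAPluggableAllocator(
        './alloc_managed.so', 'managed_malloc', 'managed_free')
    # Route every subsequent torch.cuda allocation through this allocator.
    torch.cuda.memory.change_current_allocator(allocator)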
When I run the profiling script above, I get two records:
UNIFIED_MEMORY_COUNTER [ 3507779023841907865, 0 ] duration -3507779023841907865, counter_kind GPU_PAGE_FAULT, value 1, address 139980466814976, src_id 0, dst_id 0, process_id 1820, flags WRITE
UNIFIED_MEMORY_COUNTER [ 1753960814830521887, 1753960814830606816 ] duration 84929, counter_kind GPU_PAGE_FAULT, value 1, address 140697726353408, src_id 0, dst_id 0, process_id 1819, flags WRITE
This is unexpected: I thought there would be only one record, but the two records have different addresses and process IDs, so they don't look like duplicates.
Here are my questions:
- For a multi-process application, should I initialize cupti-python only once?
- When I initialize cupti-python just once, the above code works as expected. However, during LLM training with multiple GPUs and processes, I only see records for GPU 0, which seems odd (a stripped-down sketch of the per-rank setup I mean is shown after this list).
- How does CUPTI profile and trace in a multi-process environment?
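To make the second question concrete, here is a stripped-down sketch of the per-rank pattern I mean during multi-GPU training, with one process per GPU (e.g. launched via torchrun). The helper names are the same ones used in the snippet above, and the actual training loop is omitted:

import os
import torch
import torch.distributed as dist

from managed_alloc import managed_alloc
from helper_cupti_um import setup_cupti_um, free_cupti_um

managed_alloc()
dist.init_process_group(backend='nccl')

# torchrun sets LOCAL_RANK; each process drives its own GPU.
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Called once in every process, not just on rank 0.
setup_cupti_um()

x = torch.randn(1, device=f'cuda:{local_rank}')
x += 1

free_cupti_um()
dist.destroy_process_group()

The question is whether this per-process setup is the right way to get unified memory records from every rank, or whether CUPTI needs to be attached differently in a multi-process job.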
If anyone can clarify these behaviors, I would really appreciate it!