I am using nsys profile to measure the performance of my program. My program can run on 2 nodes, with 2 A100 GPUs each. However, it is stuck at the Collecting Data stage. I have also tried running some simple programs and programs that only run on a single node with nsys profile, and they were able to pass the Collecting Data stage.
The command used to run it is "sh run.sh". The contents of run.sh are:
#!/bin/bash
# Launch configuration for a 2-node x 2-GPU DeepSpeed/Megatron pretraining run.
set -euo pipefail

# Rank of this node within the job (0 = master node).
NODE_RANK=0
NNODES=2
GPUS_PER_NODE=2
# Rendezvous port for torch.distributed. The original value (22) is the SSH
# port and collides with sshd on the master node; 29500 is the
# torch.distributed default and is normally free.
MASTER_PORT=29500
MASTER_ADDR=10.x.x.19
# ib ip = 10.4.9.19
# gpu6 ip = 10.254.46.19
# NOTE(review): DISTRIBUTED_ARGS is never passed to the deepspeed launcher
# below (it uses --hostfile/--num_nodes/--num_gpus instead) — confirm whether
# this variable is still needed.
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# Timestamp used to keep each run's TensorBoard logs separate;
# $(...) replaces the legacy backtick command substitution.
DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
TENSORBOARD_PATH=./tensorboard/$DATETIME
config_json="deepspeed.json"
VOCAB_FILE=vocab.txt

# Dataset shards (preprocessed Megatron "document context" binaries).
# NOTE(review): DATASET_12 through DATASET_18 are defined but never used in
# the mixture below — confirm whether they were meant to be included.
# DATASET_1="/nfs/data001/001.txt_document_context"
DATASET_2="xxx/002.txt_document_context"
DATASET_3="xxx/003.txt_document_context"
DATASET_4="xxx/004.txt_document_context"
DATASET_5="xxx/005.txt_document_context"
DATASET_6="xxx/006.txt_document_context"
DATASET_7="xxx/007.txt_document_context"
DATASET_8="xxx/008.txt_document_context"
DATASET_9="xxx/009.txt_document_context"
DATASET_10="xxx/010.txt_document_context"
DATASET_11="xxx/011.txt_document_context"
DATASET_12="xxx/012.txt_document_context"
DATASET_13="xxx/013.txt_document_context"
DATASET_14="xxx/014.txt_document_context"
DATASET_15="xxx/015.txt_document_context"
DATASET_16="xxx/016.txt_document_context"
DATASET_17="xxx/017.txt_document_context"
DATASET_18="xxx/018.txt_document_context"

# Weighted data mixture passed to --data-path: every shard gets weight 0.1.
# NOTE(review): the list begins with DATASET_11 rather than DATASET_1 (which
# is commented out above) — confirm this substitution is intentional.
DATA_PATH=""
for shard in "$DATASET_11" "$DATASET_2" "$DATASET_3" "$DATASET_4" \
             "$DATASET_5" "$DATASET_6" "$DATASET_7" "$DATASET_8" \
             "$DATASET_9" "$DATASET_10"; do
  DATA_PATH="${DATA_PATH} 0.1 ${shard}"
done
# Megatron-style model/training arguments. Built as an array for readability,
# then joined into the flat string the launcher invocation below expands.
# $VOCAB_FILE, $DATA_PATH and $TENSORBOARD_PATH are defined earlier in the
# script; they are deliberately left unquoted so the final command sees the
# same word-split argument list as before.
model_args=(
  --distributed-backend nccl
  --tokenizer-type EncDecTokenizer
  --optimizer lamb
  --lr-decay-style cosine
  --vocab-file $VOCAB_FILE
  --tensor-model-parallel-size 1
  --num-layers 40
  --train-samples 10024
  --hidden-size 3072
  --num-attention-heads 24
  --seq-length 2048
  --max-position-embeddings 2048
  --micro-batch-size 1
  --global-batch-size 8
  --lr-warmup-samples 1024
  --lr 5e-3
  --min-lr 1e-05
  --weight-decay 0.0005
  --adam-beta1 0.9
  --adam-beta2 0.95
  --log-interval 1
  --eval-iters -1
  --data-path $DATA_PATH
  --save-interval 2000
  --split 100,0,0
  --init-method-std 0.002
  --fp16
  --DDP-impl local
  --checkpoint-num-layers 1
  --log-num-zeros-in-grad
  --log-params-norm
  --tensorboard-dir $TENSORBOARD_PATH
  --tensorboard-log-interval 1
  --num-workers 8
  --pipeline-model-parallel-size 4
)
gpt_options="${model_args[*]}"
# Flags consumed by the DeepSpeed runtime: ZeRO stage 1 plus activation
# checkpointing, with the rest of the config read from ${config_json}.
ds_args=(
  --deepspeed
  --deepspeed_config ${config_json}
  --zero-stage 1
  --deepspeed-activation-checkpointing
)
deepspeed_options=" ${ds_args[*]}"
# Final argument string: Megatron options followed by DeepSpeed options.
full_options="${gpt_options} ${deepspeed_options}"
# Launch the trainer under Nsight Systems.
# NOTE(review): nsys wraps the *launcher* process here; with --hostfile the
# second node's ranks are started remotely (not as children of this nsys
# session), which presumably relates to the reported multi-node hang at
# "Collecting Data" — confirm against the Nsight Systems multi-node docs.
nsys profile -w true deepspeed --hostfile="xxx/hostfile" --num_nodes ${NNODES} --num_gpus ${GPUS_PER_NODE} ./pretrain_gpt.py ${full_options}