Nsight Systems doesn't create a qdrep file

This is my first time using nsys. When I ran nsys on a single GPU using the Docker image I created, I confirmed that the results were generated successfully. However, it behaves strangely in a multi-GPU environment.

  • Ubuntu 18.04
  • Nsight Systems version 2020.3
  • The backend is using NCCL.
  • 3 × RTX 2080 Ti and 1 × Titan RTX (each node has 4 GPUs)
  • I am conducting DL training using the Parameter-Server method in a Kubernetes environment, and I am trying to collect GPU logs inside the Container.
  • My goal is to collect logs in a multi-node environment, but I’d be happy if it worked even on a single node.

When running the simplest command:

nsys profile -o /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_nsys.qdrep

The Kubernetes pod does not crash, but no .qdrep file is generated, and nsys stays in the “Processing Events…” state for more than an hour. The job runs fine without the nsys command.

When running with additional options like this:

nsys profile --trace=cuda,nvtx --duration=120 --delay=60 -o /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_nsys.qdrep

The Worker Pod terminates earlier than usual, while the Parameter Server Pod remains running. A qdrep file is generated, but it looks abnormal or corrupted.

What am I doing wrong? Here’s the job YAML file for the Kubernetes worker I created:

apiVersion: kubeflow.org/v1
kind: "TFJob"
metadata:
  name: id0-cifar10-densenet100-k12-sync-batch32
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    WORKER:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              command: ["/bin/sh", "-c"]
              args:
                - cd /tf_cnn_benchmarks/NVML;
                  make;
                  mkdir -p /result/id0_cifar10_densenet100_k12_sync_batch32;
                  JOB=`python /tf_cnn_benchmarks/job_name.py`;
                  top -d 0.1 -b | grep tf_cnn > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_cpu.txt &
                  echo "id0_cifar10_densenet100_k12_sync_batch32" > /tf_cnn_benchmarks/model.txt;
                  STARTTIME=`date "+%H:%M:%S.%N"`;
                  echo "$STARTTIME" > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_start_time.txt;
                  nsys profile -o /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_nsys.qdrep python /tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=parameter_server --model=densenet100_k12 --data_name=cifar10 --display_every=1 --batch_size=16 --cross_replica_sync=true --num_batches=1000 --num_warmup_batches=0 > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_log.txt;
                  ENDTIME=`date "+%H:%M:%S.%N"`;
                  echo "$ENDTIME" > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_end_time.txt
              ports:
                - containerPort: 2222
                  name: tfjob-port
              image: -
              imagePullPolicy: IfNotPresent
              resources:
                requests:
                  cpu: 10m
                  nvidia.com/gpu: 1
                  ephemeral-storage: "50Gi"
                limits:
                  cpu: 5
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /result
                  name: tfjob-data
          volumes:
            - name: tfjob-data
              persistentVolumeClaim:
                claimName: tfjob-data-volume-claim
          nodeSelector:
            twonode: worker

Thank you in advance for your help. I appreciate any advice or guidance you can provide!

First, that version of Nsys is really old; we moved from .qdrep to .nsys-rep result files some time ago. My first suggestion is to update to the 2024.5 version.

Secondly, how long did you run the collection for? Being stuck in processing for a long time suggests that you collected a massive amount of data, most likely because the collection ran for too long.
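As a quick sanity check, it can also help to confirm which nsys build the container actually picks up and to try a deliberately short capture before a full-length run. A rough sketch (the bracketed placeholder stands in for your original script arguments, and the output path is just an example):

nsys --version
nsys profile --trace=cuda,nvtx --duration=60 -o /result/short_trial.qdrep python /tf_cnn_benchmarks/tf_cnn_benchmarks.py [...original options...]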

Without using nsys, the job typically takes around 5 minutes to complete, but when I added the nsys command, it stayed in the “Processing Events…” state for over 120 minutes. There was no progress shown in a status bar or anything similar. Since I couldn’t wait any longer, I had to terminate it.

I have been considering various actions to resolve this issue. As I want to avoid any environment conflicts, updating to version 2024.5 would be my last resort if possible. Before that, would using the NVIDIA DevTools Sidecar Injector help in my situation?

Thank you for your response.

The “Processing Events” stage is the phase where Nsys takes the data from the run and converts it into the final results file. This step is very RAM-intensive.

May I ask how big the .qdstrm file is?

I’m not sure what environment conflicts you are worried about, but all versions of Nsys can be installed side-by-side, so there is no reason you couldn’t keep the current version on the system while adding the new version and trying it.
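For example (the paths below are only illustrative; each release installs into its own versioned directory, so adjust them to wherever your packages actually land):

/opt/nvidia/nsight-systems/<old version>/bin/nsys --version
/opt/nvidia/nsight-systems/<new version>/bin/nsys --version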

The DevTools Sidecar Injector would probably be the right tool; however, it only shipped with 2024.3 in March, and I don’t know whether it would work with your old version. I can’t think of any specific reason it wouldn’t, but 4 1/2 years is a long time in terms of software changes.

I’m going to loop in @mhallock in case he has thoughts about your second case above, where the worker pod terminates before the rest.

Also, are you putting all the data in one file, or a file per pod?

I monitored the path where the results are saved, and no .qdstrm file was generated—only the .qdrep file was created. In hindsight, this seems unusual.

In the previous case I mentioned (where it ran for more than 120 minutes), I checked the last modification time of the .qdrep file using stat, and it was about 20 minutes after the pod started, so it seems the file was not being written correctly. Additionally, although I don’t remember the exact details since the experiment was conducted a few days ago, the most baffling case produced a .qdrep file that was only 3 KB. When I changed parameters (duration, etc.) and ran more tests, the .qdrep file that initially appeared “normal” was around 7 MB, but in reality it wasn’t working correctly either.
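The modification-time check was along these lines, using GNU stat to print the last-modification timestamp:

stat -c '%y' /result/id0_cifar10_densenet100_k12_sync_batch32/*.qdrep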

For now, as you suggested earlier, I am in the process of updating the versions. I’m switching to the 24.08-tf2-py3 version of TensorFlow and also updating nsys to version 2024.5. I’m hoping for good results from this.

As for your second question, I’m not sure I understood it correctly. If you’re referring to the dataset, the training was data-parallel, but each pod had its own copy of the original data. Simply put, all the worker pods used the same Docker image, which included the entire dataset. The parameter server, however, used a different Docker image; I will check that part and get back to you.

Greetings,

As you’ve found out the hard way, the duration/delay combination of options for nsys profile is not very compatible with container workloads: after the duration completes, the child process being profiled is killed (the assumption being that it was only running for the sake of profiling anyway). There is a --kill=none flag you can add to prevent the app from being killed, but the process is reparented and the nsys profile process exits, which Kubernetes interprets as the pod exiting and tears it down even though the app is actually still running.
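For reference, that flag would slot into your original command roughly like this (a sketch only; the caveat above about the nsys profile process exiting still applies):

nsys profile --trace=cuda,nvtx --delay=60 --duration=120 --kill=none -o /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_nsys.qdrep python /tf_cnn_benchmarks/tf_cnn_benchmarks.py [...original options...]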

It is awkward, but here is a workaround approach:

nsys launch -t cuda,nvtx python [...script options...] & sleep 60 && nsys start -o [output name] && sleep 120 && nsys stop;
wait;
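Spelled out step by step, the same sequence reads (placeholders kept as-is):

# nsys launch starts the workload with profiling injection but does not
# begin collection; it is backgrounded so the same shell can drive start/stop.
nsys launch -t cuda,nvtx python [...script options...] &
sleep 60                      # let the job warm up before capturing
nsys start -o [output name]   # begin collection into the named report
sleep 120                     # capture window
nsys stop                     # stop collecting and write the report
wait                          # keep the shell (and the pod) alive until the app exits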

The process that actually writes the output is not the nsys profile process, so it may be getting killed before it finishes writing the output, which could explain the corrupt files you are seeing. I think modern versions of nsys wait for the output to settle before nsys profile exits, but I’m not sure whether the version you are using behaves differently.

As Holly mentioned, nsys is very self-contained; having multiple versions installed is not an issue. And yes, as you mentioned, the sidecar injector is certainly an option for providing the tooling at runtime instead of baking it into your image, but it still suffers from the same issue described above.

I believe Holly’s data question was to make sure that you do not have a conflict where your pods each try to write to the same file; assuming the $JOB value from python /tf_cnn_benchmarks/job_name.py is unique per replica, this is fine. If not, make sure each instance of nsys gets its own unique output path, as sketched below.
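One defensive pattern (a sketch; $(hostname) is just one option, since each pod gets a unique hostname):

OUT_DIR=/result/id0_cifar10_densenet100_k12_sync_batch32
# ${JOB:-$(hostname)} falls back to the pod's hostname if $JOB is empty,
# so every replica still writes to its own report file.
nsys start -o "${OUT_DIR}/profile_${JOB:-$(hostname)}.qdrep"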


Thank you for your feedback! Following your advice, I was able to successfully generate a .qdrep file using the command:

nsys launch -t cuda,nvtx python [...script options...] & sleep 60 && nsys start -o [output name] && sleep 120 && nsys stop;
wait;

(This was done with Nsight Systems version 2020.3.)

I have a small follow-up question. After extracting the results with the above command, when I check the GPU description in the Analysis Summary of Nsight Systems, it only shows a 2080 Ti. My system consists of 3 × RTX 2080 Ti and 1 × Titan RTX. Of course, I confirmed via nvidia-smi during job execution that all four GPUs were working.

Is there a way to obtain analysis results for all four GPUs? Or could it be that I made a mistake in entering the command?
Here’s the command I executed:

WORKER:
  replicas: 2
  template:
    spec:
      containers:
      - args:
        - cd /tf_cnn_benchmarks/NVML; make; JOB=`python /tf_cnn_benchmarks/job_name.py`;
          CONTROLLER_HOST=`python -c "import os, json; tf_config = json.loads(os.environ.get('TF_CONFIG')
          or '{}'); cluster_config = tf_config.get('cluster', {}); controller_host
          = cluster_config.get('controller'); print(','.join(controller_host))"`;
          mkdir -p /result/id0_cifar10_densenet100_k12_sync_batch32; top -d 0.1
          -b | grep tf_cnn > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_cpu.txt
          & echo "id0_cifar10_densenet100_k12_sync_batch32" > /tf_cnn_benchmarks/model.txt;
          STARTTIME=`date "+%H:%M:%S.%N"`; echo "$STARTTIME" > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_start_time.txt;
          nsys launch -t cuda,nvtx python /tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=distributed_all_reduce
          --model=densenet100_k12 --data_name=cifar10 --display_every=1 --batch_size=16
          --cross_replica_sync=true --num_batches=1000 --num_warmup_batches=0  --controller_host=${CONTROLLER_HOST}
          --all_reduce_spec=nccl/xring > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_log.txt & sleep 1 && nsys start -o id0_cifar10_densenet100_k12_sync_batch32_${JOB}_log.qdrep && sleep 600 && nsys stop;
          wait;
          ENDTIME=`date "+%H:%M:%S.%N"`; echo "$ENDTIME" > /result/id0_cifar10_densenet100_k12_sync_batch32/id0_cifar10_densenet100_k12_sync_batch32_${JOB}_end_time.txt
        command:
        - /bin/sh
        - -c

Your feedback has been incredibly helpful to me. I sincerely appreciate it.

As per your original spec, you are only requesting one GPU per pod from Kubernetes, so each report will only have visibility into the GPU that its pod was assigned.
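If you did want a single pod (and therefore a single report) to see all four devices, the resource request in the spec would need to change along these lines (a fragment only; with one replica per GPU, the original request of 1 is what you want):

resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4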
