Use NVIDIA DevTools Sidecar Injector with something wrong (multinode profiling)

tfmxtgx0394 · August 2, 2024, 1:36am

Hi,
I’m using the NVIDIA DevTools Sidecar Injector in a k8s environment, but something is wrong: I can’t get the output file in the target path. I would appreciate it if you could help me solve the problem.

I am running pretraining as a k8s job (the k8s environment meets the requirements).
Install:


helm install -f custom_values.yaml \
  devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.0.tgz


custom_values.yaml:

# If we don't specify the Nsight image, the 2024.2 version is used by default.
# Will use 2024.4 version which is planned to be released by 5/24/2024
devtoolBinariesImage:
  image: nvcr.io/nvidia/devtools/nsight-systems-cli:2024.4.1-ubuntu22.04

  imagePullPolicy: Always

profile:
  # CLI options: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-command-switches
  # delay and duration values in secs

  # Use %{} to include environment variables in the Nsight report filename

  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --force-overwrite true --trace nvtx,cuda  --delay 50 --duration 60 \
  -o /data/data/Megatron-LM/results/nsyslog/auto_{UID}.nsys-rep"

  injectionMatch: "^/usr/bin/python /usr/local/bin/torchrun.*$"
  #injectionMatch: "^.*torchrun.*$"

The labels in the job YAML file look like this:


labels:
    nvidia-devtools-sidecar-injector: enabled

Pretraining runs on a two-server cluster with four V100 GPUs in each server.

Nothing is found in the pod's mount directory or in the output path set in custom_values.yaml.

Thanks!

tfmxtgx0394 · August 2, 2024, 1:47am

Something else: I’ve had two problems with version 1.0.6.

1. The injector pod cannot pull its image:

Error from server (BadRequest): container "nvidia-devtools-sidecar-injector" in pod "nvidia-devtools-sidecar-injector-56789786d6-kkp5d" is waiting to start: trying and failing to pull image

The image name is wrong. Changing nvstaging to nvidia in the image name fixes it.

2. The second error:

Error creating: Internal error occurred: json: cannot unmarshal number into Go struct field EnvVar.spec.containers.env.value of type string

I can’t solve this one. My YAML file runs fine with 1.0.0, so I’m confused.

hwilper · August 2, 2024, 2:42pm

@mpopov can you respond to this issue please.

mpopov · August 2, 2024, 5:24pm

Hi tfmxtgx0394!
Regarding the initial issue, I suspect that the problem lies with injectionMatch. Could you apply the following ConfigMap to the namespace where the target (profiled) pods exist (replace [target-namespace] with the actual namespace), and then send me the injection.log file from inside the target pods? It is best to do this before starting or restarting the profiled pods.

kubectl create configmap nvidia-devtools-sidecar-injector-custom --from-literal=injectionconfig.yaml='{ "logOutut": "/mnt/injection.log" }'   --dry-run=client -o yaml | kubectl apply -n [target-namespace] -f -

Regarding:

Error creating: Internal error occurred: json: cannot unmarshal number into Go struct field EnvVar.spec.containers.env.value of type string

I haven’t been able to reproduce the issue yet. Could you please let me know if adding profile.env: [] and profile.defaultEnv: [] resolves the problem? For example, with your custom_values:

sidecarImage:
  image: nvcr.io/nvidia/devtools/nvidia-devtools-sidecar-injector:1.0.6

devtoolBinariesImage:
  image: nvcr.io/nvidia/devtools/nsight-systems-cli:2024.4.1-ubuntu22.04

profile:
  # CLI options: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-command-switches
  # delay and duration values in secs

  # Use %{} to include environment variables in the Nsight report filename

  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --force-overwrite true --trace nvtx,cuda  --delay 50 --duration 60 -o /data/data/Megatron-LM/results/nsyslog/auto_{UID}.nsys-rep"

  injectionMatch: "^/usr/bin/python /usr/local/bin/torchrun.*$"

  # New values
  defaultEnv: []
  env: []
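
For context on the error message itself: in the Kubernetes API, EnvVar.value is a string field, so an environment variable whose value reaches the API server as a bare JSON number is rejected with exactly this kind of unmarshal error. The thread does not identify which variable in the 1.0.6 chart was affected, so the sketch below only illustrates the general class of problem. The entry names (EXAMPLE_DELAY, EXAMPLE_MODE) and the helper function are hypothetical.

```python
import json

def non_string_env_values(env):
    """Return the names of env entries whose 'value' is not a string."""
    return [e["name"] for e in env
            if "value" in e and not isinstance(e["value"], str)]

# Hypothetical container env list as it might look after templating.
# An unquoted 50 in YAML becomes the JSON number 50, not the string "50".
rendered = json.loads(
    '[{"name": "EXAMPLE_DELAY", "value": 50},'
    ' {"name": "EXAMPLE_MODE", "value": "on"}]'
)

# The same list with the number quoted, which the API server accepts.
fixed = json.loads(
    '[{"name": "EXAMPLE_DELAY", "value": "50"},'
    ' {"name": "EXAMPLE_MODE", "value": "on"}]'
)

print(non_string_env_values(rendered))  # ['EXAMPLE_DELAY']
print(non_string_env_values(fixed))     # []
```

Setting env and defaultEnv to empty lists, as suggested above, sidesteps the problem by leaving no injected values to be mistyped.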

tfmxtgx0394 · August 6, 2024, 1:48am

Hi mpopov!
Thanks for your reply!
Absolutely! I have followed your recommendations and applied the suggested ConfigMap to the [target-namespace], ensuring it was in place before the profiled pods were started or restarted.

I have successfully retrieved the .nsys-rep file from the specified path and the injection.log file from within the target pod(s). The injection.log file (injection.log, 86.9 KB) is attached for your review.

I can confirm that the issue was indeed related to injectionMatch, as you suggested. Your guidance has been crucial in resolving this problem.
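
For anyone hitting the same symptom, here is a minimal sketch of how an anchored injectionMatch pattern can silently miss the target process. It compares the strict pattern from custom_values.yaml with the looser, commented-out one. The command lines are hypothetical examples (the real ones depend on the container image), and it assumes the injector's regular expressions behave like Python's re module for these simple patterns. The thread does not record which pattern was used in the end.

```python
import re

# Patterns taken from custom_values.yaml in this thread.
STRICT = r"^/usr/bin/python /usr/local/bin/torchrun.*$"
LOOSE = r"^.*torchrun.*$"

# Hypothetical command lines a training pod might run.
candidates = [
    "/usr/bin/python /usr/local/bin/torchrun --nnodes 2 pretrain_gpt.py",
    "/usr/bin/python3 /usr/local/bin/torchrun --nnodes 2 pretrain_gpt.py",
    "/usr/bin/python3.10 -u /usr/local/bin/torchrun --nnodes 2 pretrain_gpt.py",
]

def matches(pattern: str, cmdline: str) -> bool:
    """Return True if the pattern matches the full command line."""
    return re.search(pattern, cmdline) is not None

for cmd in candidates:
    print(f"strict={matches(STRICT, cmd)!s:5} loose={matches(LOOSE, cmd)!s:5} {cmd}")
```

Only the first command line matches the strict pattern. An interpreter named python3, or an extra flag between the interpreter and torchrun, is enough to prevent injection, in which case no report is written.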

Additionally, I have tested the configuration with profile.env: [] and profile.defaultEnv: [], and it has effectively addressed the issue. Your input has been invaluable in this process.

Should any further action be required, or should you need anything else from me, please do not hesitate to reach out. Your expertise and assistance have been greatly appreciated, and I look forward to any future advice you may offer.

Thank you once again for your help!
