Hi,
I'm using the NVIDIA DevTools Sidecar Injector in a k8s environment, but something seems to be wrong: no report file appears in the target path. I would appreciate it if you could help me solve the problem.
Pretraining runs as a k8s Job (the k8s environment meets the requirements).
custom_file.yaml:
# If we dont specify the Nsight image, 2024.2 version is used by default.
# Will use 2024.4 version which is planned to be released by 5/24/2024
devtoolBinariesImage:
  image: nvcr.io/nvidia/devtools/nsight-systems-cli:2024.4.1-ubuntu22.04
  imagePullPolicy: Always
profile:
  # CLI options: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-command-switches
  # delay and duration values in secs
  # Use %{} to include environment variables in the Nsight report filename
  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --force-overwrite true --trace nvtx,cuda --delay 50 --duration 60 \
    -o /data/data/Megatron-LM/results/nsyslog/auto_{UID}.nsys-rep"
  injectionMatch: "^/usr/bin/python /usr/local/bin/torchrun.*$"
  #injectionMatch: "^.*torchrun.*$"
The labels in my YAML file look like this:
labels:
  nvidia-devtools-sidecar-injector: enabled
Pretraining runs on a two-server cluster with four V100 GPUs in each server.
Nothing is found in the pod's mounted directory or in the output path set in custom_file.yaml.
Separately, I've had two problems with version 1.0.6:
1.
Error from server (BadRequest): container "nvidia-devtools-sidecar-injector" in pod "nvidia-devtools-sidecar-injector-56789786d6-kkp5d" is waiting to start: trying and failing to pull image
The image name is wrong, so the pull fails. Changing nvstaging to nvidia in the image name fixes it.
2.
Error creating: Internal error occurred: json: cannot unmarshal number into Go struct field EnvVar.spec.containers.env.value of type string
I can't solve this one.
The same YAML file runs fine with 1.0.0,
so I'm confused about it.
Hi tfmxtgx0394!
Regarding the initial issue, I suspect that the problem lies with injectionMatch. Could you apply the following ConfigMap to the namespace where the target (profiled) pod(s) exist (replace [target-namespace] with the actual namespace), and then send me the injection.log file from inside the target pod(s)? It is best to do this before starting or restarting the profiled pods.
Error creating: Internal error occurred: json: cannot unmarshal number into Go struct field EnvVar.spec.containers.env.value of type string
I haven’t been able to reproduce the issue yet. Could you please let me know if adding profile.env: [] and profile.defaultEnv: [] resolves the problem? For example, with your custom_values:
sidecarImage:
  image: nvcr.io/nvidia/devtools/nvidia-devtools-sidecar-injector:1.0.6
devtoolBinariesImage:
  image: nvcr.io/nvidia/devtools/nsight-systems-cli:2024.4.1-ubuntu22.04
profile:
  # CLI options: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-command-switches
  # delay and duration values in secs
  # Use %{} to include environment variables in the Nsight report filename
  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --force-overwrite true --trace nvtx,cuda --delay 50 --duration 60 -o /data/data/Megatron-LM/results/nsyslog/auto_{UID}.nsys-rep"
  injectionMatch: "^/usr/bin/python /usr/local/bin/torchrun.*$"
  # New values
  defaultEnv: []
  env: []
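The error message itself says that an env value reached the API server as a JSON number, while the Kubernetes EnvVar value field is a string. The thread does not identify which variable was responsible or where it came from, so the following is only a minimal sketch of how one might scan a pod spec for such entries. The function name, the container name, and the variable names are hypothetical and are not part of the injector.

```python
import json


def non_string_env_values(pod_spec: dict) -> list:
    """Return (container name, variable name, value) for env values that are not strings."""
    problems = []
    for container in pod_spec.get("containers", []):
        for var in container.get("env", []):
            value = var.get("value")
            # Entries that use valueFrom have no "value" key and are skipped.
            if value is not None and not isinstance(value, str):
                problems.append((container.get("name"), var.get("name"), value))
    return problems


# Hypothetical pod spec, for illustration only.
EXAMPLE_SPEC = json.loads("""
{
  "containers": [
    {
      "name": "trainer",
      "env": [
        {"name": "MASTER_PORT", "value": 29500},
        {"name": "NNODES", "value": "2"}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    # Only the unquoted number is reported; the quoted "2" is a valid string.
    print(non_string_env_values(EXAMPLE_SPEC))
```

In a YAML manifest the usual fix for such an entry is to quote the value, for example value: "29500", so that it is serialized as a string.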
Hi mpopov!
Thanks for your reply!
Absolutely! I have followed your recommendations and applied the suggested ConfigMap to the [target-namespace], ensuring it was in place before the profiled pods were started or restarted.
I have successfully retrieved the .nsys-rep file from the specified path and the injection.log file from within the target pod(s). The injection.log file (injection.log, 86.9 KB) has been sent to you for review.
I can confirm that the issue was indeed related to injectionMatch, as you suggested. Your guidance has been crucial in resolving this problem.
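For anyone hitting the same symptom, here is a rough sketch of why a strict injectionMatch can silently miss the process. It assumes the pattern is tested against the full command line of each process and that Python's re module behaves like the injector's regex engine for these simple patterns; neither is confirmed in this thread. The command lines are hypothetical, since the thread does not show the real one.

```python
import re

# Patterns taken from the custom values file in this thread.
STRICT = r"^/usr/bin/python /usr/local/bin/torchrun.*$"
LOOSE = r"^.*torchrun.*$"

# Hypothetical command lines, for illustration only.
COMMANDS = [
    "/usr/bin/python /usr/local/bin/torchrun --nproc_per_node 4 pretrain_gpt.py",
    "/usr/bin/python3 /usr/local/bin/torchrun --nproc_per_node 4 pretrain_gpt.py",
    "/bin/bash run_pretrain.sh",
]


def matches(pattern: str, command: str) -> bool:
    """Return True if the regular expression matches the command line."""
    return re.search(pattern, command) is not None


if __name__ == "__main__":
    for cmd in COMMANDS:
        # A one-character difference in the interpreter path defeats STRICT.
        print(matches(STRICT, cmd), matches(LOOSE, cmd), cmd)
```

Comparing the actual command line of the training process against the pattern in this way is a quick check before restarting the pods. The looser pattern is more forgiving, but it also matches any other process whose command line contains torchrun.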
Additionally, I have tested the configuration with profile.env: [] and profile.defaultEnv: [], and it has effectively addressed the issue. Your input has been invaluable in this process.
Should any further action be required, or if you need any additional support, please do not hesitate to reach out. Your expertise and assistance have been greatly appreciated, and I look forward to any future advice you may offer.