OpenVLA issue - died with <Signals.SIGKILL: 9>

Hi,

I am trying OpenVLA from the following link on a Jetson Orin NX 16GB with a 512GB SSD.

After running the following command:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.vision.vla --api mlc \
    --model openvla/openvla-7b \
    --quantization q4f16_ft \
    --dataset dusty-nv/bridge_orig_ep100 \
    --dataset-type rlds \
    --max-episodes 10 \
    --save-stats /data/benchmarks/openvla_bridge_int4.json

I got the error log below. Could you please suggest how to solve it? Thanks.

Traceback (most recent call last): | 1/327 [00:03<16:47, 3.09s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/vision/vla.py", line 446, in <module>
vla_process_dataset(**{**vars(args), 'dataset': dataset})
File "/opt/NanoLLM/nano_llm/vision/vla.py", line 296, in vla_process_dataset
model = NanoLLM.from_pretrained(model, **kwargs)
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/openvla-7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/openvla-7b/ctx4096 --use-safetensors' died with <Signals.SIGKILL: 9>.

*** Please note that this reply is generated by LLM automatically ***

The error you’re encountering occurs during the MLC model quantization step, specifically when building the model with the mlc_llm.build command. The error message indicates that the subprocess python3 -m mlc_llm.build ... was killed by the operating system with SIGKILL (signal 9) rather than exiting with an error code of its own.

To troubleshoot this issue, let’s break down the possible causes:

  1. MLC Model Quantization: The error occurs during the model quantization process, a critical step in optimizing the model for deployment on the Jetson Orin NX. The mlc_llm.build command builds the quantized model, but it is being killed partway through.
  2. CUDA and cuDNN: The failing command includes --target cuda and --use-cuda-graph, so the issue might be related to the CUDA or cuDNN installation on your Jetson Orin NX. Ensure that the correct versions of CUDA and cuDNN are installed and configured properly.
  3. Model Path and Configuration: mlc_llm.build takes several arguments, including the model path, quantization method, and target device (in this case, cuda). Verify that the model path is correct and that the quantization method (q4f16_ft) is supported.
  4. Subprocess Error: The subprocess module raises a CalledProcessError because the mlc_llm.build child process was terminated by a signal. A SIGKILL in particular usually comes from the kernel (for example, the out-of-memory killer) rather than from a problem with the command line itself.
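
Incidentally, subprocess already tells you the difference between a failing command and a killed one: a child terminated by signal N is reported with a negative return code -N. A minimal sketch (the self-killing child below is purely illustrative, standing in for the OOM-killed mlc_llm.build process):

```python
import signal
import subprocess

# A child that kills itself with SIGKILL, mimicking what the kernel
# OOM killer does to the real mlc_llm.build process.
cmd = "python3 -c 'import os, signal; os.kill(os.getpid(), signal.SIGKILL)'"

try:
    subprocess.run(cmd, shell=True, executable="/bin/bash", check=True)
except subprocess.CalledProcessError as e:
    # A negative returncode means "terminated by signal -returncode".
    print("killed by signal:", -e.returncode)  # 9 == signal.SIGKILL
```

So the <Signals.SIGKILL: 9> in your log (returncode -9) points at an external kill, not at a bug inside mlc_llm.build.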

To resolve this issue, you can try the following:

  1. Check CUDA and cuDNN versions: Ensure that the correct versions of CUDA and cuDNN are installed on your Jetson Orin NX. You can check them with the nvcc --version and nvidia-smi commands.
  2. Verify the model path and configuration: Double-check that the model path is correct and that the quantization method (q4f16_ft) is supported.
  3. Run the mlc_llm.build command manually: Try running the python3 -m mlc_llm.build ... command from the traceback directly to see if you can reproduce the error. This will help you identify whether the issue is specific to the model or to the way it is executed as a subprocess.
  4. Check the MLC-LLM documentation: Consult the documentation to ensure that you are using the correct arguments and options for mlc_llm.build.
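
Because a SIGKILL at this stage is most often the kernel's out-of-memory killer, it is also worth checking how much memory is actually free before retrying the build. A small Linux-only sketch; the 10 GiB headroom threshold is a rough assumption for illustration, not a measured requirement of openvla-7b:

```python
def mem_available_gb(path="/proc/meminfo"):
    """Return MemAvailable from /proc/meminfo in GiB."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # value is reported in kB
                return kb / (1024 * 1024)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

if __name__ == "__main__":
    avail = mem_available_gb()
    print(f"MemAvailable: {avail:.1f} GiB")
    # Quantizing a 7B model can transiently need far more than this;
    # on a 16GB board, swap (e.g. a swapfile on the SSD) often helps.
    if avail < 10:
        print("Low memory: consider freeing RAM or adding swap "
              "before rerunning mlc_llm.build")
```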

If none of these suggestions help, please provide more details about your environment, including the CUDA and cuDNN versions, and the exact command you’re running. This will help me provide a more specific solution to your problem.


Hi,

Usually, the SIGKILL is caused by running out of memory.
Could you monitor the system memory usage with the command below to confirm?

$ sudo tegrastats
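
If tegrastats is not convenient inside the container, you can confirm the same thing by polling /proc/meminfo while the build runs. A minimal sketch (the sampling interval and output format are arbitrary choices):

```python
import time

def sample_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of integer kB values."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # first field is the kB count
    return info

def monitor(seconds=5, interval=1.0):
    """Print total/available RAM once per interval while the build runs."""
    for _ in range(int(seconds / interval)):
        m = sample_meminfo()
        total = m["MemTotal"] / 1024 / 1024
        avail = m["MemAvailable"] / 1024 / 1024
        print(f"RAM {avail:.1f} / {total:.1f} GiB available")
        time.sleep(interval)

if __name__ == "__main__":
    monitor(seconds=3)
```

If available memory drops toward zero just before the SIGKILL, the OOM killer is confirmed as the cause.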

Thanks.
