Error on "Tutorial - Small Language Models (SLM)"

Hello. I got a "subprocess.CalledProcessError" while following your tutorial on Small Language Models (SLM) with a Jetson Orin Nano (8GB, with a 128GB SD card)
(Small LLM (SLM) - NVIDIA Jetson AI Lab).
I cloned and set up jetson-containers without any problem, but when I ran the command below as shown in the tutorial, the screen froze for a few minutes.

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

I received the following error message.

seongkyu@ubuntu:~/jetson-containers$ jetson-containers run $(autotag nano_llm) \
>   python3 -m nano_llm.chat --api=mlc \
>     --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
Namespace(disable=[''], output='/tmp/autotag', packages=['nano_llm'], prefer=['local', 'registry', 'build'], quiet=False, user='dustynv', verbose=False)
-- Finding compatible container image for ['nano_llm']
localuser:root being added to access control list
+ sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/seongkyu/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/video0 --device /dev/video1 dustynv/nano_llm:r35.4.1 python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
/usr/lib/python3/dist-packages/requests/__init__.py: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
config.json: 100%|███████████████████████████████| 740/740 [00:00<00:00, 198kB/s]
.gitattributes: 100%|████████████████████████| 1.52k/1.52k [00:00<00:00, 540kB/s]
added_tokens.json: 100%|██████████████████████| 21.0/21.0 [00:00<00:00, 1.26kB/s]
100%|████████████████████████████| 1.37k/1.37k [00:00<00:00, 37.8kB/s]
generation_config.json: 100%|███████████████████| 132/132 [00:00<00:00, 4.12kB/s]
special_tokens_map.json: 100%|██████████████████| 435/435 [00:00<00:00, 99.4kB/s]
trainer_state.json: 100%|███████████████████| 2.59k/2.59k [00:00<00:00, 1.59MB/s]
pytorch_model.bin.index.json: 100%|█████████| 26.8k/26.8k [00:00<00:00, 7.17MB/s]
tokenizer_config.json: 100%|█████████████████████| 726/726 [00:00<00:00, 401kB/s]
training_args.bin: 100%|████████████████████| 3.96k/3.96k [00:00<00:00, 1.50MB/s]
tokenizer.model: 100%|█████████████████████████| 500k/500k [00:00<00:00, 651kB/s]
test/temp0.0_num1.json: 100%|████████████████| 1.77M/1.77M [00:02<00:00, 617kB/s]
pytorch_model-00001-of-00002.bin: 100%|█████| 9.99G/9.99G [43:17<00:00, 3.84MB/s]
pytorch_model-00002-of-00002.bin: 100%|████████| 821M/821M [48:39<00:00, 281kB/s]
Fetching 14 files: 100%|████████████████████████| 14/14 [48:39<00:00, 208.56s/it]
05:26:19 | INFO | loading /data/models/huggingface/models--princeton-nlp--Sheared-LLaMA-2.7B-ShareGPT/snapshots/802be8903ec44f49a883915882868b479ecdcc3b with MLC
05:26:22 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Sheared-LLaMA-2.7B-ShareGPT --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/Sheared-LLaMA-2.7B-ShareGPT-ctx4096

Using path "/data/models/mlc/dist/models/Sheared-LLaMA-2.7B-ShareGPT" for model "Sheared-LLaMA-2.7B-ShareGPT"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.   0%|          | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|▌         | 2/197 [00:13<23:31,  7.24s/tensors]   | 1/327 [00:13<1:13:18, 13.49s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 29, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Sheared-LLaMA-2.7B-ShareGPT --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/Sheared-LLaMA-2.7B-ShareGPT-ctx4096 ' died with <Signals.SIGKILL: 9>.

How can I fix this error so that I can follow your Small Language Model tutorial?

I also got the same error on my Orin Nano. I didn't solve it, but the model checkpoint itself is >10GB and we only have 8GB of shared RAM, and SIGKILL (signal 9) indicates the OOM killer. Surprisingly, it didn't work even when I tried with 16GB of swap.
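As a rough sanity check (assuming the model really has ~2.7 billion parameters; the byte-per-parameter figures are the usual fp32/fp16/4-bit sizes, not anything measured from this run), the numbers line up with the log:

```shell
# Back-of-envelope model sizes for a 2.7B-parameter model:
# fp32 checkpoint on disk, fp16 once loaded, q4f16_ft after quantization.
python3 -c 'p = 2.7e9; print(p*4/1e9, "GB fp32,", p*2/1e9, "GB fp16,", p*0.5/1e9, "GB q4")'
# → 10.8 GB fp32, 5.4 GB fp16, 1.35 GB q4
```

The ~10.8GB fp32 estimate matches the two downloaded shards (9.99G + 821M), and the quantizer has to hold intermediate copies of those weights while converting, which is why even 8GB of RAM plus swap gets tight.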

Hi @ygoongood12, unfortunately it means that you ran out of memory - can you try mounting SWAP, disabling ZRAM, and disabling the desktop UI if needed?
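For reference, one way to do those three things on JetPack looks like this (the swap file path and size are only examples — put the file on whatever storage you have room on, and double-check the ZRAM service name on your release before disabling it):

```shell
# 1) Disable ZRAM (it carves compressed swap out of the same 8GB of RAM)
sudo systemctl disable nvzramconfig

# 2) Create and enable a 16GB disk-backed swap file (example path)
sudo fallocate -l 16G /var/16GB.swap
sudo chmod 600 /var/16GB.swap
sudo mkswap /var/16GB.swap
sudo swapon /var/16GB.swap
echo "/var/16GB.swap  none  swap  sw  0  0" | sudo tee -a /etc/fstab

# 3) Boot to console instead of the desktop UI to free RAM
sudo systemctl set-default multi-user.target
sudo reboot
```

You can restore the desktop later with `sudo systemctl set-default graphical.target`.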

Also, you can try adding --max-context-len=512 to the command line; that should further reduce the memory usage.
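Putting it together (assuming the tutorial's nano_llm.chat entrypoint, since the module name was garbled in the pasted command), the invocation with the reduced context length would be:

```shell
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT \
    --max-context-len=512
```

A smaller context length shrinks the KV cache the quantized model has to allocate, which is where much of the runtime memory goes.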


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.