I tried NVIDIA Nsight Compute 2024.1 to trace the DRAM workload while Llama is launched, but ran into a segmentation fault:
On the first console:
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ls
docs EULA.txt extras host ncu ncu-ui sections target
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ./ncu --mode=launch python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
==PROF== Waiting for profiler to attach on ports 49152-49215.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE
is deprecated and will be removed in v5 of Transformers. Use HF_HOME
instead.
warnings.warn(
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 126280.12it/s]
14:02:15 | INFO | loading /data/models/huggingface/models--princeton-nlp--Sheared-LLaMA-2.7B-ShareGPT/snapshots/802be8903ec44f49a883915882868b479ecdcc3b with MLC
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
14:02:48 | INFO | device=cuda(0), name=Orin, compute=8.7, max_clocks=1300000, multiprocessors=16, max_thread_dims=[1024, 1024, 64], api_version=11040, driver_version=None
14:02:48 | INFO | loading Sheared-LLaMA-2.7B-ShareGPT from /data/models/mlc/dist/Sheared-LLaMA-2.7B-ShareGPT-ctx4096/Sheared-LLaMA-2.7B-ShareGPT-q4f16_ft/Sheared-LLaMA-2.7B-ShareGPT-q4f16_ft-cuda.so
Fatal Python error: Segmentation fault
Thread 0x0000fffed09c9f40 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/threading.py", line 558 in wait
File "/usr/local/lib/python3.8/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
Current thread 0x0000ffff9dde5980 (most recent call first):
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 110 in __init__
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71 in from_pretrained
File "/opt/NanoLLM/nano_llm/chat/main.py", line 29 in <module>
File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main
==ERROR== The application returned an error code (11).
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1#
On the second console:
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis --section=MemoryWorkloadAnalysis_Chart --section=MemoryWorkloadAnalysis_Tables -o report --hostname 127.0.0.1
==PROF== Finding attachable processes on host 127.0.0.1.
==PROF== Attaching to process '/usr/bin/python3.8' (308) on port 49152.
==PROF== Connected to process 308 (/usr/bin/python3.8)
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
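Per the warning above, I assume --target-processes all needs to be added, presumably on the launch side; a sketch of the adjusted launch command (untested on my side, since 2024.1 segfaults before any kernels run):

./ncu --mode=launch --target-processes all python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT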
I previously used ncu 2022, which had no segmentation fault issue, but it would suspend the Llama response; I created a topic about that here. Based on @veraj's suggestion there, I should use the latest version of ncu.
@veraj, would you please help me check this internally?
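For reference, once the segfault is resolved, the single-console run I would try next simply combines the flags from the two commands above (my assumption, not yet verified to work):

./ncu --target-processes all --section=MemoryWorkloadAnalysis --section=MemoryWorkloadAnalysis_Chart --section=MemoryWorkloadAnalysis_Tables -f -o report python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT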