Segmentation fault

I tried NVIDIA Nsight Compute 2024.1 to trace the DRAM workload while Llama is running, but hit a segmentation fault:

on the first console:

root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ls
docs EULA.txt extras host ncu ncu-ui sections target
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ./ncu --mode=launch python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
==PROF== Waiting for profiler to attach on ports 49152-49215.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 126280.12it/s]
14:02:15 | INFO | loading /data/models/huggingface/models--princeton-nlp--Sheared-LLaMA-2.7B-ShareGPT/snapshots/802be8903ec44f49a883915882868b479ecdcc3b with MLC
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

14:02:48 | INFO | device=cuda(0), name=Orin, compute=8.7, max_clocks=1300000, multiprocessors=16, max_thread_dims=[1024, 1024, 64], api_version=11040, driver_version=None
14:02:48 | INFO | loading Sheared-LLaMA-2.7B-ShareGPT from /data/models/mlc/dist/Sheared-LLaMA-2.7B-ShareGPT-ctx4096/Sheared-LLaMA-2.7B-ShareGPT-q4f16_ft/Sheared-LLaMA-2.7B-ShareGPT-q4f16_ft-cuda.so
Fatal Python error: Segmentation fault

Thread 0x0000fffed09c9f40 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/threading.py", line 558 in wait
File "/usr/local/lib/python3.8/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x0000ffff9dde5980 (most recent call first):
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 110 in __init__
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71 in from_pretrained
File "/opt/NanoLLM/nano_llm/chat/main.py", line 29 in <module>
File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main
==ERROR== The application returned an error code (11).
root@orin1:/data/NVIDIA-Nsight-Compute-2024.1#

in the second console:

root@orin1:/data/NVIDIA-Nsight-Compute-2024.1# ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis --section=MemoryWorkloadAnalysis_Chart --section=MemoryWorkloadAnalysis_Tables -o report --hostname 127.0.0.1
==PROF== Finding attachable processes on host 127.0.0.1.
==PROF== Attaching to process ‘/usr/bin/python3.8’ (308) on port 49152.
==PROF== Connected to process 308 (/usr/bin/python3.8)
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
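
FWIW, the last warning mentions ncu's --target-processes option. I assume (not verified) the launch command could also carry it, e.g.:

./ncu --mode=launch --target-processes all python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

That said, in this run no kernels were profiled simply because the target process crashed before any GPU work was launched.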

I previously used ncu 2022, which had no segmentation fault issue, but ncu 2022 suspends the Llama response; I have created a topic about that here.

Based on @veraj's suggestion, I should use the latest version of ncu.

@veraj would you please help me check this internally?

Hi, @MicroSW

I would like to help check.
Does the segmentation fault happen with other simple CUDA samples?
If not, can you guide me through setting up the application?
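
For that sanity check, any small sample works; a hedged example (vectorAdd here is just a placeholder for whichever CUDA sample you have built on the device):

./ncu --section=MemoryWorkloadAnalysis ./vectorAdd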

@veraj
It is this application: Small LLM (SLM) - NVIDIA Jetson AI Lab

The segmentation fault is related to Nsight Compute; if I run only the application, there is no issue at all. Also, with ncu 2022 there is no segmentation fault.

Thanks! I will check internally and let you know if there is any update.

@veraj do you have any update on this case? If you need further information from my end for your investigation, let me know. Thanks very much for your support!

Sorry, no update yet. I will try to give you a response this week.

Also, can you tell me which CUDA version is installed on your device?
When you use Nsight 2024.1, did you install a compat package?

Hi @veraj
Thanks, I am using 35.5.0; I will check which CUDA version it is.

BTW, would you please try the commands I used:

launch command:

/opt/nvidia/nsight-compute/2024.1/ncu --mode=launch python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

attach command:

ncu --mode=attach --hostname 127.0.0.1

Please see if it works and whether the Llama response is no longer suspended by ncu.

Hi, @MicroSW

We can successfully profile nano_llm with the configs below.

  • Orin
  • JetPack 6.0 (ships with CUDA 12.0 by default, and the default Nsight Compute is 2023.2.2.0; don't use these defaults)
  • CUDA 12.4 compat library (this may be the reason for the segmentation fault; you need to get the compat lib and add it to LD_LIBRARY_PATH, e.g. apt install cuda-compat-12-4; see the example commands after this list)
  • Nsight Compute 2024.1 (we mount it from the host side; you can either mount or install it)
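
A minimal sketch of that compat setup on the target, assuming the default install location of the cuda-compat-12-4 package:

sudo apt install cuda-compat-12-4
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/compat:$LD_LIBRARY_PATH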

See the launch (screenshot):

See the attach (screenshot):

Hi @veraj
Thanks so much, I will try it later. Can you get the response from the LM? Is there a delay in the response?

Hi @veraj

I tried it: there is no segmentation fault, but there is a new error, as shown below.

Would you please help check how I can trace the memory workload with ncu while Llama is executing?

I bisect-tested ncu versions: since 2023.2.2.3, ncu has this issue. Versions before that do not, but they block Llama execution. Would you check how I can use ncu while the application is executing?

root@orin4:/data/nsight_compute-linux-aarch64-2023.2.2.3-archive/nsight-compute/2023.2.2# ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis
==PROF== Finding attachable processes on host 127.0.0.1.
==WARNING== No processes found on 127.0.0.1 (ports 49152-49215).
==ERROR== No attachable process found.
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
root@orin4:/data/nsight_compute-linux-aarch64-2023.2.2.3-archive/nsight-compute/2023.2.2# ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis
==PROF== Finding attachable processes on host 127.0.0.1.
==PROF== Attaching to process ‘/usr/bin/python3.8’ (1453) on port 49153.
==PROF== Connected to process 1453 (/usr/bin/python3.8)
==PROF== Profiling “fused_fused_decode1_take1_ker…” - 0: 0%…50%…100% - 1 pass

==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==PROF== Received signal - ending profile session.
==ERROR== An error occurred while trying to profile.
[1453] python3.8@127.0.0.1
fused_fused_decode1_take1_kernel (48, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %      (!) nan
Max Bandwidth                         %      (!) nan
L1/TEX Hit Rate                       %      (!) nan
L2 Compression Success Rate           %      (!) nan
L2 Compression Ratio                             nan
L2 Hit Rate                           %      (!) nan
Mem Pipes Busy                        %      (!) nan
--------------------------- ----------- ------------

Hi, @MicroSW

We can successfully get the Memory Workload Analysis result as well, and the Llama execution does not hang.

root@ubuntu:/# LD_LIBRARY_PATH=/usr/local/cuda-12.4/compat:/usr/local/lib/python3.10/dist-packages/tvm /NsightCompute/2024.1.0/ncu --mode attach -f --section=MemoryWorkloadAnalysis -c 5 --hostname 127.0.0.1
==PROF== Finding attachable processes on host 127.0.0.1.
==PROF== Attaching to process ‘/usr/bin/python3.10’ (398) on port 49154.
==PROF== Connected to process 398 (/usr/bin/python3.10)
==PROF== Profiling “fused_fused_decode1_take_kernel” - 0 (1/5): 0%…50%…100% - 7 passes
==PROF== Profiling “reshape1_kernel” - 1 (2/5): 0%…50%…100% - 7 passes
==PROF== Profiling “fused_fused_decode1_take_kernel” - 2 (3/5): 0%…50%…100% - 7 passes
==PROF== Profiling “reshape1_kernel” - 3 (4/5): 0%…50%…100% - 7 passes
==PROF== Profiling “rmsnorm_twoPassAlgo_e8” - 4 (5/5): 0%…50%…100% - 7 passes
==PROF== Disconnected from process 398
[398] python3.10@127.0.0.1
fused_fused_decode1_take_kernel (48, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %         7.33
Max Bandwidth                         %         7.23
L1/TEX Hit Rate                       %        66.82
L2 Compression Success Rate           %            0
L2 Compression Ratio                               0
L2 Hit Rate                           %        80.67
Mem Pipes Busy                        %         7.23
--------------------------- ----------- ------------

reshape1_kernel (48, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %         5.76
Max Bandwidth                         %         5.93
L1/TEX Hit Rate                       %        24.79
L2 Compression Success Rate           %            0
L2 Compression Ratio                               0
L2 Hit Rate                           %        51.45
Mem Pipes Busy                        %         5.33
--------------------------- ----------- ------------

fused_fused_decode1_take_kernel (28, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %         7.88
Max Bandwidth                         %         7.83
L1/TEX Hit Rate                       %        66.42
L2 Compression Success Rate           %            0
L2 Compression Ratio                               0
L2 Hit Rate                           %        83.46
Mem Pipes Busy                        %         7.83
--------------------------- ----------- ------------

reshape1_kernel (28, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %         4.52
Max Bandwidth                         %         4.12
L1/TEX Hit Rate                       %        24.62
L2 Compression Success Rate           %            0
L2 Compression Ratio                               0
L2 Hit Rate                           %        52.40
Mem Pipes Busy                        %         4.12
--------------------------- ----------- ------------

cutlass::rmsnorm_twoPassAlgo_e8(float4 *, const float4 *, const float4 *, int, int, float) (30, 1, 1)x(320, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: Memory Workload Analysis
--------------------------- ----------- ------------
Metric Name                 Metric Unit Metric Value
--------------------------- ----------- ------------
Mem Busy                              %        20.89
Max Bandwidth                         %        16.68
L1/TEX Hit Rate                       %        43.33
L2 Compression Success Rate           %            0
L2 Compression Ratio                               0
L2 Hit Rate                           %        54.49
Mem Pipes Busy                        %        11.67
--------------------------- ----------- ------------

It looks more like a setup issue.

  1. Can you check that NCU works properly with a simple CUDA SDK sample?
  2. Is it possible that limited device memory caused the "Launch Failed" error?

@veraj thanks so much for your response, which gives me confidence to continue. I think it is a system setup issue; I will flash the latest SDK 36 and see if I can fix it.

BTW, do you know how to properly install CUDA 12.4 and ncu?

For CUDA, is it this one: CUDA Toolkit 12.4 Downloads | NVIDIA Developer?

For ncu, is it this one: Index of /compute/cuda/redist/nsight_compute/linux-aarch64?

For CUDA, you should install the native package on the target side.

For ncu, you can use that redist package.
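
As a rough sketch (the package and archive names below are assumptions; pick the exact versions from those pages, and the CUDA apt repository from the downloads page must already be configured):

sudo apt install cuda-toolkit-12-4
tar -xf nsight_compute-linux-aarch64-<version>-archive.tar.xz
cd nsight_compute-linux-aarch64-<version>-archive/nsight-compute/<version>
./ncu --version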

Hi @veraj, if I use the same parameter as you, -c 5, Llama does not hang, but if I remove this parameter, it still hangs. Also, -c limits the number of kernel launches collected.

Would you please check internally how to trace the whole execution, for example if I launch this:

/opt/nvidia/nsight-compute/2024.1/ncu --mode=launch python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT --max-new-tokens 512
--prompt 'hi, how are you?'
--prompt 'whats the square root of 900?'
--prompt 'can I get a recipie for french onion soup?'

Hi, @MicroSW

We didn't see the hang even after removing -c 5.
In your case, I would suggest adding some options to the command line to reduce the overhead and avoid the hang. You can refer to 2. Kernel Profiling Guide — Nsight Compute 12.5 documentation.
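
For example, a hedged combination on the attach side (the kernel-name regex and launch count are only illustrations; adjust them to the kernels you care about):

./ncu --mode=attach --hostname 127.0.0.1 --section=MemoryWorkloadAnalysis -k "regex:decode" -c 20 -o report

-k/--kernel-name restricts which kernels get profiled and -c/--launch-count caps how many launches are collected, both of which reduce the number of serialized replays.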

hi @veraj

To be precise, the issue is not hanging: ncu is slowing down Llama's response. For the command mentioned above, it would take Llama 23 hours to finish. My original question is: how can I prevent ncu from slowing down Llama's response and instead just collect memory usage information in the background and output a report file?

Hi, @MicroSW

I am afraid the slowdown is expected: in either kernel or application replay, individual kernels are serialized. To avoid this you can select the range or app-range replay modes.
You can refer to 2. Kernel Profiling Guide — Nsight Compute 12.5 documentation for details.
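
A hedged example of selecting the replay mode on the attach side (assuming --replay-mode is honored there; note that both range modes require profiling ranges to be defined in the application, e.g. via cu(da)ProfilerStart/Stop or NVTX):

./ncu --mode=attach --hostname 127.0.0.1 --replay-mode app-range --section=MemoryWorkloadAnalysis -o report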

Hi @veraj

Thanks for your support. So individual kernels are serialized, and NCU measures each kernel's memory usage.

I've been running the SLM under NCU profiling for two days with a single prompt question as input. NCU produced a 4 GB+ profile report file, and I tried to open it with ncu-ui, but ncu-ui cannot open such a large profile file. Do you have any suggestions on how to get the memory profile data out of the report, for example using the command line to extract the data and output it to a CSV file?
I’ve been running SLM with the NCU profile for two days and I input one prompt question. and NCU produced a 4GB+ profile report file, and tried to open it with ncu-ui, but ncu-ui could not open such a large profile file. do you have any suggestions on how to get the report of memory profile data? such as using a command line, extracting the data, and outputting to a CVS file?