How to measure memory workload with Nsight Compute when an SLM is running on Orin

Hi,
I watched this video and it is very cool. I want to know how to run this tool and measure the memory workload. I see that Orin has nsys-ui installed and I can launch it, but it looks different from the one in the video. Do you have a website?

Launch SLM command: jetson-containers run $(autotag nano_llm) python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

Hi, @MicroSW

The tool is Nsight Compute. Its binary is ncu-ui. Do you have it on your Orin?
You can refer to Getting Started with Nsight Compute | NVIDIA Developer for downloads and documentation.

hi @veraj

Thanks. I installed this tool, ncu-ui. Do you have any idea how to trace the memory workload? I used this command to trace:

sudo nsys profile --trace=cuda --cudabacktrace memory --cuda-memory-usage true --gpu-metrics-device help jetson-containers run $(autotag nano_llm) python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

This cannot capture any GPU memory info. Do you know how to modify the above command to trace the memory workload?

The second question: this SLM is running in a container. Do you know whether ncu or nsys can trace the SLM inside the container?

I tried ncu:

It always shows:
Preparing to launch…

Launched process: jetson-containers (pid: 6101)

/mnt/jetson-containers/jetson-containers run $(autotag nano_llm) python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

Attempting to automatically connect…

Searching for attachable process 6101 on local socket…


Please enter your container and execute:

sudo ncu --set full -o report python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

Then open the generated report in NCU-UI; you can see the memory details on the “Details” page.
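If running the GUI on the device is inconvenient, the same memory tables can also be printed on the command line by importing the saved report. This is a sketch assuming the report produced by the command above is named report.ncu-rep:

```shell
# Print the "Details" page (which includes Memory Workload Analysis)
# from a saved report, without the GUI
ncu --import report.ncu-rep --page details
```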

Thanks so much for your support. Now the last question: do you know how to let it run in the background without interrupting my input? With this command I cannot type my question into the SLM.

It keeps showing this:

Hi, @MicroSW

This can be achieved using the --mode option. You can launch the application from one terminal and attach to it from another; this way you can give input in the launch terminal while the profiling logs appear in the attach terminal.

Pasting from ./ncu --help:
Launch an application for later attach:
ncu --mode=launch MyApp
Attach to a previously launched application:
ncu --mode=attach --hostname 127.0.0.1

hi @veraj

Thanks so much again. I launched the app in one terminal and then attached from a second terminal; the results are below:

launch command: /opt/nvidia/nsight-compute/2022.2.1/ncu --mode=launch python3 -m nano_llm.chat --api=mlc --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

attach command: ncu --mode=attach --hostname 127.0.0.1

The input terminal was stuck there forever and didn’t show a response, while the second terminal kept showing the log:

wait…

I left it running for half an hour, and then it showed: “I am,”

It seems that ncu pauses the response. When I stop ncu in the second terminal, the response shows smoothly. Do you know how to make ncu not pause the response?

Hi, @MicroSW

Please add more filter options to your ncu --mode=attach command line to reduce the profiling workload. See “4. Nsight Compute CLI” in the Nsight Compute 12.4 documentation, under “Customizing data collection”.
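As a sketch of what such filtering could look like (the skip/count values below are placeholders to tune for your workload), limiting how many kernel launches are profiled usually reduces the stall the most:

```shell
# Attach, but only profile a small window of kernel launches:
#   --launch-skip   skip the first N launches (e.g. model warm-up)
#   --launch-count  stop profiling after N launches
#   --section       collect only the memory workload tables
ncu --mode=attach --hostname 127.0.0.1 \
    --section MemoryWorkloadAnalysis \
    --launch-skip 100 --launch-count 10 \
    -f -o report
```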

Hi @veraj

I really appreciate your support. I tried several options, but they failed. I only want to measure and collect memory data; do you know which option I should add?

I tried:

1. ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis -o report --hostname 127.0.0.1
2. ./ncu --mode=attach -f --section=MemoryWorkloadAnalysis --section=MemoryWorkloadAnalysis_Chart --section=MemoryWorkloadAnalysis_Tables -o report --hostname 127.0.0.1

What do you mean by “failed”?
Do you see any error printed, or does ncu still seem to pause the response?

@veraj

Sorry for the confusing statement. I meant it’s the same as with no filter options: ncu still blocks the response speed. Would you please check and tell me how I can achieve this:

1. let ncu collect in the background, without blocking input and output
2. have ncu collect only the memory workload, bandwidth, and throughput

Thanks! I will check internally.

Plus, is it possible to provide a mini-repro that shows the blocking issue?

Hi @veraj

Do you mean the log or a screenshot like this:

1. launch the app, and attach

2. type text into llama; then, as you can see, the output/response from llama is blocked:

Hi, @MicroSW

I noticed you are using a very old version, 2022.2.1. We have added many new features and fixes since then.
Can you update to the latest version and check?

In the meantime, we’ll try to repro this internally.

@veraj I noticed the latest ncu doesn’t have a separate package that allows us to install ncu alone; it says:

Nsight Compute is available in the CUDA Toolkit bundled in the JetPack SDK.

Do you have any idea which stable version has a tar package, or can be installed via apt?

I tried to install it via SDK Manager, but ncu is not in the developer tools list:

Yes, this is by design. SDK Manager recently removed the standalone Nsight Compute entry, as it is already bundled in the CUDA Toolkit.