Nvprof and visual profiler about memory and cache access again

13939941607 · March 21, 2022, 1:36am

Hello, I have a Jetson Nano. I want to run a deep learning inference program on it and analyze its memory access. The program is run by the Python 3 interpreter: it feeds an image into one of torch's built-in neural networks, such as resnet50, and outputs the inference result.
Question: I want to know how much data moves from DRAM to L2, then from L2 to L1, and then from L1 to the kernel while the program is running.
Here is my analysis process:

(1) I plan to use nvprof to analyze the program.
Because the Jetson Nano is compute capability 5.3, collecting the memory accesses from DRAM to L2 is not supported, as shown in the figure below (reference: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-5x)

(2) However, according to the output of nvprof --query-metrics, as shown in the following figure:

I can collect gld_transactions. Converted to MB (gld_transactions * 4 / 1024 / 1024), can it represent the number of bytes read by the kernel from the L1 cache?
I can also collect l2_global_load_bytes. Converted to MB (l2_global_load_bytes / 1024 / 1024), can it represent the number of bytes read by L1 from L2?
(3) Since nvprof's metrics cannot collect the amount of DRAM data accessed by L2, I found the Visual Profiler tool in the previous link, which can analyze the memory flow:

However, I found an article pointing out that the Jetson Nano cannot use Visual Profiler directly:

(4) So when I run the program, I use the following command:
sudo /usr/local/cuda/bin/nvprof -o tf-resnet50.nvvp python3 resnet50-infer.py
This saves the analysis results to tf-resnet50.nvvp. Opening tf-resnet50.nvvp with the same version of Visual Profiler on a PC gave the following results:

Does the total bytes value shown in the bottom right corner for memcpy (htod) represent the amount of data read by the L2 cache from DRAM? If yes, which metric listed by nvprof --query-metrics can I use on the Jetson Nano to collect the value corresponding to memcpy (htod)? I ask because Visual Profiler's data is collected by nvprof.
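For the metric collection described in step (2), the command line could be assembled as in the following minimal sketch. It assumes the same nvprof path and script name as the command in step (4); the helper name `build_nvprof_command` and the log file name are hypothetical, and whether both metrics are actually available should be checked against the nvprof --query-metrics output on the device.

```python
import shlex


def build_nvprof_command(script, metrics, log_file,
                         nvprof="/usr/local/cuda/bin/nvprof"):
    """Return an argv list that profiles `script` and logs the metrics as CSV."""
    return [
        nvprof,
        "--metrics", ",".join(metrics),  # comma-separated metric names
        "--csv",                         # emit the results in CSV form
        "--log-file", log_file,          # write profiler output to this file
        "python3", script,
    ]


if __name__ == "__main__":
    cmd = build_nvprof_command(
        "resnet50-infer.py",
        ["gld_transactions", "l2_global_load_bytes"],
        "resnet50-metrics.csv",  # hypothetical output file name
    )
    # Print the command so it can be copied into a shell (prefix with sudo
    # if the device requires it, as in step (4)).
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The sketch only builds and prints the command; it does not launch nvprof itself, so it can be run on any machine to check the arguments first.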

It can be summarized into three questions:

(1) Can gld_transactions, converted to MB (gld_transactions * 4 / 1024 / 1024), represent the number of bytes read by the kernel from the L1 cache?

(2) Can l2_global_load_bytes, converted to MB (l2_global_load_bytes / 1024 / 1024), represent the number of bytes read by L1 from L2?

(3) Does memcpy (htod) in Visual Profiler represent the amount of data read from DRAM by the L2 cache? If yes, which metric listed by nvprof --query-metrics can I use on the Jetson Nano to collect the value corresponding to memcpy (htod)?
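The unit conversions in questions (1) and (2) can be written out as a short sketch. The function names are hypothetical, and the factor of 4 bytes per transaction is simply the value used in the formula above, not a confirmed transaction size, so it is kept as an explicit parameter.

```python
BYTES_PER_MB = 1024 * 1024  # the post divides by 1024 twice


def transactions_to_mb(transactions, bytes_per_transaction=4):
    """Convert a transaction count to MB.

    The default of 4 bytes per transaction is an assumption taken from the
    formula in the question; pass a different value if it turns out to be wrong.
    """
    return transactions * bytes_per_transaction / BYTES_PER_MB


def bytes_to_mb(num_bytes):
    """Convert a byte count (e.g. l2_global_load_bytes) to MB."""
    return num_bytes / BYTES_PER_MB


if __name__ == "__main__":
    # Made-up counter values, chosen only to make the arithmetic easy to check.
    print(transactions_to_mb(262144))  # 262144 * 4 / 1024 / 1024 = 1.0
    print(bytes_to_mb(1048576))        # 1048576 / 1024 / 1024 = 1.0
```

Keeping the transaction size as a parameter means the same helper still applies if the correct size differs from 4 bytes.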

Because I couldn’t get any answer, I had to ask again.
These questions have bothered me for a long time. If you can answer them patiently, I will be very grateful.

kayccc · March 21, 2022, 1:42am

Duplicate of "Nvprof and visual profiler about memory and cache access?" (Jetson & Embedded Systems / Jetson Nano, NVIDIA Developer Forums)

13939941607 · March 21, 2022, 2:02am

Yes, it was posted by me, but there was no response, so I asked again.

13939941607 · March 21, 2022, 2:03am

Can you help solve this problem?

kayccc · March 21, 2022, 2:27am

Please don’t expect that we can support you over the weekend. Thanks.

13939941607 · March 21, 2022, 2:33am

OK, I thought so, which is why I’m asking you again now.

13939941607 · March 21, 2022, 2:34am

So do you have any good suggestions on this issue?
