What is the meaning of the word kernel in the memory workload analysis

I mean the left-most "Kernel" box in the image — what does it mean? Can I see it as the register file?

This chart is documented here: 2. Kernel Profiling Guide — Nsight Compute 12.4 documentation
"Kernel" does not represent the register file, but rather the kernel executing on the GPU, which issues the instructions.


Thank you! I really need to read the manual more carefully.

I’m sorry but I still have a question.

The red line from the L2 cache to shared memory: I know it's a new feature in Ampere, which enables cuda::memcpy_async. I wonder whether the data goes through the L1 cache.

As I understand it, the L1 cache and Shared MEM are physically the same hardware, just partitioned differently, right?
So does the data go directly from L2 to SMEM, or does it pass through L1?

@felix_dt could you please take some time to answer my question, thank you!


As I understand it, the L1 cache and Shared MEM are physically the same hardware, just partitioned differently, right?

That is correct, they are physically the same unit, partitioned into the two. However, if you load from global memory into shared memory, the data will still be implicitly cached in L1.

So does the data go directly from L2 to SMEM, or does it pass through L1?

It still passes through L1, but it is no longer staged into registers before reaching shared memory. This is implied by the line passing through L1 rather than around it. This blog post provides more background: https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
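To make that concrete, here is a minimal sketch of the Ampere async-copy path using the cooperative groups `memcpy_async` API. The kernel name and tile layout are illustrative assumptions, not from the thread; the point is that the copy from global memory into shared memory is issued without staging the data through registers.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block copies one tile of blockDim.x floats
// from global memory into shared memory via the async-copy path.
__global__ void scale_tile(const float* __restrict__ in,
                           float* __restrict__ out) {
    extern __shared__ float smem[];
    cg::thread_block block = cg::this_thread_block();

    const size_t tile_offset = static_cast<size_t>(blockIdx.x) * blockDim.x;

    // On Ampere and newer, this copy moves data L2 -> L1 -> shared memory
    // directly, without a round trip through each thread's registers.
    cg::memcpy_async(block, smem, in + tile_offset,
                     sizeof(float) * blockDim.x);
    cg::wait(block);  // block until the asynchronous copy has completed

    out[tile_offset + threadIdx.x] = smem[threadIdx.x] * 2.0f;
}

// Example launch (shared memory sized to one tile):
//   scale_tile<<<grid, block, block.x * sizeof(float)>>>(d_in, d_out);
```

On pre-Ampere hardware the same API still works, but the copy is emulated with ordinary register-staged loads and stores, so the red L2-to-SMEM line in the chart would not appear.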