NCU profiling with cache-control

cmos.matrix · March 14, 2023, 11:47pm

I am trying to profile a simple PyTorch model using NCU. I am using the --cache-control none flag, hoping that the cache is not flushed between kernels so that all the data stays resident in L2. The model is small enough for everything to fit in L2. But the ncu report still shows reads from device memory in the following kernel. What else could I be missing in the ncu profile? Or are there any limitations of --cache-control with a PyTorch model?

felix_dt · March 15, 2023, 10:37am

You will have to consider a few other factors that can cause your cache to have unexpected data.

- Is the same GPU also driving a display (such as an X server) or running any other concurrent workloads? That would result in device memory traffic, too.
- Are you collecting a set of metrics that requires the kernel to be replayed over multiple passes? (The command line output would indicate this.) If so, the tool saves and restores memory between the passes, which would definitely affect your caches. There are two options to deal with this:
  1. Profile a set of metrics that can be collected in a single pass. You would have to experiment to find a list that meets this criterion.
  2. Use --replay-mode application to run the entire application N times to collect N passes. This avoids the memory save and restore, since the application itself re-creates the state on each run. You need to ensure some level of determinism in the execution, though, so that the tool can match data to results across multiple runs.
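The two options above could be sketched as command lines roughly like the ones below. This is only an illustration: `build_ncu_command` is a hypothetical helper, `model.py` and `report` are placeholder names, and the metric names are examples of L2 and DRAM read counters. Whether a given metric list really fits into a single pass has to be confirmed from ncu's own output.

```python
import shlex


def build_ncu_command(script, metrics, report="report", app_replay=False):
    """Assemble an ncu command line as an argument list (hypothetical helper)."""
    # Keep the profiler from flushing caches between kernels.
    cmd = ["ncu", "--cache-control", "none"]
    if app_replay:
        # Re-run the whole application once per pass instead of
        # saving and restoring memory around each kernel replay.
        cmd += ["--replay-mode", "application"]
    cmd += ["--metrics", ",".join(metrics), "-o", report, "python", script]
    return cmd


# Option 1: a short metric list, assumed here to fit in one pass.
single_pass = build_ncu_command("model.py", ["dram__bytes_read.sum"])

# Option 2: a longer metric list collected with application replay.
app_replay = build_ncu_command(
    "model.py",
    ["dram__bytes_read.sum", "lts__t_sectors_op_read.sum"],
    app_replay=True,
)

print(shlex.join(single_pass))
print(shlex.join(app_replay))
```

The resulting lists can be passed to `subprocess.run` on a machine with Nsight Compute installed. For the application replay variant, the script itself should behave the same on every run (for example, fixed seeds and a fixed kernel launch order).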

cmos.matrix · March 15, 2023, 7:07pm

Thanks a lot for your feedback!
