NCU profiling with cache-control

I am trying to profile a simple PyTorch model using NCU. I am using the --cache-control none flag, hoping that the caches are not flushed between kernels so that all the data stays resident in L2. The model is small enough to fit entirely in L2. But the ncu report still shows reads from device memory in the following kernel. What else could I be missing in the ncu profile? Or are there any limits on --cache-control for a PyTorch model?

You will have to consider a few other factors that can cause your cache to have unexpected data.

  • Is the same GPU running a display (such as an X server), or any other concurrent workloads? This would result in device memory traffic, too.
  • Are you collecting a set of metrics that requires the kernel to be replayed over multiple passes (the command line output would indicate this)? If so, the tool saves and restores memory between the passes, which would definitely affect your caches. There are two options to deal with this:
      • Profile a set of metrics that can be collected in a single pass. You would have to experiment to find a list that fulfills this criterion.
      • Use --replay-mode application to run the entire application N times to collect N passes. This avoids the memory save and restore, because the application recreates its own state on each run. You do need to ensure some level of determinism in the execution, though, so that the tool can match data to results across multiple runs.
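
As a rough illustration of the two options, here is a minimal sketch that assembles the corresponding ncu command lines. The script name model.py and the helper build_ncu_command are hypothetical. The metric dram__bytes_read.sum is only an example of a short metric list: whether a given list really fits in a single pass has to be checked in the ncu command line output.

```python
import shlex


def build_ncu_command(target, metrics=None, app_replay=False):
    """Assemble an ncu command line as an argument list (hypothetical helper)."""
    # Keep the tool from flushing caches between kernels.
    cmd = ["ncu", "--cache-control", "none"]
    if metrics:
        # Option 1: a short metric list has a better chance of fitting in one pass.
        cmd += ["--metrics", ",".join(metrics)]
    if app_replay:
        # Option 2: rerun the whole application per pass instead of replaying kernels.
        cmd += ["--replay-mode", "application"]
    return cmd + list(target)


# "model.py" is a placeholder for the PyTorch script being profiled.
single_pass = build_ncu_command(
    ["python", "model.py"],
    metrics=["dram__bytes_read.sum"],  # example metric; assumed to fit in one pass
)
app_replay = build_ncu_command(["python", "model.py"], app_replay=True)

print(shlex.join(single_pass))
print(shlex.join(app_replay))
```

The printed lines can be pasted into a shell. For the application replay variant, the script itself should run the same kernels in the same order on every launch (fixed seeds, fixed input shapes), otherwise the results from different runs may not line up.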

Thanks a lot for your feedback!