Approximately the only method for in-kernel timing/profiling is the use of functions like clock64() (clock() and the PTX globaltimer register would be functions in the same category, from my perspective, at this level of discussion). It is extensively discussed in forum posts, so a search will turn those up for you. There are various challenges:
the compiler may not respect the ordering of (C++ source) code you have.
the conversion from C++ source code to SASS by the compiler may include things you didn’t expect, or exclude things you did expect to be included
a load from global operation is usually “fire-and-forget” or asynchronous, so casual attempts at timing may not capture the time to actually fetch the data (if that is what you are interested in). They may only capture the time to “issue” the instruction.
When I am using clock64() this way, I usually try to keep my test cases “simple” and confirm by studying the SASS that my intent is actually reflected at the SASS level.
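As an illustration of the “keep it simple” approach, here is a hypothetical sketch of bracketing a small region with clock64() inside a kernel. The kernel name and the work being timed are made up for the example; the key points are that the result is consumed (written to global memory) so the compiler cannot optimize the timed work away, and that even so, the only way to be sure the brackets land where you intend is to inspect the SASS.

```cpp
// Hypothetical sketch: timing a region inside a kernel with clock64().
// Writing the result out keeps the compiler from eliminating the timed work.
__global__ void timed_kernel(const float *in, float *out, long long *elapsed)
{
    long long t0 = clock64();
    float v = in[threadIdx.x] * 2.0f;   // the work being timed (made-up example)
    long long t1 = clock64();
    out[threadIdx.x] = v;               // consume v so it isn't optimized away
    if (threadIdx.x == 0)
        *elapsed = t1 - t0;             // elapsed cycles, per-SM clock domain
}
```

Note that clock64() returns per-SM cycle counts, so comparing values across blocks that may run on different SMs is not meaningful; keep the measurement within a single warp or block.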
In my case, I wrote two kernels (no use of async) for the same problem. The first one has double the memory traffic from L2 cache to gmem (profiled by ncu), but 10% less computation than the second one. Surprisingly, I find that the first one is about 10% faster than the second one across nearly all data sizes and shapes. Since I haven’t used async, I think I need to measure the data loading time to figure out where the problem is (profiling shows very few bank conflicts and very little uncoalesced memory access).
measuring data loading time will be more difficult (using bracketing with clock64()). At the SASS level, a global load instruction (LD or LDG) will be “asynchronous” in the sense that the instruction will issue (and appear to complete, in the sense that the next instruction can be issued), but the data requested will not (necessarily) be loaded into the designated register(s). A simple bracketing with clock64() will therefore capture the time to issue the LD or LDG instruction – which might typically be on the order of 10 cycles – but not the time to load the data – which might typically be on the order of 100 cycles or more, if going to actual DRAM.
So, just as with timing kernels, you would need to “capture” both the instruction issue and (probably) the consumption of the data within the clock64()-bracketed region. Because the warp is almost surely stalled at that point, the net effect may be that you are effectively timing other things besides just the time to load the data. (When timing kernels, you likewise need to capture both the issue point – the launch – and a synchronization point, typically.)
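The difference between bracketing only the issue and bracketing the issue plus the consumption can be sketched as follows. This is a hypothetical example (the kernel and variable names are made up); as noted above, the compiler is free to reorder things, so the SASS must be inspected to confirm the clock64() calls actually bracket what you intend.

```cpp
// Hypothetical sketch: two bracketing patterns around a global load.
__global__ void time_load(const int *g, int *out,
                          long long *t_issue, long long *t_consume)
{
    long long t0 = clock64();
    int v = g[0];                 // LDG issues here ("fire-and-forget")
    long long t1 = clock64();     // t1 - t0: roughly the issue time, ~10 cycles
    int r = v + 1;                // first use of v: warp stalls until data arrives
    long long t2 = clock64();     // t2 - t0: includes the actual load latency
    out[0] = r;                   // keep the value live so nothing is optimized out
    *t_issue   = t1 - t0;
    *t_consume = t2 - t0;
}
```

In practice the compiler may move the first use of v, or the clock64() calls themselves, so the two measurements are only trustworthy after confirming the instruction ordering at the SASS level.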
It’s not a trivial matter to use clock64() this way, and get sensible results. It may require some iterations with your test case coding, and studying of SASS as well as results.
Nevertheless, those two approaches (using the profiler, or using e.g. clock64()) are the two most likely avenues to get more info about what is going on, I would guess.
When you compare profiler statistics from Nsight Compute for the two variants, are there any significant differences in any of the metrics? If so, I would expect them to point in the direction of the underlying root causes.
I acknowledge that even with profiler statistics in hand, it is not always possible to pinpoint an exact root cause for observed performance differences. This is a general issue with complex processors (think butterfly effect) and not limited to GPUs, but probably more pronounced for these because the internal details of GPUs are not necessarily documented to the level of detail one would need to truly understand what is happening.
A practical approach in the context of insufficiently accurate modelling, whether in the form of mental models or actual software models, is to adopt an auto-tuning process that uses training run(s) to determine the optimal configuration from among multiple design elements for a particular workload. There is a long history of this in the BLAS and particularly the GEMM space. A prominent example of this is ATLAS. Auto-tuning approaches have in general proven competitive with hand-coded implementations by experts (a historical example would be GotoBLAS).
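A minimal auto-tuning loop of this kind can be sketched on the host side: launch the same kernel under several candidate configurations (here, block sizes), time each with CUDA events, and keep the fastest. The kernel `myKernel` below is a made-up placeholder for the real workload; a production auto-tuner would sweep more design parameters (tile sizes, unroll factors, etc.) and average over more runs.

```cpp
// Hypothetical sketch of an auto-tuning training run: time a placeholder
// kernel at several block sizes and keep the fastest configuration.
#include <cstdio>
#include <cfloat>

__global__ void myKernel(float *d, int n)   // stand-in for the real workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512, 1024};
    int best_block = 0;
    float best_ms = FLT_MAX;
    for (int b : candidates) {
        int grid = (n + b - 1) / b;
        myKernel<<<grid, b>>>(d, n);          // warm-up launch
        cudaEventRecord(start);
        for (int rep = 0; rep < 10; rep++)    // average over several launches
            myKernel<<<grid, b>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = b; }
    }
    printf("best block size: %d (%.3f ms / 10 launches)\n", best_block, best_ms);
    cudaFree(d);
    return 0;
}
```

The chosen configuration can then be cached per device and problem size, which is essentially what libraries in the ATLAS tradition do during their install-time training runs.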