Memory coalescing differences in consecutive kernel launches

I have a kernel that the host code launches several times in a loop.
However, the profiler reports different coalesced and uncoalesced access counts for these launches.

The profiler shows zero gst uncoalesced for the first kernel call, but 128 gst uncoalesced for the 5 subsequent calls (which all run the same kernel code).
It also shows 320 gld coalesced for the first kernel call, but 256 gld coalesced for the 5 subsequent calls.

Any thoughts?

Thanks much.

Do you have enough blocks so that the load is distributed evenly? And are you sure nothing changes between launches (block/grid size or the kernel arguments)?

Also, I have some kernels where only the first block is completely coalesced. If the multiprocessor (MP) that the profiler is sampling happens to run that first block in one launch but not in another, the results may differ.