About the number of store transactions on pascal

Device CC 6.1 Pascal
CUDA 10.2
i’m experementing with two kernels to fill 2D matrix with arbitrary numbers both kernels use 2D blocks and 2D grids
first kernel store numbers by matrix columns that is takes quickest varying component of thread index as index of matrix rows
second kernel store numbers by matrix rows that is takes quickest varying component of thread index as index of matrix columns

template
global void st_by_cols(const unsigned int rows, const unsigned int cols, T* out) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;//row index, inner loop
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;//col index, outer loop
if (i < rows && j < cols) {
unsigned long plain_idx = i * cols + j;
out[plain_idx] = (T)(i + j) / (cols + rows);
}
}

template
global void st_by_rows(const unsigned int rows, const unsigned int cols, T* out) {
unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;//col index, inner loop
unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;//row index, outer loop
if (i < rows && j < cols) {
unsigned long plain_idx = i * cols + j;
out[plain_idx] = (T)(i + j) / (cols + rows);
}
}

int main(int argc, const char* argv)
{
const unsigned int rows{ 1024 * 4 };
const unsigned int cols{ 1024 * 4 };
float* dev_ptr{};
const std::size_t size{ rows * cols * sizeof(float) };
cuda_error_check(cudaMalloc(&dev_ptr, size));
dim3 blockDim_st_by_cols{ std::stoul(argv[1]), std::stoul(argv[2]) };
dim3 gridDim_st_by_cols{ rows / blockDim_st_by_cols.x,cols / blockDim_st_by_cols.y };
dim3 blockDim_st_by_rows{ std::stoul(argv[3]), std::stoul(argv[4]) };
dim3 gridDim_st_by_rows{ cols / blockDim_st_by_rows.x,rows / blockDim_st_by_rows.y };
st_by_cols << <gridDim_st_by_cols, blockDim_st_by_cols >> > (rows, cols, dev_ptr);
st_by_rows << <gridDim_st_by_rows, blockDim_st_by_rows >> > (rows, cols, dev_ptr);
cuda_error_check(cudaFree(dev_ptr));
return 0;
}

launching kernels with different block sizes i got some unexpected profiler results for the number of write requests

result 1
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 2 32 16 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 17:41:41, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 2а103а587
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 17:41:41, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 2а103а587
---------------------------------------------------------------------- --------------- -----------------
as expected 2а103а587*32/16а777а216 = 4 32B sectors requests per warp in both case

result 2
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 4 32 8 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 17:45:29, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 8а395а043
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 17:45:29, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 4а200а739
---------------------------------------------------------------------- --------------- -----------------

result 3
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 8 32 4 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 18:01:01, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 8а395а043
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 18:01:01, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 4а200а739
---------------------------------------------------------------------- --------------- -----------------
roughly the same results are also reported by
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 2 32 16 32
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 4 32 8 32
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 8 32 4 32
Could someone help me answer some QUESTIONS about this results:

about result 2:
from CUDA C Best Practice Guide:
“…For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.”

threads within warp stores to 4 32B sectors (full sector) in both cases (in different order), equal number of requests expected.
Why the reported by profiler number of requests is not the same as for result 1 as threads within warp also stores to 4 32B sectors ?
Why profiler reports twice more requests for st_by_cols kernel ?
Why the number of requests per warp is not equal to 4 for any kernel?
(st_by_cols 8а395а04332/16а777а216 = 16 requests per warp; st_by_rows 4а200а73932/16а777а216 = 8 requests per warp)

about result 3: threads within warp stores to 8 32B sectors(16 byte/sector) in both cases (in different order), equal number of transactions expected.
Why profiler reports twice more requests for st_by_cols kernel ?
Why the number of requests is the same as for result 2 when threads within warp stores twice more 32B sectors?