About the number of store transactions on pascal

ivanmzk · June 1, 2020, 4:28pm

Device CC 6.1 Pascal
CUDA 10.2
i’m experementing with two kernels to fill 2D matrix with arbitrary numbers both kernels use 2D blocks and 2D grids
first kernel store numbers by matrix columns that is takes quickest varying component of thread index as index of matrix rows
second kernel store numbers by matrix rows that is takes quickest varying component of thread index as index of matrix columns

template
global void st_by_cols(const unsigned int rows, const unsigned int cols, T* out) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;//row index, inner loop
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;//col index, outer loop
if (i < rows && j < cols) {
unsigned long plain_idx = i * cols + j;
out[plain_idx] = (T)(i + j) / (cols + rows);
}
}

template
global void st_by_rows(const unsigned int rows, const unsigned int cols, T* out) {
unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;//col index, inner loop
unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;//row index, outer loop
if (i < rows && j < cols) {
unsigned long plain_idx = i * cols + j;
out[plain_idx] = (T)(i + j) / (cols + rows);
}
}

int main(int argc, const char* argv)
{
const unsigned int rows{ 1024 * 4 };
const unsigned int cols{ 1024 * 4 };
float* dev_ptr{};
const std::size_t size{ rows * cols * sizeof(float) };
cuda_error_check(cudaMalloc(&dev_ptr, size));
dim3 blockDim_st_by_cols{ std::stoul(argv[1]), std::stoul(argv[2]) };
dim3 gridDim_st_by_cols{ rows / blockDim_st_by_cols.x,cols / blockDim_st_by_cols.y };
dim3 blockDim_st_by_rows{ std::stoul(argv[3]), std::stoul(argv[4]) };
dim3 gridDim_st_by_rows{ cols / blockDim_st_by_rows.x,rows / blockDim_st_by_rows.y };
st_by_cols << <gridDim_st_by_cols, blockDim_st_by_cols >> > (rows, cols, dev_ptr);
st_by_rows << <gridDim_st_by_rows, blockDim_st_by_rows >> > (rows, cols, dev_ptr);
cuda_error_check(cudaFree(dev_ptr));
return 0;
}

launching kernels with different block sizes i got some unexpected profiler results for the number of write requests

result 1
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 2 32 16 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 17:41:41, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 2а103а587
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 17:41:41, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 2а103а587
---------------------------------------------------------------------- --------------- -----------------
as expected 2а103а587*32/16а777а216 = 4 32B sectors requests per warp in both case

result 2
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 4 32 8 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 17:45:29, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 8а395а043
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 17:45:29, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 4а200а739
---------------------------------------------------------------------- --------------- -----------------

result 3
nv-nsight-cu-cli --metrics launch__thread_count,lts__request_tex_write_sectors_global_nonatom matrix_fill.exe 8 32 4 32
void st_by_cols(unsigned int,unsigned int,float*), 2020-Jun-01 18:01:01, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 8а395а043
---------------------------------------------------------------------- --------------- -----------------
void st_by_rows(unsigned int,unsigned int,float*), 2020-Jun-01 18:01:01, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- -----------------
launch__thread_count thread 16а777а216
lts__request_tex_write_sectors_global_nonatom sector 4а200а739
---------------------------------------------------------------------- --------------- -----------------
roughly the same results are also reported by
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 2 32 16 32
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 4 32 8 32
nvprof --metrics l2_write_transactions .\x64\Release\matrix_fill.exe 8 32 4 32
Could someone help me answer some QUESTIONS about this results:

about result 2:
from CUDA C Best Practice Guide:
“…For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.”

threads within warp stores to 4 32B sectors (full sector) in both cases (in different order), equal number of requests expected.
Why the reported by profiler number of requests is not the same as for result 1 as threads within warp also stores to 4 32B sectors ?
Why profiler reports twice more requests for st_by_cols kernel ?
Why the number of requests per warp is not equal to 4 for any kernel?
(st_by_cols 8а395а04332/16а777а216 = 16 requests per warp; st_by_rows 4а200а73932/16а777а216 = 8 requests per warp)

about result 3: threads within warp stores to 8 32B sectors(16 byte/sector) in both cases (in different order), equal number of transactions expected.
Why profiler reports twice more requests for st_by_cols kernel ?
Why the number of requests is the same as for result 2 when threads within warp stores twice more 32B sectors?

Topic		Replies	Views
Matrix Transposition: Why are the number of Shared Store/Load Transactions different in a CUDA Kernel? Nsight Compute	2	674	July 5, 2022
Kernel launch failure plus Warp execution performance CUDA Programming and Performance hw , cuda	13	593	May 9, 2024
Fast CUDA implementation for calculating cross-norm-distance of two matrices CUDA Programming and Performance cuda	0	916	April 8, 2021
Efficient use of shared memory CUDA Programming and Performance	29	4099	December 2, 2019
Unexplained Output from Compute Visual Profiler CUDA Programming and Performance	8	1888	July 30, 2013
Using Nsight Compute to Inspect your Kernels Technical Blog	2	1647	August 31, 2020
How to accurately time individual memory operations CUDA Programming and Performance	17	6173	September 12, 2016
Need help about global memory access by threads CUDA Programming and Performance	4	1184	April 6, 2010
Branch divergence, Boundary element exchange Optimization and best practices CUDA Programming and Performance	9	18549	December 13, 2007
shared memory performance kernel execution timings with one block CUDA Programming and Performance	3	3167	May 6, 2007

About the number of store transactions on pascal

Related topics