These are my two kernels. One is coalesced, the other is not, but the coalesced kernel's execution time is longer. I suspect this is caused by the caching mechanism, but I could not find a corresponding description in any of the materials. Can someone tell me what mechanism in the Turing architecture's L1 and L2 caches would cause this? Thank you very much.
The block size is (32, 16) and the grid size is ((data_width + block.x - 1) / block.x, (data_height + block.y - 1) / block.y).
data_width is 2048 and data_height is 1024.
My graphics card is a GeForce RTX 2080 Ti.
The following are my kernels and the results I got from profiling them with Nsight Compute.
coalesced kernel:
__global__ void kernel(void *src, void *dst, int height, int width)
{
    extern __shared__ float shmem[];
    // Type is assumed to be defined elsewhere (e.g. a 4-byte element type).
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;
    for (int i = 0; i < 3; i++)
    {
        // Consecutive lanes (threadIdx.x) touch consecutive elements: coalesced.
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x
                  + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}
instruction:
sudo ncu --set full ./a.out
not coalesced kernel:
__global__ void kernel(void *src, void *dst, int height, int width, int size)
{
    extern __shared__ float shmem[];
    // Type is assumed to be defined elsewhere (e.g. a 4-byte element type).
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;
    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        // Consecutive lanes touch elements 3 apart: not coalesced.
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}
not coalesced profile result:
==PROF== Disconnected from process 23925
[23925] oriVersionDemo@127.0.0.1
kernel(void*, void*, int, int, int), 2021-Sep-09 21:54:33, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 6.64
SM Frequency cycle/nsecond 1.30
Elapsed Cycles cycle 121,726
Memory [%] % 80.66
SOL DRAM % 80.66
Duration usecond 92.99
SOL L1/TEX Cache % 67.03
SOL L2 Cache % 58.01
SM Active Cycles cycle 116,210.72
SM [%] % 14.36
---------------------------------------------------------------------- --------------- ------------------------------
OK The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Memory Workload Analysis section.
OK The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 0% of
this device's fp32 peak performance and 0% of its fp64 peak performance.
Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active inst/cycle 0.17
Executed Ipc Elapsed inst/cycle 0.16
Issue Slots Busy % 4.16
Issued Ipc Active inst/cycle 0.17
SM Busy % 6.08
---------------------------------------------------------------------- --------------- ------------------------------
WRN All pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps per
scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput Gbyte/second 514.52
Mem Busy % 58.01
Max Bandwidth % 80.66
L1/TEX Hit Rate % 65.18
L2 Hit Rate % 73.11
Mem Pipes Busy % 14.36
---------------------------------------------------------------------- --------------- ------------------------------
Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
One or More Eligible % 4.20
Issued Warp Per Scheduler 0.04
No Eligible % 95.80
Active Warps Per Scheduler warp 6.24
Eligible Warps Per Scheduler warp 0.06
---------------------------------------------------------------------- --------------- ------------------------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 23.8 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
6.24 active warps per scheduler, but only an average of 0.06 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps either increase the number of active warps or reduce the time the active warps are stalled.
Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Warp Cycles Per Issued Instruction cycle 148.59
Warp Cycles Per Executed Instruction cycle 149.08
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 30.40
---------------------------------------------------------------------- --------------- ------------------------------
WRN On average each warp of this kernel spends 114.1 cycles being stalled waiting for a scoreboard dependency on
a L1TEX (local, global, surface, texture) operation. This represents about 76.8% of the total average of
148.6 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses
verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit
rates by increasing data locality or by changing the cache configuration, and consider moving frequently
used data to shared memory.
Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler inst 4,818.82
Executed Instructions inst 1,310,720
Avg. Issued Instructions Per Scheduler inst 4,834.82
Issued Instructions inst 1,315,072
---------------------------------------------------------------------- --------------- ------------------------------
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 512
Grid Size 4,096
Registers Per Thread register/thread 16
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block Kbyte/block 2.05
Static Shared Memory Per Block byte/block 0
Threads thread 2,097,152
Waves Per SM 30.12
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 16
Block Limit Registers block 8
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 77.84
Achieved Active Warps Per SM warp 24.91
---------------------------------------------------------------------- --------------- ------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3c0
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3e0
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3f0
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c400
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c410
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c420
coalesced profile result:
kernel(void*, void*, int, int), 2021-Sep-09 21:56:23, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 6.80
SM Frequency cycle/nsecond 1.34
Elapsed Cycles cycle 161,770
Memory [%] % 60.72
SOL DRAM % 60.72
Duration usecond 120.16
SOL L1/TEX Cache % 18.05
SOL L2 Cache % 23.28
SM Active Cycles cycle 144,314.88
SM [%] % 10.80
---------------------------------------------------------------------- --------------- ------------------------------
WRN Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see
where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're
efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory
access (kernel fusion) or whether there are values you can (re)compute.
OK The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 0% of
this device's fp32 peak performance and 0% of its fp64 peak performance.
Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active inst/cycle 0.22
Executed Ipc Elapsed inst/cycle 0.20
Issue Slots Busy % 5.52
Issued Ipc Active inst/cycle 0.22
SM Busy % 5.52
---------------------------------------------------------------------- --------------- ------------------------------
WRN All pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps per
scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput Gbyte/second 396.14
Mem Busy % 23.28
Max Bandwidth % 60.72
L1/TEX Hit Rate % 0
L2 Hit Rate % 50.06
Mem Pipes Busy % 10.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
One or More Eligible % 5.62
Issued Warp Per Scheduler 0.06
No Eligible % 94.38
Active Warps Per Scheduler warp 3.72
Eligible Warps Per Scheduler warp 0.08
---------------------------------------------------------------------- --------------- ------------------------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 17.8 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
3.72 active warps per scheduler, but only an average of 0.08 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps either increase the number of active warps or reduce the time the active warps are stalled.
Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Warp Cycles Per Issued Instruction cycle 66.29
Warp Cycles Per Executed Instruction cycle 66.37
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 31.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN On average each warp of this kernel spends 59.1 cycles being stalled waiting for a scoreboard dependency on a
L1TEX (local, global, surface, texture) operation. This represents about 89.2% of the total average of 66.3
cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses
verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit
rates by increasing data locality or by changing the cache configuration, and consider moving frequently
used data to shared memory.
Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler inst 7,951.06
Executed Instructions inst 2,162,688
Avg. Issued Instructions Per Scheduler inst 7,960.06
Issued Instructions inst 2,165,136
---------------------------------------------------------------------- --------------- ------------------------------
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 512
Grid Size 4,096
Registers Per Thread register/thread 22
Shared Memory Configuration Size Kbyte 65.54
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block Kbyte/block 49.15
Static Shared Memory Per Block byte/block 0
Threads thread 2,097,152
Waves Per SM 60.24
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 16
Block Limit Registers block 5
Block Limit Shared Mem block 1
Block Limit Warps block 2
Theoretical Active Warps per SM warp 16
Theoretical Occupancy % 50
Achieved Occupancy % 46.00
Achieved Active Warps Per SM warp 14.72
---------------------------------------------------------------------- --------------- ------------------------------
store efficiency and transaction counts:
instruction:
sudo ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum ./a.out
not coalesced version:
kernel(void*, void*, int, int, int), 2021-Sep-09 21:57:49, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum sector 2,359,296
smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct % 33.33
---------------------------------------------------------------------- --------------- ------------------------------
coalesced version:
kernel(void*, void*, int, int), 2021-Sep-09 21:58:54, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum sector 786,432
smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct % 100.53
---------------------------------------------------------------------- --------------- ------------------------------