How do the Turing architecture's L1 and L2 caches work?

These are my two kernels. One is coalesced, the other is not, yet the coalesced kernel's execution time is longer. I suspect the caching mechanism may be the cause, but I have not found a corresponding description in any of the materials. Can someone tell me what mechanism in the Turing architecture's L1 and L2 caches would produce this behavior? Thank you very much.

The block size is (32, 16) and the grid size is ((data_width + block.x - 1) / block.x, (data_height + block.y - 1) / block.y), where data_width is 2048 and data_height is 1024.
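
For reference, a minimal sketch of the launch configuration described above (my host code is not shown in this post, so this is only a sketch; the argument order follows the kernel signatures below):

dim3 block(32, 16);                                // 512 threads per block
dim3 grid((data_width  + block.x - 1) / block.x,   // 2048 / 32 = 64
          (data_height + block.y - 1) / block.y);  // 1024 / 16 = 64
kernel<<<grid, block>>>(src, dst, data_height, data_width);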

My graphics card is a GeForce RTX 2080 Ti.

The following are my kernels and the results I got from profiling them with Nsight Compute.

coalesced kernel:

__global__ void kernel(void *src, void *dst, int height, int width)
{
    extern __shared__ float shmem[];
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}

Profiling command:

sudo ncu --set full ./a.out

not coalesced kernel:

__global__ void kernel(void *src, void *dst, int height, int width, int size)
{
    extern __shared__ float shmem[];
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

not coalesced profile result:

==PROF== Disconnected from process 23925
[23925] oriVersionDemo@127.0.0.1
  kernel(void*, void*, int, int, int), 2021-Sep-09 21:54:33, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           6.64
    SM Frequency                                                             cycle/nsecond                           1.30
    Elapsed Cycles                                                                   cycle                        121,726
    Memory [%]                                                                           %                          80.66
    SOL DRAM                                                                             %                          80.66
    Duration                                                                       usecond                          92.99
    SOL L1/TEX Cache                                                                     %                          67.03
    SOL L2 Cache                                                                         %                          58.01
    SM Active Cycles                                                                 cycle                     116,210.72
    SM [%]                                                                               %                          14.36
    ---------------------------------------------------------------------- --------------- ------------------------------
    OK    The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To   
          further improve performance, work will likely need to be shifted from the most utilized to another unit.      
          Start by analyzing workloads in the Memory Workload Analysis section.                                         

    OK    The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 0% of 
          this device's fp32 peak performance and 0% of its fp64 peak performance.                                      

    Section: Compute Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Executed Ipc Active                                                         inst/cycle                           0.17
    Executed Ipc Elapsed                                                        inst/cycle                           0.16
    Issue Slots Busy                                                                     %                           4.16
    Issued Ipc Active                                                           inst/cycle                           0.17
    SM Busy                                                                              %                           6.08
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   All pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps per       
          scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.                 

    Section: Memory Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Memory Throughput                                                         Gbyte/second                         514.52
    Mem Busy                                                                             %                          58.01
    Max Bandwidth                                                                        %                          80.66
    L1/TEX Hit Rate                                                                      %                          65.18
    L2 Hit Rate                                                                          %                          73.11
    Mem Pipes Busy                                                                       %                          14.36
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Scheduler Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    One or More Eligible                                                                 %                           4.20
    Issued Warp Per Scheduler                                                                                        0.04
    No Eligible                                                                          %                          95.80
    Active Warps Per Scheduler                                                        warp                           6.24
    Eligible Warps Per Scheduler                                                      warp                           0.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only      
          issues an instruction every 23.8 cycles. This might leave hardware resources underutilized and may lead to    
          less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of    
          6.24 active warps per scheduler, but only an average of 0.06 warps were eligible per cycle. Eligible warps    
          are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible   
          warp results in no instruction being issued and the issue slot remains unused. To increase the number of      
          eligible warps either increase the number of active warps or reduce the time the active warps are stalled.    

    Section: Warp State Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Warp Cycles Per Issued Instruction                                               cycle                         148.59
    Warp Cycles Per Executed Instruction                                             cycle                         149.08
    Avg. Active Threads Per Warp                                                                                       32
    Avg. Not Predicated Off Threads Per Warp                                                                        30.40
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   On average each warp of this kernel spends 114.1 cycles being stalled waiting for a scoreboard dependency on  
          a L1TEX (local, global, surface, texture) operation. This represents about 76.8% of the total average of      
          148.6 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses  
          verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit      
          rates by increasing data locality or by changing the cache configuration, and consider moving frequently      
          used data to shared memory.                                                                                   

    Section: Instruction Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Avg. Executed Instructions Per Scheduler                                          inst                       4,818.82
    Executed Instructions                                                             inst                      1,310,720
    Avg. Issued Instructions Per Scheduler                                            inst                       4,834.82
    Issued Instructions                                                               inst                      1,315,072
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        512
    Grid Size                                                                                                       4,096
    Registers Per Thread                                                   register/thread                             16
    Shared Memory Configuration Size                                                 Kbyte                          32.77
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                            Kbyte/block                           2.05
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      2,097,152
    Waves Per SM                                                                                                    30.12
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             16
    Block Limit Registers                                                            block                              8
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          77.84
    Achieved Active Warps Per SM                                                      warp                          24.91
    ---------------------------------------------------------------------- --------------- ------------------------------

    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3c0              
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3e0              
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c3f0              
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c400              
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c410              
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   Uncoalesced global access, expected 262144 transactions, got 786432 (3.00x) at PC 0x7f746ef9c420      

coalesced version:

kernel(void*, void*, int, int), 2021-Sep-09 21:56:23, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           6.80
    SM Frequency                                                             cycle/nsecond                           1.34
    Elapsed Cycles                                                                   cycle                        161,770
    Memory [%]                                                                           %                          60.72
    SOL DRAM                                                                             %                          60.72
    Duration                                                                       usecond                         120.16
    SOL L1/TEX Cache                                                                     %                          18.05
    SOL L2 Cache                                                                         %                          23.28
    SM Active Cycles                                                                 cycle                     144,314.88
    SM [%]                                                                               %                          10.80
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see      
          where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're           
          efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory  
          access (kernel fusion) or whether there are values you can (re)compute.                                       

    OK    The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 0% of 
          this device's fp32 peak performance and 0% of its fp64 peak performance.                                      

    Section: Compute Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Executed Ipc Active                                                         inst/cycle                           0.22
    Executed Ipc Elapsed                                                        inst/cycle                           0.20
    Issue Slots Busy                                                                     %                           5.52
    Issued Ipc Active                                                           inst/cycle                           0.22
    SM Busy                                                                              %                           5.52
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   All pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps per       
          scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.                 

    Section: Memory Workload Analysis
    ---------------------------------------------------------------------- --------------- ------------------------------
    Memory Throughput                                                         Gbyte/second                         396.14
    Mem Busy                                                                             %                          23.28
    Max Bandwidth                                                                        %                          60.72
    L1/TEX Hit Rate                                                                      %                              0
    L2 Hit Rate                                                                          %                          50.06
    Mem Pipes Busy                                                                       %                          10.80
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Scheduler Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    One or More Eligible                                                                 %                           5.62
    Issued Warp Per Scheduler                                                                                        0.06
    No Eligible                                                                          %                          94.38
    Active Warps Per Scheduler                                                        warp                           3.72
    Eligible Warps Per Scheduler                                                      warp                           0.08
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only      
          issues an instruction every 17.8 cycles. This might leave hardware resources underutilized and may lead to    
          less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of    
          3.72 active warps per scheduler, but only an average of 0.08 warps were eligible per cycle. Eligible warps    
          are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible   
          warp results in no instruction being issued and the issue slot remains unused. To increase the number of      
          eligible warps either increase the number of active warps or reduce the time the active warps are stalled.    

    Section: Warp State Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Warp Cycles Per Issued Instruction                                               cycle                          66.29
    Warp Cycles Per Executed Instruction                                             cycle                          66.37
    Avg. Active Threads Per Warp                                                                                       32
    Avg. Not Predicated Off Threads Per Warp                                                                        31.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   On average each warp of this kernel spends 59.1 cycles being stalled waiting for a scoreboard dependency on a 
          L1TEX (local, global, surface, texture) operation. This represents about 89.2% of the total average of 66.3   
          cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses        
          verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit      
          rates by increasing data locality or by changing the cache configuration, and consider moving frequently      
          used data to shared memory.                                                                                   

    Section: Instruction Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Avg. Executed Instructions Per Scheduler                                          inst                       7,951.06
    Executed Instructions                                                             inst                      2,162,688
    Avg. Issued Instructions Per Scheduler                                            inst                       7,960.06
    Issued Instructions                                                               inst                      2,165,136
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        512
    Grid Size                                                                                                       4,096
    Registers Per Thread                                                   register/thread                             22
    Shared Memory Configuration Size                                                 Kbyte                          65.54
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                            Kbyte/block                          49.15
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      2,097,152
    Waves Per SM                                                                                                    60.24
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             16
    Block Limit Registers                                                            block                              5
    Block Limit Shared Mem                                                           block                              1
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             16
    Theoretical Occupancy                                                                %                             50
    Achieved Occupancy                                                                   %                          46.00
    Achieved Active Warps Per SM                                                      warp                          14.72
    ---------------------------------------------------------------------- --------------- ------------------------------

efficiency and transaction:

Profiling command:

sudo ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum ./a.out

not coalesced Version:

  kernel(void*, void*, int, int, int), 2021-Sep-09 21:57:49, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum                                  sector                      2,359,296
    smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct                        %                          33.33
    ---------------------------------------------------------------------- --------------- ------------------------------

coalesced Version:

  kernel(void*, void*, int, int), 2021-Sep-09 21:58:54, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum                                  sector                        786,432
    smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct                        %                         100.53
    ---------------------------------------------------------------------- --------------- ------------------------------

What is Type?

I’m guessing it is a byte-sized type.

While a T4 is not a 2080 Ti, this paper may be useful: "Dissecting the NVidia Turing T4 GPU via Microbenchmarking":

yeah, “unsigned char”

First, I’d like to say that this is, in my opinion, absolutely bizarre code for what amounts to a simple 2D copy-kernel. Given that, I’ve done only very limited analysis and conjecture.

My theory is that you are exploring the effect of the L2 cacheline. My mental model of cachelines is either 128 bytes or 32 bytes for L1, and 32 bytes for L2, but if you look at the document helpfully attached by rs277, you'll note that they experimentally measure the L2 cacheline of TU104 (and V100) at 64 bytes (page 22, table 3.1). In a nutshell, the significance is as follows. The slower (coalesced) kernel loads 64 bytes (because that is what happens on an L2 miss) but only processes 32 of those:

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }

When i is zero, we load 64 bytes but only process 32. When i is one, we stride far away from the previous load, therefore loading another 64 bytes but only processing 32. The faster (non-coalesced) kernel loads 128 bytes (at first) then processes 96 of those:

    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }

Now, according to my testing this copy kernel isn’t doing any extra work, and isn’t missing any copying of bytes, so if the dataset size is comparable to the L2 cache size, this kind of “inefficiency” isn’t going to matter much. Eventually those loaded L2 cachelines will get processed, and not much gets evicted before that.

On my V100 that is pretty much what I see:

$ cat t1888.cu
#include <iostream>
typedef unsigned char Type;

__global__ void kernel_c(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}


__global__ void kernel_nc(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(){
  typedef uchar3 st;
  const int data_width=2048;
  const int data_height=1024;
  dim3 block(32,16);
  dim3 grid((data_width+block.x-1)/block.x , (data_height+block.y-1)/block.y);
  st *src, *dst;
  Type *h1, *h2;
  h1 = new Type[3*data_width*data_height];
  h2 = new Type[3*data_width*data_height];
  cudaMalloc(&src, data_width*data_height*sizeof(st));
  cudaMalloc(&dst, data_width*data_height*sizeof(st));
  cudaMemset(src, 1, 3*data_width*data_height*sizeof(Type));
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h1, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != 0x01) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h2, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaDeviceSynchronize();
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != h2[i]) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
}
$ nvcc -o t1888 t1888.cu
$ nvprof ./t1888
==20787== NVPROF is profiling process 20787, command: ./t1888
==20787== Profiling application: ./t1888
==20787== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.20%  4.1832ms         2  2.0916ms  2.0528ms  2.1304ms  [CUDA memcpy DtoH]
                    1.61%  69.826us         2  34.913us  32.257us  37.569us  kernel_c(void*, void*, int, int)
                    1.47%  63.970us         2  31.985us  30.081us  33.889us  kernel_nc(void*, void*, int, int)
                    0.72%  31.328us         3  10.442us  9.0240us  12.288us  [CUDA memset]
      API calls:   95.89%  326.72ms         2  163.36ms  199.40us  326.52ms  cudaMalloc
                    1.98%  6.7458ms         2  3.3729ms  3.3274ms  3.4184ms  cudaMemcpy
                    1.30%  4.4143ms         4  1.1036ms  506.83us  2.8732ms  cuDeviceTotalMem
                    0.66%  2.2354ms       404  5.5330us     340ns  246.61us  cuDeviceGetAttribute
                    0.10%  354.98us         4  88.744us  53.975us  190.95us  cuDeviceGetName
                    0.03%  98.233us         3  32.744us  12.354us  72.771us  cudaMemset
                    0.02%  81.767us         4  20.441us  10.054us  28.668us  cudaLaunchKernel
                    0.01%  36.837us         1  36.837us  36.837us  36.837us  cudaDeviceSynchronize
                    0.01%  21.209us         4  5.3020us  3.2670us  7.5900us  cuDeviceGetPCIBusId
                    0.00%  12.084us         8  1.5100us     357ns  4.8880us  cuDeviceGet
                    0.00%  3.3610us         3  1.1200us     683ns  1.5180us  cuDeviceGetCount
                    0.00%  2.7130us         4     678ns     617ns     814ns  cuDeviceGetUuid
$

The V100 L2 is 6MB, and this data set is 2048x1024x3 = 6MB. Yes, the non-coalesced kernel is faster but the difference is less than 10%. The L2 cache is “fixing” things for me due to its size. Your observation probably is different if your Turing L2 cache is significantly different than 6MB. So let’s explore that.
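
(If you want to check the L2 size on your own device, it is reported by cudaGetDeviceProperties; here is a minimal sketch:)

#include <cstdio>
#include <cuda_runtime.h>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // device 0
  std::printf("%s: L2 cache size = %d bytes (%.1f MB)\n",
              prop.name, prop.l2CacheSize, prop.l2CacheSize / (1024.0 * 1024.0));
  return 0;
}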

What if we make the data set size much bigger than the L2 cache? Now the difference becomes somewhat more pronounced (the only thing I did differently here was to change the 2048 dimension to 20480, for a 60MB data set):

$ cat t1888.cu
#include <iostream>
typedef unsigned char Type;

__global__ void kernel_c(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}


__global__ void kernel_nc(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(){
  typedef uchar3 st;
  const int data_width=20480;
  const int data_height=1024;
  dim3 block(32,16);
  dim3 grid((data_width+block.x-1)/block.x , (data_height+block.y-1)/block.y);
  st *src, *dst;
  Type *h1, *h2;
  h1 = new Type[3*data_width*data_height];
  h2 = new Type[3*data_width*data_height];
  cudaMalloc(&src, data_width*data_height*sizeof(st));
  cudaMalloc(&dst, data_width*data_height*sizeof(st));
  cudaMemset(src, 1, 3*data_width*data_height*sizeof(Type));
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h1, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != 0x01) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h2, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaDeviceSynchronize();
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != h2[i]) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
}
$ nvcc -o t1888 t1888.cu
$ nvprof ./t1888
==20889== NVPROF is profiling process 20889, command: ./t1888
==20889== Profiling application: ./t1888
==20889== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.49%  55.756ms         2  27.878ms  27.802ms  27.954ms  [CUDA memcpy DtoH]
                    1.16%  663.18us         2  331.59us  330.50us  332.68us  kernel_c(void*, void*, int, int)
                    0.96%  549.77us         2  274.89us  274.57us  275.21us  kernel_nc(void*, void*, int, int)
                    0.39%  224.52us         3  74.839us  73.697us  76.323us  [CUDA memset]
      API calls:   81.91%  301.31ms         2  150.65ms  283.47us  301.02ms  cudaMalloc
                   15.85%  58.304ms         2  29.152ms  28.984ms  29.320ms  cudaMemcpy
                    1.35%  4.9799ms         4  1.2450ms  586.36us  3.1946ms  cuDeviceTotalMem
                    0.65%  2.3996ms       404  5.9390us     257ns  255.54us  cuDeviceGetAttribute
                    0.08%  285.97us         1  285.97us  285.97us  285.97us  cudaDeviceSynchronize
                    0.07%  262.82us         4  65.706us  57.620us  85.240us  cuDeviceGetName
                    0.04%  142.26us         4  35.565us  12.430us  86.821us  cudaLaunchKernel
                    0.03%  117.22us         3  39.074us  12.752us  71.752us  cudaMemset
                    0.01%  20.992us         4  5.2480us  3.1950us  7.8190us  cuDeviceGetPCIBusId
                    0.00%  11.067us         8  1.3830us     434ns  4.7270us  cuDeviceGet
                    0.00%  4.8840us         3  1.6280us     673ns  2.6940us  cuDeviceGetCount
                    0.00%  2.9810us         4     745ns     637ns     874ns  cuDeviceGetUuid
$

The non-coalesced kernel here is 20% faster. Things are definitely getting evicted with this larger data set size. Any strategy that processes more of the fetched data before it gets evicted is going to be favored.

How could we fix this? The way I would recommend is to use simple copy indexing that processes the data in a contiguous fashion. A grid-stride loop would be a good choice, taking adjacent, contiguous bites of your data and moving it piecemeal (a sketch appears at the end of this post). Another approach would be to make sure we use more or most of what we load, when we load it. We could do this by changing Type to a larger quantity, such as int. Now we see the expected situation where the coalesced kernel is faster:

$ cat t1888.cu
#include <iostream>
typedef int Type;

__global__ void kernel_c(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}


__global__ void kernel_nc(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(){
  typedef uint3 st;
  const int data_width=20480;
  const int data_height=1024;
  dim3 block(32,16);
  dim3 grid((data_width+block.x-1)/block.x , (data_height+block.y-1)/block.y);
  st *src, *dst;
  Type *h1, *h2;
  h1 = new Type[3*data_width*data_height];
  h2 = new Type[3*data_width*data_height];
  cudaMalloc(&src, data_width*data_height*sizeof(st));
  cudaMalloc(&dst, data_width*data_height*sizeof(st));
  cudaMemset(src, 1, 3*data_width*data_height*sizeof(Type));
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h1, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != 0x01010101) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
  kernel_c<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemset(dst, 0, 3*data_width*data_height*sizeof(Type));
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaMemcpy(h2, dst, 3*data_width*data_height*sizeof(Type), cudaMemcpyDeviceToHost);
  kernel_nc<<<grid, block>>>(src, dst, data_height, data_width);
  cudaDeviceSynchronize();
  for (int i = 0; i < data_width*data_height*3; i++) if (h1[i] != h2[i]) {std::cout << "mismatch at: " << i << " " << h1[i] << " " << h2[i] << std::endl; return 0;}
}
$ nvcc -o t1888 t1888.cu
$ nvprof ./t1888
==21028== NVPROF is profiling process 21028, command: ./t1888
==21028== Profiling application: ./t1888
==21028== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.20%  220.37ms         2  110.19ms  109.81ms  110.56ms  [CUDA memcpy DtoH]
                    0.82%  1.8503ms         2  925.14us  920.92us  929.37us  kernel_nc(void*, void*, int, int)
                    0.59%  1.3267ms         2  663.36us  654.80us  671.92us  kernel_c(void*, void*, int, int)
                    0.38%  861.27us         3  287.09us  286.06us  288.01us  [CUDA memset]
      API calls:   59.13%  339.34ms         2  169.67ms  525.83us  338.82ms  cudaMalloc
                   39.26%  225.34ms         2  112.67ms  111.98ms  113.35ms  cudaMemcpy
                    0.88%  5.0669ms         4  1.2667ms  592.84us  3.2554ms  cuDeviceTotalMem
                    0.44%  2.5009ms       404  6.1900us     383ns  273.02us  cuDeviceGetAttribute
                    0.16%  943.76us         1  943.76us  943.76us  943.76us  cudaDeviceSynchronize
                    0.07%  424.71us         4  106.18us  60.600us  241.57us  cuDeviceGetName
                    0.02%  142.66us         4  35.665us  11.589us  83.135us  cudaLaunchKernel
                    0.02%  116.24us         3  38.746us  12.512us  71.497us  cudaMemset
                    0.00%  21.630us         4  5.4070us  2.7090us  12.057us  cuDeviceGetPCIBusId
                    0.00%  11.957us         8  1.4940us     427ns  5.4300us  cuDeviceGet
                    0.00%  3.1100us         4     777ns     670ns     957ns  cuDeviceGetUuid
                    0.00%  2.6250us         3     875ns     422ns  1.2180us  cuDeviceGetCount
$

By switching from one byte per thread to 4 bytes per thread, we have caused the coalesced kernel to load 128 bytes (exactly two L2 cachelines, let's pretend) and process all 128 of them. That strikes me as better, and the numbers seem to support it.
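
To put some simple numbers on that, here is a little host-only sketch (my own back-of-the-envelope arithmetic, assuming 32-byte sectors and nicely aligned warps) of the per-warp footprint of a single load or store instruction for the two access patterns and the two element sizes:

#include <cstdio>

// Per-warp footprint of one memory instruction: 32 consecutive threads, each
// accessing one element of elem_bytes, with a stride of stride_elems elements
// between adjacent threads (aligned case).
static void footprint(const char *name, int elem_bytes, int stride_elems)
{
    const int threads = 32;
    const int used    = threads * elem_bytes;                  // bytes the warp actually needs
    const int spanned = threads * stride_elems * elem_bytes;   // bytes covered by the warp's accesses
    const int sectors = (spanned + 31) / 32;                   // 32-byte sectors touched
    std::printf("%-18s elem=%dB stride=%d -> %3d bytes used, %2d sectors, %5.1f%% of fetched bytes used\n",
                name, elem_bytes, stride_elems, used, sectors, 100.0 * used / (sectors * 32));
}

int main()
{
    footprint("coalesced, uchar",  1, 1);   //  32 bytes,  1 sector,  100%
    footprint("strided x3, uchar", 1, 3);   //  32 bytes,  3 sectors, ~33%
    footprint("coalesced, int",    4, 1);   // 128 bytes,  4 sectors, 100%
    footprint("strided x3, int",   4, 3);   // 128 bytes, 12 sectors, ~33%
    return 0;
}

For what it's worth, if the ncu store metrics earlier in the thread were collected with a 4-byte Type, they line up with this arithmetic exactly: 65,536 warps x 3 store instructions x 4 sectors = 786,432 sectors for the coalesced kernel, 65,536 x 3 x 12 = 2,359,296 sectors for the strided one, and 33.33% of each sector's bytes actually written in the strided case.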

But rather than try and work real hard on this, my advice would be to dump this bizarre code and indexing, and copy things simply, linearly, contiguously.
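
For example, a grid-stride copy could look something like this (a minimal sketch of my own, not code from this thread, and not tuned or benchmarked against the cases above):

// Grid-stride 1D copy: adjacent threads copy adjacent elements, and each thread
// then jumps forward by the whole grid, so every warp's accesses are contiguous
// and every fetched cache line is fully used.
__global__ void copy_kernel(const Type * __restrict__ src, Type * __restrict__ dst, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// launched with an arbitrary 1D configuration, e.g.:
//   copy_kernel<<<160, 256>>>(src, dst, (size_t)data_width * data_height * 3);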

I am new to CUDA programming and do not have a very deep understanding of computer architecture. I ran this experiment mainly to verify code optimized for coalesced memory access, but the coalesced kernel did not perform as well as the non-coalesced one, which deviates from my expectation: I thought coalesced access should show better performance. So I made a series of guesses and wondered whether the result might be caused by some automatic optimization strategy of the cache.
According to your test with the int type, the coalesced version of the kernel executes faster than the non-coalesced version, but my results still show the non-coalesced version being faster. Why is this? I still don't quite understand how the L1 and L2 caches work on the Turing architecture. In my understanding, coalesced memory access should give higher performance and shorter kernel execution time, so why do my results show the non-coalesced version of the kernel executing in less time?

non-coalesced code and my nvprof result:
#define Type float
inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f) { f = (rand() % 255) / 255.f; });
}

__global__ void kernel(void *src, void *dst, int height, int width, int size)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(int argv, char **argc)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);

    kernel<<<grid, block, 0, 0>>>(src, dst, height, width, size);
    cudaDeviceSynchronize();
    CUDA_CHECK(cudaGetLastError());

    std::vector<Type> cuda_res(size * sizeof(Type));
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    CUDA_CHECK_AND_FREE(src);
    CUDA_CHECK_AND_FREE(dst);
    return 0;
}

error rate = 0%
==23856== Profiling application: ./8-ori_store/oriVersionDemo
==23856== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.54%  42.248ms         1  42.248ms  42.248ms  42.248ms  [CUDA memcpy HtoD]
                   47.34%  38.813ms         1  38.813ms  38.813ms  38.813ms  [CUDA memcpy DtoH]
                    1.12%  918.90us         1  918.90us  918.90us  918.90us  kernel(void*, void*, int, int, int)
      API calls:   75.23%  260.86ms         2  130.43ms  397.31us  260.46ms  cudaMalloc
                   23.40%  81.147ms         2  40.573ms  38.936ms  42.211ms  cudaMemcpy
                    0.82%  2.8493ms         2  1.4246ms  450.06us  2.3992ms  cudaFree
                    0.31%  1.0767ms         1  1.0767ms  1.0767ms  1.0767ms  cudaDeviceSynchronize
                    0.16%  556.22us         1  556.22us  556.22us  556.22us  cuDeviceTotalMem
                    0.05%  169.61us       101  1.6790us     195ns  68.500us  cuDeviceGetAttribute
                    0.01%  50.086us         1  50.086us  50.086us  50.086us  cudaLaunchKernel
                    0.01%  29.827us         1  29.827us  29.827us  29.827us  cuDeviceGetName
                    0.00%  6.6800us         1  6.6800us  6.6800us  6.6800us  cuDeviceGetPCIBusId
                    0.00%  1.4900us         3     496ns     231ns     990ns  cuDeviceGetCount
                    0.00%  1.0370us         2     518ns     225ns     812ns  cuDeviceGet
                    0.00%     405ns         1     405ns     405ns     405ns  cuDeviceGetUuid
                    0.00%     400ns         1     400ns     400ns     400ns  cudaGetLastError
coalesced code and nvprof result:
#define Type float
inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f) { f = (rand() % 255) / 255.f; });
}

__global__ void kernel(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}

int main(int argv, char **argc)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);
    kernel<<<grid, block, 0, 0>>>(src, dst, height, width);
    cudaDeviceSynchronize();

    std::vector<Type> cuda_res(size * sizeof(Type));
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    CUDA_CHECK_AND_FREE(src);
    CUDA_CHECK_AND_FREE(dst);

    return 0;
}

error rate = 0%
==23997== Profiling application: ./9-coalesced_store/coalesceVersionDemo
==23997== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.33%  42.447ms         1  42.447ms  42.447ms  42.447ms  [CUDA memcpy HtoD]
                   47.15%  38.993ms         1  38.993ms  38.993ms  38.993ms  [CUDA memcpy DtoH]
                    1.52%  1.2535ms         1  1.2535ms  1.2535ms  1.2535ms  kernel(void*, void*, int, int)
      API calls:   72.97%  233.81ms         2  116.90ms  296.09us  233.51ms  cudaMalloc
                   25.43%  81.487ms         2  40.744ms  39.124ms  42.364ms  cudaMemcpy
                    0.90%  2.8815ms         2  1.4408ms  480.58us  2.4009ms  cudaFree
                    0.45%  1.4480ms         1  1.4480ms  1.4480ms  1.4480ms  cudaDeviceSynchronize
                    0.17%  552.18us         1  552.18us  552.18us  552.18us  cuDeviceTotalMem
                    0.05%  173.64us       101  1.7190us     189ns  70.564us  cuDeviceGetAttribute
                    0.01%  36.519us         1  36.519us  36.519us  36.519us  cudaLaunchKernel
                    0.01%  29.888us         1  29.888us  29.888us  29.888us  cuDeviceGetName
                    0.00%  5.5120us         1  5.5120us  5.5120us  5.5120us  cuDeviceGetPCIBusId
                    0.00%  1.6540us         3     551ns     270ns  1.0780us  cuDeviceGetCount
                    0.00%  1.1280us         2     564ns     222ns     906ns  cuDeviceGet
                    0.00%     363ns         1     363ns     363ns     363ns  cuDeviceGetUuid

As far as I know, for devices of compute capability 6.0 and later, the memory access granularity is 32 bytes whether L1 caching is enabled or not. So shouldn't there be no difference between the unsigned char type and the float/int type?

My doubt is whether the Turing architecture's L1 and L2 caches have their own optimizations that the non-coalesced access pattern happens to satisfy. For writes, my understanding is that data is written through L1 and written back from L2. Since the non-coalesced version of the kernel issues multiple writes to the same cache lines, does the hardware merge them in the L1/L2 cache and only then write the merged data to DRAM? Otherwise, why is the performance after the coalescing optimization not as good as the performance before it?

I don’t have access to a 2080 Ti (which is TU102, I believe). However, when I run your codes on a T4 (also Turing, TU104), I see that the non-coalesced code is slower than the coalesced code:

$ cat t1889nc.cu
#include <iostream>
#include <vector>
#include <algorithm>
#define CUDA_CHECK(x) x

#define Type float
inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f) { f = (rand() % 255) / 255.f; });
}

__global__ void kernel(void *src, void *dst, int height, int width, int size)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(int argv, char **argc)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);

    kernel<<<grid, block, 0, 0>>>(src, dst, height, width, size);
    cudaDeviceSynchronize();

    std::vector<Type> cuda_res(size * sizeof(Type));
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    return 0;
}
$ cat t1889c.cu
#include <iostream>
#include <vector>
#include <algorithm>
#define CUDA_CHECK(x) x
#define Type float
inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f) { f = (rand() % 255) / 255.f; });
}

__global__ void kernel(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}

int main(int argv, char **argc)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);
    kernel<<<grid, block, 0, 0>>>(src, dst, height, width);
    cudaDeviceSynchronize();

    std::vector<Type> cuda_res(size * sizeof(Type));
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    return 0;
}
$ nvcc -o t1889c t1889c.cu -arch=sm_75 -std=c++11
$ nvcc -o t1889nc t1889nc.cu -arch=sm_75 -std=c++11
$ nvprof ./t1889c
==31490== NVPROF is profiling process 31490, command: ./t1889c
error rate = 0%
==31490== Profiling application: ./t1889c
==31490== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   58.95%  106.54ms         1  106.54ms  106.54ms  106.54ms  [CUDA memcpy HtoD]
                   39.77%  71.882ms         1  71.882ms  71.882ms  71.882ms  [CUDA memcpy DtoH]
                    1.29%  2.3231ms         1  2.3231ms  2.3231ms  2.3231ms  kernel(void*, void*, int, int)
      API calls:   54.01%  216.42ms         2  108.21ms  398.37us  216.02ms  cudaMalloc
                   44.98%  180.22ms         2  90.112ms  72.503ms  107.72ms  cudaMemcpy
                    0.58%  2.3236ms         1  2.3236ms  2.3236ms  2.3236ms  cudaDeviceSynchronize
                    0.27%  1.0630ms         2  531.51us  515.18us  547.84us  cuDeviceTotalMem
                    0.13%  520.28us       202  2.5750us     139ns  121.13us  cuDeviceGetAttribute
                    0.02%  84.953us         1  84.953us  84.953us  84.953us  cudaLaunchKernel
                    0.01%  54.940us         2  27.470us  23.272us  31.668us  cuDeviceGetName
                    0.00%  8.3150us         2  4.1570us  1.8860us  6.4290us  cuDeviceGetPCIBusId
                    0.00%  1.3110us         4     327ns     166ns     723ns  cuDeviceGet
                    0.00%  1.0860us         3     362ns     212ns     611ns  cuDeviceGetCount
                    0.00%     457ns         2     228ns     197ns     260ns  cuDeviceGetUuid
$ nvprof ./t1889nc
==31590== NVPROF is profiling process 31590, command: ./t1889nc
error rate = 0%
==31590== Profiling application: ./t1889nc
==31590== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   58.18%  104.47ms         1  104.47ms  104.47ms  104.47ms  [CUDA memcpy HtoD]
                   40.17%  72.132ms         1  72.132ms  72.132ms  72.132ms  [CUDA memcpy DtoH]
                    1.64%  2.9509ms         1  2.9509ms  2.9509ms  2.9509ms  kernel(void*, void*, int, int, int)
      API calls:   53.99%  215.23ms         2  107.61ms  400.73us  214.83ms  cudaMalloc
                   44.84%  178.75ms         2  89.374ms  72.714ms  106.03ms  cudaMemcpy
                    0.74%  2.9514ms         1  2.9514ms  2.9514ms  2.9514ms  cudaDeviceSynchronize
                    0.26%  1.0307ms         2  515.36us  514.71us  516.00us  cuDeviceTotalMem
                    0.13%  509.26us       202  2.5210us     135ns  114.79us  cuDeviceGetAttribute
                    0.02%  78.643us         1  78.643us  78.643us  78.643us  cudaLaunchKernel
                    0.02%  63.666us         2  31.833us  23.380us  40.286us  cuDeviceGetName
                    0.00%  8.5400us         2  4.2700us  1.7950us  6.7450us  cuDeviceGetPCIBusId
                    0.00%  1.2740us         4     318ns     144ns     733ns  cuDeviceGet
                    0.00%  1.0930us         3     364ns     204ns     636ns  cuDeviceGetCount
                    0.00%     491ns         2     245ns     207ns     284ns  cuDeviceGetUuid
$ nvidia-smi
Fri Sep 10 08:08:01 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:04:00.0 Off |                    0 |
| N/A   29C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:05:00.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
$

So I am unable to explain your observation.

I tried this code on a T4 graphics card as well, but I still find that the coalesced version of the memory access is not faster than the non-coalesced version. Can you help me explain this phenomenon?

non-coalesced code:

#include <algorithm>
#include <chrono>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>

#define CUDA_CHECK(status)                                                    \
    do                                                                        \
    {                                                                         \
        auto ret = (status);                                                  \
        if (ret != 0)                                                         \
        {                                                                     \
            throw std::runtime_error("cuda failure: " + std::to_string(ret) + \
                                     " (" + cudaGetErrorString(ret) + ")" +   \
                                     " at " + __FILE__ + ":" +                \
                                     std::to_string(__LINE__));               \
        }                                                                     \
    } while (0)

#define CUDAOP_CHECK_CUDA_SATUS(status)                                      \
    do                                                                       \
    {                                                                        \
        auto rst = status;                                                   \
        if ((rst) != cudaSuccess)                                            \
        {                                                                    \
            throw std::runtime_error("cuda err: " +                          \
                                     std::to_string(static_cast<int>(rst)) + \
                                     " (" + cudaGetErrorString(rst) + ")" +  \
                                     " at " + __FILE__ + ":" +               \
                                     std::to_string(__LINE__));              \
        }                                                                    \
    } while (0)

#define CUDA_CHECK_AND_FREE(device_ptr) \
    do                                  \
    {                                   \
        if (device_ptr)                 \
        {                               \
            cudaFree(device_ptr);       \
        }                               \
    } while (0)

class Clock
{
public:
    Clock() = delete;

    Clock(std::string event) : _event(event)
    {
        _start_time = std::chrono::high_resolution_clock::now();
    }

    ~Clock()
    {
        std::cout << _event << formatTime() << std::endl;
    };

    double DurationMs()
    {
        using namespace std::chrono;
        typedef duration<double, std::ratio<1, 1000>> milliSecond;
        milliSecond duration_ms =
            duration_cast<milliSecond>(high_resolution_clock::now() - _start_time);
        return duration_ms.count();
    };

private:
    std::string formatTime()
    {
        auto ms = DurationMs();
        if (ms < 1.0)
        {
            return std::to_string(ms * 1000.0) + "μs";
        }
        if (ms > 1000.0)
        {
            return std::to_string(ms / 1000.0) + "s";
        }
        return std::to_string(ms) + "ms";
    }

    std::string _event;
    std::chrono::high_resolution_clock::time_point _start_time;
};

#define Type float

inline void RandomInt8Vector(std::vector<unsigned char> &vec_uc)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_uc.begin(), vec_uc.end(), [](unsigned char &uc)
                  { uc = rand() % 255; });
}

inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f)
                  { f = (rand() % 255) / 255.f; });
}

// Non-coalesced copy: each thread handles 3 consecutive floats (index*3 .. index*3+2),
// so adjacent threads in a warp start 12 bytes apart and the accesses are strided.
__global__ void kernel(void *src, void *dst, int height, int width, int size)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    int index = threadIdx.x + block_top_left_x + (threadIdx.y + block_top_left_y) * width;
    for (int i = 0; i < 3; i++)
    {
        static_cast<Type *>(dst)[index * 3 + i] = static_cast<Type *>(src)[index * 3 + i];
    }
}

int main(int argc, char **argv)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);

    for (int i = 0; i < 10; i++)
    {
        Clock clk("ori test ");
        kernel<<<grid, block, 0, 0>>>(src, dst, height, width, size);
        cudaDeviceSynchronize();
    }
    CUDAOP_CHECK_CUDA_SATUS(cudaGetLastError());

    std::vector<Type> cuda_res(size); // size elements, not bytes
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    CUDA_CHECK_AND_FREE(src);
    CUDA_CHECK_AND_FREE(dst);
    return 0;
}
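
The Clock above measures host wall-clock time around the launch plus cudaDeviceSynchronize. A device-side alternative that times only the kernel, shown here as a minimal sketch that assumes the kernel, CUDA_CHECK macro, buffers and launch configuration from the listing above, could look like this:

float TimeKernelMs(void *src, void *dst, int height, int width, int size,
                   dim3 grid, dim3 block)
{
    // Record one event before and one after the launch on the default stream.
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start, 0));
    kernel<<<grid, block, 0, 0>>>(src, dst, height, width, size);
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));

    // Elapsed GPU time between the two events, in milliseconds.
    float ms = 0.f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}

Calling this inside the loop in main, in place of the Clock block, would report per-launch kernel time without any host-side timing overhead.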

coalesced code:

#include <algorithm>
#include <chrono>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>

#define CUDA_CHECK(status)                                                    \
    do                                                                        \
    {                                                                         \
        auto ret = (status);                                                  \
        if (ret != 0)                                                         \
        {                                                                     \
            throw std::runtime_error("cuda failure: " + std::to_string(ret) + \
                                     " (" + cudaGetErrorString(ret) + ")" +   \
                                     " at " + __FILE__ + ":" +                \
                                     std::to_string(__LINE__));               \
        }                                                                     \
    } while (0)

#define CUDAOP_CHECK_CUDA_SATUS(status)                                      \
    do                                                                       \
    {                                                                        \
        auto rst = status;                                                   \
        if ((rst) != cudaSuccess)                                            \
        {                                                                    \
            throw std::runtime_error("cuda err: " +                          \
                                     std::to_string(static_cast<int>(rst)) + \
                                     " (" + cudaGetErrorString(rst) + ")" +  \
                                     " at " + __FILE__ + ":" +               \
                                     std::to_string(__LINE__));              \
        }                                                                    \
    } while (0)

#define CUDA_CHECK_AND_FREE(device_ptr) \
    do                                  \
    {                                   \
        if (device_ptr)                 \
        {                               \
            cudaFree(device_ptr);       \
        }                               \
    } while (0)

class Clock
{
public:
    Clock() = delete;

    Clock(std::string event) : _event(event)
    {
        _start_time = std::chrono::high_resolution_clock::now();
    }

    ~Clock()
    {
        std::cout << _event << formatTime() << std::endl;
    };

    double DurationMs()
    {
        using namespace std::chrono;
        typedef duration<double, std::ratio<1, 1000>> milliSecond;
        milliSecond duration_ms =
            duration_cast<milliSecond>(high_resolution_clock::now() - _start_time);
        return duration_ms.count();
    };

private:
    std::string formatTime()
    {
        auto ms = DurationMs();
        if (ms < 1.0)
        {
            return std::to_string(ms * 1000.0) + "μs";
        }
        if (ms > 1000.0)
        {
            return std::to_string(ms / 1000.0) + "s";
        }
        return std::to_string(ms) + "ms";
    }

    std::string _event;
    std::chrono::high_resolution_clock::time_point _start_time;
};

#define Type float
inline void RandomInt8Vector(std::vector<unsigned char> &vec_uc)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_uc.begin(), vec_uc.end(), [](unsigned char &uc)
                  { uc = rand() % 255; });
}

inline void RandomFloatVector(std::vector<float> &vec_f)
{
    srand((unsigned int)(time(NULL)));
    std::for_each(vec_f.begin(), vec_f.end(), [](float &f)
                  { f = (rand() % 255) / 255.f; });
}

__global__ void kernel(void *src, void *dst, int height, int width)
{
    int block_top_left_x = blockIdx.x * blockDim.x;
    int block_top_left_y = blockIdx.y * blockDim.y;

    for (int i = 0; i < 3; i++)
    {
        int index = threadIdx.x + block_top_left_x * 3 + i * blockDim.x + (threadIdx.y + block_top_left_y) * width * 3;
        static_cast<Type *>(dst)[index] = static_cast<Type *>(src)[index];
    }
}

int main(int argc, char **argv)
{
    void *src, *dst;
    int height = 1024, width = 20480;
    int size = height * width * 3;
    std::vector<Type> input(size);
    RandomFloatVector(input);
    CUDA_CHECK(cudaMalloc(&src, size * sizeof(Type)));
    CUDA_CHECK(cudaMalloc(&dst, size * sizeof(Type)));
    CUDA_CHECK(cudaMemcpy(src, input.data(), size * sizeof(Type), cudaMemcpyHostToDevice));

    dim3 block(32, 16);
    int grid_x = (width + block.x - 1) / block.x;
    int grid_y = (height + block.y - 1) / block.y;
    dim3 grid(grid_x, grid_y);
    for (int i = 0; i < 10; i++)
    {
        Clock clk("coalesced test ");
        kernel<<<grid, block, 0, 0>>>(src, dst, height, width);
        cudaDeviceSynchronize();
    }

    std::vector<Type> cuda_res(size); // size elements, not bytes
    CUDA_CHECK(cudaMemcpy(cuda_res.data(), dst, size * sizeof(Type), cudaMemcpyDeviceToHost));

    int cnt = 0;
    for (int i = 0; i < size; i++)
    {
        if (cuda_res[i] != input[i])
        {
            cnt++;
        }
    }

    std::cout << "error rate = " << (float)cnt / size * 100 << "%" << std::endl;

    CUDA_CHECK_AND_FREE(src);
    CUDA_CHECK_AND_FREE(dst);

    return 0;
}

test result:

1024*20480
//ori version
ori test 1.018737ms
ori test 942.434000μs
ori test 941.261000μs
ori test 929.056000μs
ori test 932.231000μs
ori test 929.277000μs
ori test 929.843000μs
ori test 930.698000μs
ori test 929.305000μs
ori test 929.050000μs
    
//coalesced version
coalesced test 1.020968ms
coalesced test 951.732000μs
coalesced test 949.066000μs
coalesced test 954.971000μs
coalesced test 949.890000μs
coalesced test 950.264000μs
coalesced test 950.621000μs
coalesced test 950.881000μs
coalesced test 951.645000μs
coalesced test 948.397000μs

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 26%   50C    P0    19W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I think it's doubtful that you are actually running on a T4; the nvidia-smi output you posted lists a GeForce RTX card, not a T4. When I run your code on a T4, I get execution times in the range of 2.3-3.0 milliseconds, while you are reporting ~1.0 milliseconds. So something doesn't add up.
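
One way to confirm which device is actually in use, as a minimal standalone sketch (not part of either posted program), is to query the runtime directly; the theoretical peak bandwidth can be derived from the reported memory clock and bus width:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaGetDevice(&dev);                      // device the runtime would launch on
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // memoryClockRate is in kHz, memoryBusWidth in bits; factor 2 for DDR-style memory.
    double peak_gbs = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;

    printf("device %d: %s, ~%.0f GB/s theoretical peak memory bandwidth\n",
           dev, prop.name, peak_gbs);
    return 0;
}

On a T4 this reports roughly 320 GB/s; a GeForce RTX 2080 Ti reports roughly 616 GB/s.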

Here is my output from the code you have now posted, run on the same T4 setup that I previously reported and gave the full test case for:

$ ./t1890c
coalesced test 2.405249ms
coalesced test 2.332734ms
coalesced test 2.331415ms
coalesced test 2.329912ms
coalesced test 2.329931ms
coalesced test 2.330216ms
coalesced test 2.329841ms
coalesced test 2.330183ms
coalesced test 2.329539ms
coalesced test 2.330791ms
error rate = 0%
$ ./t1890nc
ori test 3.040014ms
ori test 2.979036ms
ori test 2.977041ms
ori test 2.978522ms
ori test 2.974907ms
ori test 2.975375ms
ori test 2.976410ms
ori test 2.977983ms
ori test 2.979658ms
ori test 2.978092ms
error rate = 0%
$

The coalesced test is faster.
I am unable to explain your observation, and it's unlikely that I will be able to respond to further requests here. As I've already indicated, I doubt some of your claims are correct, in particular that these results come from running this code on a T4. The T4 GPU has a peak theoretical memory bandwidth of 320 GB/s.

The total amount of data being touched here (summing reads and writes) is:

1024 * 20480 * 3 * 4 * 2 bytes (height * width * 3 floats * 4 bytes, read + write) = ~500MB

Dividing that by 320 GB/s (which is not actually achievable in practice) gives 1.57 ms as a lower bound. So your claim that you are running this code on a T4 GPU and getting ~1.0 ms execution time is not plausible to me: it would imply a sustained memory bandwidth of ~500 GB/s, and that is not possible on a T4.
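
For concreteness, here is the same arithmetic as a small standalone sketch (1.0 ms is the reported timing, 320 GB/s the T4 peak quoted above):

#include <cstdio>

int main()
{
    const double bytes   = 1024.0 * 20480 * 3 * 4 * 2; // read + write of the float image
    const double t4_peak = 320e9;                       // T4 theoretical peak, bytes/s

    printf("data moved            : %.0f MB\n", bytes / 1e6);           // ~503 MB
    printf("lower bound at 320GB/s: %.2f ms\n", bytes / t4_peak * 1e3);  // ~1.57 ms
    printf("implied BW at 1.0 ms  : %.0f GB/s\n", bytes / 1.0e-3 / 1e9); // ~503 GB/s
    return 0;
}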