Transposing the grid, but not its thread-blocks, results in slower performance, why ?

Dear Community,

I am bringing my stack overflow question to this forum in hope of some insights, or at least some directions where to look.

I have come to a GPU computing situation in which require to transpose a CUDA grid. So, if block_{x,y} originally acted on data region d_{x,y}, now it acts on data region d_{y,x}, therefore block_{y,x} would act on data region d_{x,y}. An example is presented in the following figure.

https://ibb.co/hxF0Nc

It is worth mentioning that threads are not transposed inside each block, that is, once the block is located, the threadIdx.x and threadIdx.y values are used in a normal way for their x and y offsets, respectively.

From what I know, in theory this design should do no harm in performance, as the memory coalescing pattern is still preserved, i.e., threads inside a block are not transposed, it is just the grid that re-arranged its blocks. However I found that when transposing the grid, the kernel runs approx. 2X slower than in the normal case. I made a toy example to illustrate the situation.

➜  transpose-grid ./prog 10000 10000 100 0
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
normal_kernel (100 rep).......done: 0.935132 ms
verifying correctness.........ok
➜  transpose-grid ./prog 10000 10000 100 1
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
transp_kernel (100 rep).......done: 1.980445 ms
verifying correctness.........ok

The Nvidia CUDA profiler showed that the transposed version had 2X slower L2 Cache performance (MB/s) than the normal case. Source code to reproduce the case:

// -----------------------------------
 // can compile as nvcc main.cu -o prog
 // -----------------------------------

 #include <cuda.h>
 #include <cstdio>

 #define BSIZE2D 32

 __global__ void normal_kernel(int *dmat, const int m, const int n){
     const int i = blockIdx.y*blockDim.y + threadIdx.y;
     const int j = blockIdx.x*blockDim.x + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }

 __global__ void transp_kernel(int *dmat, const int m, const int n){
     const int i = blockIdx.x*blockDim.x + threadIdx.y;
     const int j = blockIdx.y*blockDim.y + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }

int verify(int *hmat, const int m, const int n){
     printf("verifying correctness........."); fflush(stdout);
     for(int i=0; i<m*n; ++i){
         if(hmat[i] != 1){
             fprintf(stderr, "Incorrect value at m[%i,%i] = %i\n", i/n, i%n);
             return 0;
         }
     }
     printf("ok\n"); fflush(stdout);
     return 1;
 }
 int main(int argc, char **argv){
     if(argc != 5){
         printf("\nrun as ./prog m n r t\n\nr = number of repeats\nt = transpose (1 or 0)\n");
         exit(EXIT_FAILURE);
     }
     const int m = atoi(argv[1]);
     const int n = atoi(argv[2]);
     const int r = atoi(argv[3]);
     const int t = atoi(argv[4]);
     const unsigned int size = m*n;
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     float time;
     int *hmat, *dmat;

printf("init data....................."); fflush(stdout);
     hmat = (int*)malloc(sizeof(int)*(size));
     for(int i=0; i<size; ++i){
         hmat[i] = 0;
     }
     printf("done: zero matrix of %i rows x %i cols\n", m, n);

printf("copy data to GPU.............."); fflush(stdout);
     cudaMalloc(&dmat, sizeof(int)*(size));
     cudaMemcpy(dmat, hmat, sizeof(int)*(size), cudaMemcpyHostToDevice);
     printf("done\n");

printf("preparing grid................"); fflush(stdout);
     dim3 block(BSIZE2D, BSIZE2D, 1);
     dim3 grid;
     // if transpose or not
     if(t){
         grid = dim3((m + BSIZE2D - 1)/BSIZE2D, (n + BSIZE2D - 1)/BSIZE2D, 1);
     }
     else{
         grid = dim3((n + BSIZE2D - 1)/BSIZE2D, (m + BSIZE2D - 1)/BSIZE2D, 1);
     }
     printf("done: block(%i, %i, %i), grid(%i, %i, %i)\n", block.x, block.y, block.z, grid.x, grid.y, grid.z);

if(t){
         printf("transp_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             transp_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }
     else{
         printf("normal_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             normal_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }

cudaMemcpy(hmat, dmat, sizeof(int)*size, cudaMemcpyDeviceToHost);
     verify(hmat, m, n);
     exit(EXIT_SUCCESS);
 }
  1. check that dmat address is aligned to 128 bytes
  2. it may be due to TLB translation - transposed code touches only 128 bytes from each 4KB page and then goes (in next warp) to another page. check speeds with m=n=3200

Dear BulatZiganshin,
I have tried with 3200, and it is around 30% slower

➜  transpose-grid ./prog 3200 3200 100 0
init data.....................done: zero matrix of 3200 rows x 3200 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(100, 100, 1)
normal_kernel (100 rep).......done: 0.105268 ms
verifying correctness.........ok
➜  transpose-grid ./prog 3200 3200 100 1
init data.....................done: zero matrix of 3200 rows x 3200 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(100, 100, 1)
transp_kernel (100 rep).......done: 0.133116 ms
verifying correctness.........ok

and 32000 as well, where it is around 60% slower

init data.....................done: zero matrix of 32000 rows x 32000 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(1000, 1000, 1)
normal_kernel (100 rep).......done: 9.499826 ms
verifying correctness.........ok
➜  transpose-grid ./prog 32000 32000 100 1
init data.....................done: zero matrix of 32000 rows x 32000 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(1000, 1000, 1)
transp_kernel (100 rep).......done: 15.043217 ms
verifying correctness.........ok

Do you know of any document detailing more about L2 Cache for the Nvidia GPU?
Regarding the TLB translation aspect, I still can’t see how the 128-byte touch-only aspect can be different from the normal case, as it is a block that could be anywhere, all of its threads inside are aligned, coalesced and packed into 32 warps with no gaps between threads, just offsets of “n” between warps, just like in the normal case.

I played around with this code a bit when you first posted it on SO. I observed about a 30% difference in execution time between the two cases, similar to your results. I also observed a few other things:

  1. If I ran the code with nvprof, the execution time difference disappeared.
  2. If I dropped below a certain size, the execution time difference disappeared.

I didn’t respond at the time because its just speculation, but similar to BulatZiganshin, I concluded that the transposed pattern was behaving differently in terms of global memory access. This seems obvious of course. I concluded it might be an L2 thrashing/not-thrashing effect. As BulatZiganshin states, it could be a TLB effect as well. I haven’t studied the sizes or access patterns closely.

Because, in my case, the problem “went away” when I tried to use nvprof, it did not seem to me that I was going to be able to make conclusive statements using just the profiler.

Thanks for looking into it txbob. I have also noted that the performance difference may vary with different GPU architectures (Kepler, Pascal, Volta). I remember Kepler took a lesser performance hit than Pascal, which is were i am testing now (Titan X (Pascal)).

My speculation is that maybe the CUDA scheduler “prefers” blocks along the X direction to map along the X direction in data as well. And the same for the Y direction. Maybe this is something that the L2 Cache is assuming in its prefetching/prediction policy? I hope this is not true as it would harm many possible algorithms.

EDIT: testing with nvprof I still got difference in performance, running on a Titan X (Pascal)

➜  transpose-grid nvprof ./prog 32000 32000 100 0
==5924== NVPROF is profiling process 5924, command: ./prog 32000 32000 100 0
init data.....................done: zero matrix of 32000 rows x 32000 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(1000, 1000, 1)
normal_kernel (100 rep).......done: 11.012840 ms
verifying correctness.........ok
==5924== Profiling application: ./prog 32000 32000 100 0
==5924== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.49%  1.09982s       100  10.998ms  10.943ms  12.564ms  normal_kernel(int*, int, int)
                   28.48%  688.48ms         1  688.48ms  688.48ms  688.48ms  [CUDA memcpy HtoD]
                   26.03%  629.46ms         1  629.46ms  629.46ms  629.46ms  [CUDA memcpy DtoH]
      API calls:   49.24%  1.31797s         2  658.99ms  629.52ms  688.45ms  cudaMemcpy
                   41.12%  1.10064s       100  11.006ms  10.946ms  12.567ms  cudaDeviceSynchronize
                    9.48%  253.86ms         2  126.93ms  1.5800us  253.86ms  cudaEventCreate
                    0.11%  2.9380ms         1  2.9380ms  2.9380ms  2.9380ms  cudaMalloc
                    0.03%  673.08us       100  6.7300us  5.6320us  48.481us  cudaLaunch
                    0.01%  299.81us        94  3.1890us     346ns  120.77us  cuDeviceGetAttribute
                    0.01%  195.72us         1  195.72us  195.72us  195.72us  cuDeviceTotalMem
                    0.01%  174.24us         1  174.24us  174.24us  174.24us  cuDeviceGetName
                    0.00%  36.074us       300     120ns      87ns  2.8830us  cudaSetupArgument
                    0.00%  17.434us       100     174ns     124ns  2.0600us  cudaConfigureCall
                    0.00%  13.329us         2  6.6640us  3.5250us  9.8040us  cudaEventRecord
                    0.00%  3.6380us         1  3.6380us  3.6380us  3.6380us  cudaEventSynchronize
                    0.00%  2.5000us         1  2.5000us  2.5000us  2.5000us  cudaEventElapsedTime
                    0.00%  2.3510us         3     783ns     329ns  1.5250us  cuDeviceGetCount
                    0.00%  1.3540us         2     677ns     375ns     979ns  cuDeviceGet
➜  transpose-grid nvprof ./prog 32000 32000 100 1
==5944== NVPROF is profiling process 5944, command: ./prog 32000 32000 100 1
init data.....................done: zero matrix of 32000 rows x 32000 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(1000, 1000, 1)
transp_kernel (100 rep).......done: 15.049839 ms
verifying correctness.........ok
==5944== Profiling application: ./prog 32000 32000 100 1
==5944== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   53.27%  1.50339s       100  15.034ms  14.963ms  16.129ms  transp_kernel(int*, int, int)
                   24.39%  688.49ms         1  688.49ms  688.49ms  688.49ms  [CUDA memcpy HtoD]
                   22.34%  630.45ms         1  630.45ms  630.45ms  630.45ms  [CUDA memcpy DtoH]
      API calls:   48.88%  1.50435s       100  15.044ms  14.965ms  16.329ms  cudaDeviceSynchronize
                   42.86%  1.31897s         2  659.49ms  630.54ms  688.43ms  cudaMemcpy
                    8.11%  249.63ms         2  124.82ms  2.1690us  249.63ms  cudaEventCreate
                    0.11%  3.2337ms         1  3.2337ms  3.2337ms  3.2337ms  cudaMalloc
                    0.02%  663.40us       100  6.6330us  5.4360us  49.308us  cudaLaunch
                    0.02%  463.84us        94  4.9340us     343ns  196.56us  cuDeviceGetAttribute
                    0.01%  300.25us         1  300.25us  300.25us  300.25us  cuDeviceTotalMem
                    0.00%  52.553us         1  52.553us  52.553us  52.553us  cuDeviceGetName
                    0.00%  39.390us       300     131ns      89ns  2.8100us  cudaSetupArgument
                    0.00%  18.949us       100     189ns     141ns  1.6050us  cudaConfigureCall
                    0.00%  13.394us         2  6.6970us  4.4790us  8.9150us  cudaEventRecord
                    0.00%  3.1620us         1  3.1620us  3.1620us  3.1620us  cudaEventSynchronize
                    0.00%  2.6810us         1  2.6810us  2.6810us  2.6810us  cudaEventElapsedTime
                    0.00%  2.5040us         3     834ns     344ns  1.6890us  cuDeviceGetCount
                    0.00%  1.4600us         2     730ns     356ns  1.1040us  cuDeviceGet

The L2 does no prediction or prefetching that I am aware of. Cachelines are loaded from DRAM as needed/directed by actual transactions emanating “upstream”, from the code. Likewise they are flushed according to whatever MESI-style policy is implemented.

However the L2 associativity (an unpublished specification, AFAIK) will affect the address patterns in which the L2 can accept new cachelines without being forced to flush other cachelines. And the overall L2 size will also affect when the L2 will be forced to flush other cachelines.

The two codes obviously have different memory access patterns. Therefore it might be the case that the memory access patterns created by the transposed arrangement interact either with the overall L2 size, or the L2 associativity (or a TLB effect) that create less favorable memory traffic patterns.

It’s really just speculation, and I feel it is kind of weak, because mentally I can already poke holes in it.

So it’s just discussion, not any sort of answer.

The tests that I ran where I observed the change with nvprof were on K20m and K40m. If you’re able to profile the difference in behavior, you may want to start comparing memory and cache metrics between the two cases with the profiler to get a better idea of what may be happening.

Testing for other non-square sizes, I can report that for Y-tall (1M x 32) and X-long (32 x 1M) matrices the problem does not manifest. Watch out when trying sizes greater than 1M on the Y dimension, it can exceed the grid limit.

I obtained the cache metrics and the L2 performs about 2X faster in the normal case.
Normal

Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "TITAN X (Pascal) (0)"
    Kernel: normal_kernel(int*, unsigned int, unsigned int)
          1                             inst_per_warp                                 Instructions per warp   25.000000   25.000000   25.000000
          1                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
          1         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      96.00%      96.00%      96.00%
          1                      inst_replay_overhead                           Instruction Replay Overhead    0.000064    0.000064    0.000064
          1      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1              gld_transactions_per_request                  Global Load Transactions Per Request    0.000000    0.000000    0.000000
          1              gst_transactions_per_request                 Global Store Transactions Per Request    4.000000    4.000000    4.000000
          1                 shared_store_transactions                             Shared Store Transactions           0           0           0
          1                  shared_load_transactions                              Shared Load Transactions           0           0           0
          1                   local_load_transactions                               Local Load Transactions           0           0           0
          1                  local_store_transactions                              Local Store Transactions           0           0           0
          1                          gld_transactions                              Global Load Transactions           0           0           0
          1                          gst_transactions                             Global Store Transactions   128000000   128000000   128000000
          1                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
          1                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
          1                      l2_read_transactions                                  L2 Read Transactions     1166146     1166146     1166146
          1                     l2_write_transactions                                 L2 Write Transactions   128000013   128000013   128000013
          1                           global_hit_rate                     Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
          1                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
          1                  gld_requested_throughput                      Requested Global Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                  gst_requested_throughput                     Requested Global Store Throughput  410.76GB/s  410.76GB/s  410.76GB/s
          1                            gld_throughput                                Global Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                            gst_throughput                               Global Store Throughput  410.76GB/s  410.76GB/s  410.76GB/s
          1                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.00%       0.00%
          1                        tex_cache_hit_rate                                Unified Cache Hit Rate      50.00%      50.00%      50.00%
          1                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
          1                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
          1                      tex_cache_throughput                              Unified Cache Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  0.00000B/s  0.00000B/s  0.00000B/s
          1                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  410.76GB/s  410.76GB/s  410.76GB/s
          1                        l2_read_throughput                                 L2 Throughput (Reads)  3.7423GB/s  3.7423GB/s  3.7423GB/s
          1                       l2_write_throughput                                L2 Throughput (Writes)  410.76GB/s  410.76GB/s  410.76GB/s
          1                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   sysmem_write_throughput                        System Memory Write Throughput  16.824KB/s  16.824KB/s  16.823KB/s
          1                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                            gld_efficiency                         Global Memory Load Efficiency       0.00%       0.00%       0.00%
          1                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                    tex_cache_transactions                            Unified Cache Transactions           0           0           0
          1                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
          1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
          1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
          1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
          1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
          1                             inst_executed                                 Instructions Executed   800000000   800000000   800000000
          1                               inst_issued                                   Instructions Issued   800051822   800051822   800051822
          1                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
          1                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       6.51%       6.51%       6.51%
          1                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      40.29%      40.29%      40.29%
          1                   stall_memory_dependency                    Issue Stall Reasons (Data Request)       0.00%       0.00%       0.00%
          1                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
          1                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
          1                               stall_other                           Issue Stall Reasons (Other)      27.62%      27.62%      27.62%
          1          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.01%       0.01%       0.01%
          1                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       2.22%       2.22%       2.22%
          1                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
          1                                inst_fp_32                               FP Instructions(Single)           0           0           0
          1                                inst_fp_64                               FP Instructions(Double)           0           0           0
          1                              inst_integer                                  Integer Instructions  1.4336e+10  1.4336e+10  1.4336e+10
          1                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
          1                              inst_control                             Control-Flow Instructions  1024000000  1024000000  1024000000
          1                        inst_compute_ld_st                               Load/Store Instructions  1024000000  1024000000  1024000000
          1                                 inst_misc                                     Misc Instructions  8192000000  8192000000  8192000000
          1           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
          1                               issue_slots                                           Issue Slots   800051822   800051822   800051822
          1                                 cf_issued                      Issued Control-Flow Instructions    64000000    64000000    64000000
          1                               cf_executed                    Executed Control-Flow Instructions    64000000    64000000    64000000
          1                               ldst_issued                        Issued Load/Store Instructions   256000000   256000000   256000000
          1                             ldst_executed                      Executed Load/Store Instructions   160000000   160000000   160000000
          1                       atomic_transactions                                   Atomic Transactions           0           0           0
          1           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
          1                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
          1                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)           0           0           0
          1                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)      16.39%      16.39%      16.39%
          1                        stall_not_selected                    Issue Stall Reasons (Not Selected)       6.95%       6.95%       6.95%
          1                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)   128000000   128000000   128000000
          1                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
          1                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
          1                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
          1                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
          1                                inst_fp_16                                 HP Instructions(Half)           0           0           0
          1                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
          1                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
          1                             sm_efficiency                               Multiprocessor Activity      98.65%      98.65%      98.65%
          1                        achieved_occupancy                                    Achieved Occupancy    0.510443    0.510443    0.510443
          1                                       ipc                                          Executed IPC    1.691812    1.691812    1.691812
          1                                issued_ipc                                            Issued IPC    1.722972    1.722972    1.722972
          1                    issue_slot_utilization                                Issue Slot Utilization      43.07%      43.07%      43.07%
          1                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    3.779947    3.779947    3.779947
          1                           tex_utilization                             Unified Cache Utilization    Idle (0)    Idle (0)    Idle (0)
          1                            l2_utilization                                  L2 Cache Utilization     Low (2)     Low (2)     Low (2)
          1                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
          1                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (2)     Low (2)     Low (2)
          1                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                    special_fu_utilization                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        tex_fu_utilization                     Texture Function Unit Utilization     Low (2)     Low (2)     Low (2)
          1           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (3)     Low (3)     Low (3)
          1           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
          1                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
          1                    dram_read_transactions                       Device Memory Read Transactions     1176158     1176158     1176158
          1                   dram_write_transactions                      Device Memory Write Transactions   127964313   127964313   127964313
          1                      dram_read_throughput                         Device Memory Read Throughput  3.7744GB/s  3.7744GB/s  3.7744GB/s
          1                     dram_write_throughput                        Device Memory Write Throughput  410.65GB/s  410.65GB/s  410.65GB/s
          1                          dram_utilization                             Device Memory Utilization    Max (10)    Max (10)    Max (10)
          1             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                          ecc_transactions                                      ECC Transactions           0           0           0
          1                            ecc_throughput                                        ECC Throughput  0.00000B/s  0.00000B/s  0.00000B/s

transposed

Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "TITAN X (Pascal) (0)"
    Kernel: transp_kernel(int*, unsigned int, unsigned int)
          1                             inst_per_warp                                 Instructions per warp   25.000000   25.000000   25.000000
          1                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
          1         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      96.00%      96.00%      96.00%
          1                      inst_replay_overhead                           Instruction Replay Overhead    0.000087    0.000087    0.000087
          1      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1              gld_transactions_per_request                  Global Load Transactions Per Request    0.000000    0.000000    0.000000
          1              gst_transactions_per_request                 Global Store Transactions Per Request    4.000000    4.000000    4.000000
          1                 shared_store_transactions                             Shared Store Transactions           0           0           0
          1                  shared_load_transactions                              Shared Load Transactions           0           0           0
          1                   local_load_transactions                               Local Load Transactions           0           0           0
          1                  local_store_transactions                              Local Store Transactions           0           0           0
          1                          gld_transactions                              Global Load Transactions           0           0           0
          1                          gst_transactions                             Global Store Transactions   128000000   128000000   128000000
          1                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
          1                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
          1                      l2_read_transactions                                  L2 Read Transactions     1770149     1770149     1770149
          1                     l2_write_transactions                                 L2 Write Transactions   128000013   128000013   128000013
          1                           global_hit_rate                     Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
          1                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
          1                  gld_requested_throughput                      Requested Global Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                  gst_requested_throughput                     Requested Global Store Throughput  255.57GB/s  255.57GB/s  255.57GB/s
          1                            gld_throughput                                Global Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                            gst_throughput                               Global Store Throughput  255.57GB/s  255.57GB/s  255.57GB/s
          1                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.00%       0.00%
          1                        tex_cache_hit_rate                                Unified Cache Hit Rate      50.00%      50.00%      50.00%
          1                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
          1                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
          1                      tex_cache_throughput                              Unified Cache Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  0.00000B/s  0.00000B/s  0.00000B/s
          1                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  255.57GB/s  255.57GB/s  255.57GB/s
          1                        l2_read_throughput                                 L2 Throughput (Reads)  3.5343GB/s  3.5343GB/s  3.5343GB/s
          1                       l2_write_throughput                                L2 Throughput (Writes)  255.57GB/s  255.57GB/s  255.57GB/s
          1                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   sysmem_write_throughput                        System Memory Write Throughput  10.468KB/s  10.468KB/s  10.467KB/s
          1                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                            gld_efficiency                         Global Memory Load Efficiency       0.00%       0.00%       0.00%
          1                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                    tex_cache_transactions                            Unified Cache Transactions           0           0           0
          1                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
          1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
          1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
          1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
          1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
          1                             inst_executed                                 Instructions Executed   800000000   800000000   800000000
          1                               inst_issued                                   Instructions Issued   800069693   800069693   800069693
          1                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
          1                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       4.33%       4.33%       4.33%
          1                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      26.25%      26.25%      26.25%
          1                   stall_memory_dependency                    Issue Stall Reasons (Data Request)       0.00%       0.00%       0.00%
          1                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
          1                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
          1                               stall_other                           Issue Stall Reasons (Other)      18.09%      18.09%      18.09%
          1          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.01%       0.01%       0.01%
          1                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       1.37%       1.37%       1.37%
          1                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
          1                                inst_fp_32                               FP Instructions(Single)           0           0           0
          1                                inst_fp_64                               FP Instructions(Double)           0           0           0
          1                              inst_integer                                  Integer Instructions  1.4336e+10  1.4336e+10  1.4336e+10
          1                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
          1                              inst_control                             Control-Flow Instructions  1024000000  1024000000  1024000000
          1                        inst_compute_ld_st                               Load/Store Instructions  1024000000  1024000000  1024000000
          1                                 inst_misc                                     Misc Instructions  8192000000  8192000000  8192000000
          1           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
          1                               issue_slots                                           Issue Slots   800069693   800069693   800069693
          1                                 cf_issued                      Issued Control-Flow Instructions    64000000    64000000    64000000
          1                               cf_executed                    Executed Control-Flow Instructions    64000000    64000000    64000000
          1                               ldst_issued                        Issued Load/Store Instructions   256000000   256000000   256000000
          1                             ldst_executed                      Executed Load/Store Instructions   160000000   160000000   160000000
          1                       atomic_transactions                                   Atomic Transactions           0           0           0
          1           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
          1                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
          1                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)           0           0           0
          1                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)      45.46%      45.46%      45.46%
          1                        stall_not_selected                    Issue Stall Reasons (Not Selected)       4.49%       4.49%       4.49%
          1                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)   128000000   128000000   128000000
          1                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
          1                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
          1                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
          1                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
          1                                inst_fp_16                                 HP Instructions(Half)           0           0           0
          1                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
          1                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
          1                             sm_efficiency                               Multiprocessor Activity      98.41%      98.41%      98.41%
          1                        achieved_occupancy                                    Achieved Occupancy    0.472508    0.472508    0.472508
          1                                       ipc                                          Executed IPC    1.073875    1.073875    1.073875
          1                                issued_ipc                                            Issued IPC    1.073968    1.073968    1.073968
          1                    issue_slot_utilization                                Issue Slot Utilization      26.85%      26.85%      26.85%
          1                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    2.348786    2.348786    2.348786
          1                           tex_utilization                             Unified Cache Utilization    Idle (0)    Idle (0)    Idle (0)
          1                            l2_utilization                                  L2 Cache Utilization     Low (2)     Low (2)     Low (2)
          1                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
          1                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                    special_fu_utilization                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        tex_fu_utilization                     Texture Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (2)     Low (2)     Low (2)
          1           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
          1                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
          1                    dram_read_transactions                       Device Memory Read Transactions     1847550     1847550     1847550
          1                   dram_write_transactions                      Device Memory Write Transactions   127940913   127940913   127940913
          1                      dram_read_throughput                         Device Memory Read Throughput  3.6889GB/s  3.6889GB/s  3.6889GB/s
          1                     dram_write_throughput                        Device Memory Write Throughput  255.45GB/s  255.45GB/s  255.45GB/s
          1                          dram_utilization                             Device Memory Utilization     Mid (6)     Mid (6)     Mid (6)
          1             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                          ecc_transactions                                      ECC Transactions           0           0           0
          1                            ecc_throughput                                        ECC Throughput  0.00000B/s  0.00000B/s  0.00000B/s

From the results, I see the “dram_read_transactions” higher in the transpose, probably an indication that it is not caching as much as in the normal case.

the last result votes for “L2 associativity” issue rather than TLB issue: usually caches aren’t fully associative. Instead each DRAM line is mapped to specific cache associative set which hold only 4-16 last entries from all entries mapped to the same set.

https://en.wikipedia.org/wiki/CPU_cache#Associativity

Interesting. Well it seems that there is not much to do from the developer side.
Many thanks to both.