Tiled matrix multiplication is slower

I have programmed a tiled (TILE_WIDTH = 32) matrix-matrix multiplication kernel following the code in [Kirk and Hwu], plus a non-tiled version for comparison. The tiled version does show the expected reduction in global memory accesses and 100% gst/gld efficiency, but it takes roughly twice as long as the non-tiled version. Why is this?
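
For reference, the kernel is based on the standard tiled pattern from the book. A minimal sketch of that pattern is below (illustrative only: the names and the M/K/N-style signature are mine, my actual kernel has the signature visible in the profiler output):

    #define TILE_WIDTH 32

    // Standard tiled multiply: C = A * B with A (M x K), B (K x N), C (M x N),
    // launched with dim3 block(TILE_WIDTH, TILE_WIDTH).
    __global__ void tiledMatMulSketch(const float *A, const float *B, float *C,
                                      int M, int K, int N)
    {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
            // Each thread loads one element of the A tile and one of the B tile.
            int aCol = t * TILE_WIDTH + threadIdx.x;
            int bRow = t * TILE_WIDTH + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();

            // Dot product over the current pair of tiles.
            for (int k = 0; k < TILE_WIDTH; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < M && col < N)
            C[row * N + col] = acc;
    }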

TILED

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./matrix_mul_gen_tiled 
...
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  33.9590s       101  336.23ms  330.28ms  640.15ms  matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_tiled
..
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gld_throughput                    Global Load Throughput  552.94MB/s  552.94MB/s  552.94MB/s
          1                            gst_throughput                   Global Store Throughput  8.6396MB/s  8.6396MB/s  8.6396MB/s

NON-TILED

rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./matrix_mul_gen_cuda 
...
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  17.9302s       101  177.53ms  169.16ms  323.36ms  matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                            gld_efficiency             Global Memory Load Efficiency      82.50%      82.50%      82.50%
          1                            gld_throughput                    Global Load Throughput  15.194GB/s  15.194GB/s  15.194GB/s
          1                            gst_throughput                   Global Store Throughput  12.156MB/s  12.156MB/s  12.156MB/s

Solved.
The tiled version is indeed faster… it now runs in about half the time of the non-tiled version.

The problem was that I transposed the second matrix while loading it into the shared-memory tile, thinking that would be better for the banked access. Instead, with TILE_WIDTH = 32 that layout causes heavy bank conflicts, and the kernel ends up slower than the non-tiled version. Loading the tile without the transpose fixed it.
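
Concretely, what I had when loading the B tile and doing the dot product was roughly the following (illustrative indexing; Bs here is a 32 x 32 float tile in shared memory and col is the output column the thread computes):

    // Transposed store into the tile: within a warp threadIdx.x runs 0..31,
    // so consecutive threads write addresses 32 floats apart. With 32
    // four-byte shared-memory banks they all land in the same bank,
    // i.e. a 32-way bank conflict on every store.
    Bs[threadIdx.x][threadIdx.y] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];

    // ...later, inside the k-loop, the read strides by 32 floats across the
    // warp as well, so every access is again a 32-way conflict.
    acc += As[threadIdx.y][k] * Bs[threadIdx.x][k];

The global load of B is coalesced either way; only the shared-memory side changes. Keeping the tile untransposed makes both the store (Bs[threadIdx.y][threadIdx.x]) and the read (Bs[k][threadIdx.x]) conflict-free, because consecutive threads then hit consecutive banks, while As[threadIdx.y][k] is just a broadcast.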

Now I have

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  8.02354s       101  79.441ms  75.651ms  166.92ms  matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)

And

  Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gld_throughput                    Global Load Throughput  3.0995GB/s  3.0995GB/s  3.0995GB/s
          1                            gst_throughput                   Global Store Throughput  49.592MB/s  49.592MB/s  49.592MB/s

Perfect…!

Sorry for the bother…!

But this was a good lesson on shared memory and bank conflicts.