I have programmed a tiled (TILE_WIDTH = 32) matrix-matrix multiply following the code in [Kirk and Hwu], and a non-tiled version for comparison. The tiled version does show reduced global memory access and 100% gst and gld efficiency, but it takes about twice as long as the non-tiled version. Why is this?
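For context, a tiled kernel in the style of Kirk and Hwu usually looks roughly like the sketch below. This is not the code that was profiled here. The parameter names and the (matrix, rows, cols) meaning of the arguments are assumptions inferred from the kernel signature in the profiler output, and the matrix sizes in `main` are made up for a quick check against a CPU reference.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 32

// Assumed argument layout: (matrix, rows, cols) for A, B and C, all row-major.
__global__ void matrixMultiplicationKernel(const float* A, int aRows, int aCols,
                                           const float* B, int bRows, int bCols,
                                           float* C, int cRows, int cCols)
{
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (aCols + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE_WIDTH + threadIdx.x;
        int bRow = t * TILE_WIDTH + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range elements become 0.
        tileA[threadIdx.y][threadIdx.x] =
            (row < aRows && aCol < aCols) ? A[row * aCols + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < bRows && col < bCols) ? B[bRow * bCols + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // finish using the tiles before they are overwritten
    }
    if (row < cRows && col < cCols)
        C[row * cCols + col] = acc;
}

int main()
{
    // Hypothetical sizes, deliberately not multiples of TILE_WIDTH.
    const int M = 70, K = 45, N = 33;
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = (i % 7) * 0.5f;
    for (int i = 0; i < K * N; ++i) hB[i] = (i % 5) * 0.25f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, hA.size() * sizeof(float));
    cudaMalloc((void**)&dB, hB.size() * sizeof(float));
    cudaMalloc((void**)&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((N + TILE_WIDTH - 1) / TILE_WIDTH, (M + TILE_WIDTH - 1) / TILE_WIDTH);
    matrixMultiplicationKernel<<<grid, block>>>(dA, M, K, dB, K, N, dC, M, N);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a straightforward CPU reference.
    float maxErr = 0.0f;
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[r * K + k] * hB[k * N + c];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[r * N + c]));
        }
    std::printf("max abs error: %g\n", maxErr);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return maxErr < 1e-3f ? 0 : 1;
}
```

With TILE_WIDTH = 32 this sketch launches 1024 threads per block and uses 8 KB of shared memory per block for the two tiles.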
TILED
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./matrix_mul_gen_tiled
...
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  100.00%  33.9590s  101    336.23ms  330.28ms  640.15ms  matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_tiled
..
Invocations  Metric Name     Metric Description              Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
1            gst_efficiency  Global Memory Store Efficiency  100.00%     100.00%     100.00%
1            gld_efficiency  Global Memory Load Efficiency   100.00%     100.00%     100.00%
1            gld_throughput  Global Load Throughput          552.94MB/s  552.94MB/s  552.94MB/s
1            gst_throughput  Global Store Throughput         8.6396MB/s  8.6396MB/s  8.6396MB/s
NON-TILED
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof ./matrix_mul_gen_cuda
...
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  100.00%  17.9302s  101    177.53ms  169.16ms  323.36ms  matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
rreddy78@jetson-nano:~/Desktop/Technical$ sudo /usr/local/cuda/bin/nvprof --metrics gst_efficiency,gld_efficiency,gld_throughput,gst_throughput ./matrix_mul_gen_cuda
...
Invocations  Metric Name     Metric Description              Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
Kernel: matrixMultiplicationKernel(float const *, int, int, float const *, int, int, float*, int, int)
1            gst_efficiency  Global Memory Store Efficiency  100.00%     100.00%     100.00%
1            gld_efficiency  Global Memory Load Efficiency   82.50%      82.50%      82.50%
1            gld_throughput  Global Load Throughput          15.194GB/s  15.194GB/s  15.194GB/s
1            gst_throughput  Global Store Throughput         12.156MB/s  12.156MB/s  12.156MB/s