Tiled Matrix Multiplication Vastly Slower Than Simple Matrix Multiplication

Hi,

I’m a college student learning CUDA with the textbook by Kirk and Hwu, “Programming Massively Parallel Processors – A Hands-on Approach”.

I tried to implement tiled matrix multiplication on my own machine. If I’m understanding correctly, the tiled version is supposed to reduce the global-memory bandwidth bottleneck, because each tile loaded into shared memory is reused by the whole thread block. But the runtime of my code is somehow much slower than the basic matrix multiplication implementation. I checked my code again and again and compared it to the textbook code, but I can’t find the reason why this is happening.

Could someone explain why this is happening? Below are my device info, my kernel code, and my results:

Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version          11.5 / 11.5 
CUDA Capability Major/Minor version number:    7.5 
#define TILE_SIZE 8
__global__
void tiledMatMult_kernel(float* Cd, float* Ad, float* Bd, int width) {
	__shared__ float tileA[TILE_SIZE][TILE_SIZE];
	__shared__ float tileB[TILE_SIZE][TILE_SIZE];

	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	float res = 0;

	// Loop over all the tiles of A and B needed for this output element
	for (int i = 0; i < ceil((float) width / TILE_SIZE); ++i) {
		// Each thread loads one element of the A tile and one of the B tile
		// into shared memory, zero-padding elements outside the matrix
		if (tx + i * TILE_SIZE < width && row < width) {
			tileA[ty][tx] = Ad[row * width + tx + i * TILE_SIZE];
		}
		else {
			tileA[ty][tx] = 0;
		}

		if (ty + i * TILE_SIZE < width && col < width) {
			tileB[ty][tx] = Bd[(ty + i * TILE_SIZE) * width + col];
		}
		else {
			tileB[ty][tx] = 0;
		}
		__syncthreads();

		for (int k = 0; k < TILE_SIZE; k++) {
			// dot products
			res += tileA[ty][k] * tileB[k][tx];
		}
		__syncthreads();
	}

	if (col < width && row < width) {
		Cd[row * width + col] = res;
	}
		
}
__global__
void matMultKer(float* Cd, float* Ad, float* Bd, int width) {
	// Naive version: each thread reads a full row of A and a full column of B from global memory
	int row = threadIdx.y + blockDim.y * blockIdx.y;
	int col = threadIdx.x + blockDim.x * blockIdx.x;

	if (row < width && col < width) {
		float res = 0;
		for (int k = 0; k < width; ++k) {
			res += Ad[row * width + k] * Bd[k * width + col];
		}
		Cd[row * width + col] = res;
	}
}

Result:

.\build\MatMult.exe 128       
Matrix width set to 128
Finished generating random matrices.   
CPU Simple MatMult <- 5342us (5.342ms)     
GPU Simple MatMult <- 871us (0.871ms)
GPU Tiled MatMult <- 233448us (233.448ms)
Result correct!

Below is the nvprof output:

nvprof .\build\MatMult.exe 128
==29724== NVPROF is profiling process 29724, command: .\build\MatMult.exe 128
Matrix width set to 128
Finished generating random matrices.
CPU Simple MatMult <- 5231us (5.231ms)
GPU Simple MatMult <- 986us (0.986ms)
GPU Tiled MatMult <- 227934us (227.934ms)
Result correct!
==29724== Profiling application: .\build\MatMult.exe 128
==29724== Warning: 19 API trace records have same start and end timestamps.
This can happen because of short execution duration of CUDA APIs and low timer resolution on the underlying operating system.
==29724== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.91%  227.34ms         1  227.34ms  227.34ms  227.34ms  tiledMatMult_kernel(float*, float*, float*, int)
                    0.05%  108.83us         1  108.83us  108.83us  108.83us  matMultKer(float*, float*, float*, int)
                    0.02%  52.991us         4  13.247us  13.215us  13.280us  [CUDA memcpy HtoD]
                    0.02%  37.823us         2  18.911us  18.463us  19.360us  [CUDA memcpy DtoH]
      API calls:   54.90%  228.27ms         6  38.044ms  17.400us  227.47ms  cudaMemcpy
                   38.75%  161.10ms         7  23.015ms  1.9000us  160.81ms  cudaFree
                    6.25%  25.976ms         1  25.976ms  25.976ms  25.976ms  cuDevicePrimaryCtxRelease
                    0.06%  247.70us         6  41.283us  2.4000us  165.50us  cudaMalloc
                    0.02%  77.000us         2  38.500us  25.500us  51.500us  cuModuleUnload
                    0.02%  73.500us         2  36.750us  17.100us  56.400us  cudaLaunchKernel
                    0.00%  10.100us       101     100ns       0ns     900ns  cuDeviceGetAttribute
                    0.00%  3.4000us         2  1.7000us     100ns  3.3000us  cudaGetLastError
                    0.00%  3.2000us         3  1.0660us     200ns  2.6000us  cuDeviceGetCount
                    0.00%  1.4000us         2     700ns     200ns  1.2000us  cuDeviceGet
                    0.00%     400ns         1     400ns     400ns     400ns  cuDeviceGetName
                    0.00%     300ns         1     300ns     300ns     300ns  cuDeviceGetLuid
                    0.00%     100ns         1     100ns     100ns     100ns  cuDeviceTotalMem
                    0.00%     100ns         1     100ns     100ns     100ns  cuDeviceGetUuid

If you replace your tiled kernel with the example here, what do you find?

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c-ab__improvement-reading-additional-data-shared-memory

I am wondering if the conditionals in your example are slowing things down.
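
For comparison, here is a minimal sketch of a tiled kernel without the per-element boundary conditionals. To be clear, this is not the exact kernel from the guide, and it assumes width is an exact multiple of TILE_SIZE, so it is only meant for a quick timing comparison:

__global__
void tiledMatMultNoBounds(float* Cd, float* Ad, float* Bd, int width) {
	__shared__ float tileA[TILE_SIZE][TILE_SIZE];
	__shared__ float tileB[TILE_SIZE][TILE_SIZE];

	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int row = blockIdx.y * TILE_SIZE + ty;
	int col = blockIdx.x * TILE_SIZE + tx;
	float res = 0.0f;

	// width is assumed to be a multiple of TILE_SIZE, so no bounds checks are needed
	for (int i = 0; i < width / TILE_SIZE; ++i) {
		tileA[ty][tx] = Ad[row * width + i * TILE_SIZE + tx];
		tileB[ty][tx] = Bd[(i * TILE_SIZE + ty) * width + col];
		__syncthreads();

		for (int k = 0; k < TILE_SIZE; ++k) {
			res += tileA[ty][k] * tileB[k][tx];
		}
		__syncthreads();
	}

	Cd[row * width + col] = res;
}

If that version also takes hundreds of milliseconds, the slowdown is almost certainly coming from the host side (launch configuration or timing) rather than from the kernel body.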


It turns out the problem was not the kernel. I mistyped the grid size calculation, resulting in grid dimensions far larger than needed… The kernel itself is perfectly fine, and it is several times faster than the simple version.
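
In case it helps anyone else, the launch configuration should look roughly like this (a minimal sketch; dimGrid/dimBlock are illustrative names and the matrices are assumed square, width x width):

	// One thread per output element, TILE_SIZE x TILE_SIZE threads per block
	dim3 dimBlock(TILE_SIZE, TILE_SIZE);
	// Round the grid size up so the whole matrix is covered; my bug made
	// these dimensions far larger than this
	dim3 dimGrid((width + TILE_SIZE - 1) / TILE_SIZE,
	             (width + TILE_SIZE - 1) / TILE_SIZE);
	tiledMatMult_kernel<<<dimGrid, dimBlock>>>(Cd, Ad, Bd, width);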

Thanks to everyone who put thought into this post.
