Tiled Matrix Multiplication Vastly Slower Than Simple Matrix Multiplication

Hi,

I’m a college student learning CUDA with the textbook by Kirk and Hwu, “Programming Massively Parallel Processors – A Hands-on Approach”.

I tried to implement tiled matrix multiplication on my own machine. If I’m understanding correctly, the tiled version is supposed to reduce the global-memory bandwidth bottleneck, because each tile loaded into shared memory is reused by the whole thread block. But the runtime of my code is somehow much slower than the basic matrix multiplication implementation. I checked my code again and again and compared it to the textbook code, but I can’t find the reason why this is happening.

Could someone explain why this is happening? Below are my device info, my kernel code, and my results:

Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version          11.5 / 11.5 
CUDA Capability Major/Minor version number:    7.5 
#define TILE_SIZE 8
__global__
void tiledMatMult_kernel(float* Cd, float* Ad, float* Bd, int width) {
	__shared__ float tileA[TILE_SIZE][TILE_SIZE];
	__shared__ float tileB[TILE_SIZE][TILE_SIZE];

	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	float res = 0;

	// Loop over all the tiles of A and B needed for this output element
	for (int i = 0; i < ceil((float) width / TILE_SIZE); ++i) {
		// Each thread loads one element of the A tile and one of the B tile
		// into shared memory, zero-padding elements outside the matrix
		if (tx + i * TILE_SIZE < width && row < width) {
			tileA[ty][tx] = Ad[row * width + tx + i * TILE_SIZE];
		}
		else {
			tileA[ty][tx] = 0;
		}

		if (ty + i * TILE_SIZE < width && col < width) {
			tileB[ty][tx] = Bd[(ty + i * TILE_SIZE) * width + col];
		}
		else {
			tileB[ty][tx] = 0;
		}
		__syncthreads();

		for (int k = 0; k < TILE_SIZE; k++) {
			// dot products
			res += tileA[ty][k] * tileB[k][tx];
		}
		__syncthreads();
	}

	if (col < width && row < width) {
		Cd[row * width + col] = res;
	}
		
}
__global__
void matMultKer(float* Cd, float* Ad, float* Bd, int width) {
	// Naive version: each thread reads a full row of A and a full column of B from global memory
	int row = threadIdx.y + blockDim.y * blockIdx.y;
	int col = threadIdx.x + blockDim.x * blockIdx.x;

	if (row < width && col < width) {
		float res = 0;
		for (int k = 0; k < width; ++k) {
			res += Ad[row * width + k] * Bd[k * width + col];
		}
		Cd[row * width + col] = res;
	}
}

Result:

.\build\MatMult.exe 128       
Matrix width set to 128
Finished generating random matrices.   
CPU Simple MatMult <- 5342us (5.342ms)     
GPU Simple MatMult <- 871us (0.871ms)
GPU Tiled MatMult <- 233448us (233.448ms)
Result correct!

Below is the nvprof output:

nvprof .\build\MatMult.exe 128
==29724== NVPROF is profiling process 29724, command: .\build\MatMult.exe 128
Matrix width set to 128
Finished generating random matrices.
CPU Simple MatMult <- 5231us (5.231ms)
GPU Simple MatMult <- 986us (0.986ms)
GPU Tiled MatMult <- 227934us (227.934ms)
Result correct!
==29724== Profiling application: .\build\MatMult.exe 128
==29724== Warning: 19 API trace records have same start and end timestamps.
This can happen because of short execution duration of CUDA APIs and low timer resolution on the underlying operating system.
==29724== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.91%  227.34ms         1  227.34ms  227.34ms  227.34ms  tiledMatMult_kernel(float*, float*, float*, int)
                    0.05%  108.83us         1  108.83us  108.83us  108.83us  matMultKer(float*, float*, float*, int)
                    0.02%  52.991us         4  13.247us  13.215us  13.280us  [CUDA memcpy HtoD]
                    0.02%  37.823us         2  18.911us  18.463us  19.360us  [CUDA memcpy DtoH]
      API calls:   54.90%  228.27ms         6  38.044ms  17.400us  227.47ms  cudaMemcpy
                   38.75%  161.10ms         7  23.015ms  1.9000us  160.81ms  cudaFree
                    6.25%  25.976ms         1  25.976ms  25.976ms  25.976ms  cuDevicePrimaryCtxRelease
                    0.06%  247.70us         6  41.283us  2.4000us  165.50us  cudaMalloc
                    0.02%  77.000us         2  38.500us  25.500us  51.500us  cuModuleUnload
                    0.02%  73.500us         2  36.750us  17.100us  56.400us  cudaLaunchKernel
                    0.00%  10.100us       101     100ns       0ns     900ns  cuDeviceGetAttribute
                    0.00%  3.4000us         2  1.7000us     100ns  3.3000us  cudaGetLastError
                    0.00%  3.2000us         3  1.0660us     200ns  2.6000us  cuDeviceGetCount
                    0.00%  1.4000us         2     700ns     200ns  1.2000us  cuDeviceGet
                    0.00%     400ns         1     400ns     400ns     400ns  cuDeviceGetName
                    0.00%     300ns         1     300ns     300ns     300ns  cuDeviceGetLuid
                    0.00%     100ns         1     100ns     100ns     100ns  cuDeviceTotalMem
                    0.00%     100ns         1     100ns     100ns     100ns  cuDeviceGetUuid

If you replace your tiled kernel with the example here, what do you find?

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c-ab__improvement-reading-additional-data-shared-memory

I am wondering if the conditionals in your example are slowing things down.
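
For comparison, here is a minimal sketch of a tiled kernel without the per-element boundary conditionals. To be clear, this is not the exact kernel from the guide, and it assumes width is an exact multiple of TILE_SIZE, so it is only meant for a quick timing comparison:

__global__
void tiledMatMultNoBounds(float* Cd, float* Ad, float* Bd, int width) {
	__shared__ float tileA[TILE_SIZE][TILE_SIZE];
	__shared__ float tileB[TILE_SIZE][TILE_SIZE];

	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int row = blockIdx.y * TILE_SIZE + ty;
	int col = blockIdx.x * TILE_SIZE + tx;
	float res = 0.0f;

	// width is assumed to be a multiple of TILE_SIZE, so no bounds checks are needed
	for (int i = 0; i < width / TILE_SIZE; ++i) {
		tileA[ty][tx] = Ad[row * width + i * TILE_SIZE + tx];
		tileB[ty][tx] = Bd[(i * TILE_SIZE + ty) * width + col];
		__syncthreads();

		for (int k = 0; k < TILE_SIZE; ++k) {
			res += tileA[ty][k] * tileB[k][tx];
		}
		__syncthreads();
	}

	Cd[row * width + col] = res;
}

If that version also takes hundreds of milliseconds, the slowdown is almost certainly coming from the host side (launch configuration or timing) rather than from the kernel body.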


It turns out the problem was not the kernel. I mistyped the grid size calculation, resulting in grid dimensions far larger than needed… The kernel itself is perfectly fine, and it is several times faster than the simple version.
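
In case it helps anyone else, the launch configuration should look roughly like this (a minimal sketch; dimGrid/dimBlock are illustrative names and the matrices are assumed square, width x width):

	// One thread per output element, TILE_SIZE x TILE_SIZE threads per block
	dim3 dimBlock(TILE_SIZE, TILE_SIZE);
	// Round the grid size up so the whole matrix is covered; my bug made
	// these dimensions far larger than this
	dim3 dimGrid((width + TILE_SIZE - 1) / TILE_SIZE,
	             (width + TILE_SIZE - 1) / TILE_SIZE);
	tiledMatMult_kernel<<<dimGrid, dimBlock>>>(Cd, Ad, Bd, width);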

Thanks to everyone who put thought into this post.
