Hey,
I’m learning CUDA and was implementing a naive matrix multiplication along the lines of:
// One thread computes one element of C = A * B (square N x N matrices, row-major).
__global__ void naiveMM(double *__restrict__ a,
                        double *__restrict__ b,
                        double *__restrict__ c,
                        int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;
    if (col < N && row < N) {
        // Dot product of row 'row' of a with column 'col' of b.
        for (int i = 0; i < N; i++) {
            sum += a[row * N + i] * b[i * N + col];
        }
        c[row * N + col] = sum;
    }
}
Let's say I run this with matrices of size N = 3200, once with grid(100,100,1), block(32,32,1) and once with grid(200,200,1), block(16,16,1).
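For reference, a rough sketch of the host-side setup I mean (allocation/initialization and timing are stripped down; the grid arithmetic works out exactly because 3200 is divisible by both 32 and 16):

#include <cuda_runtime.h>

int main() {
    const int N = 3200;
    size_t bytes = (size_t)N * N * sizeof(double);
    double *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    // ... fill a and b ...

    // Variant 1: 32x32 = 1024 threads per block, 100x100 blocks
    dim3 block32(32, 32, 1);
    dim3 grid32(N / 32, N / 32, 1);
    naiveMM<<<grid32, block32>>>(a, b, c, N);

    // Variant 2: 16x16 = 256 threads per block, 200x200 blocks
    dim3 block16(16, 16, 1);
    dim3 grid16(N / 16, N / 16, 1);
    naiveMM<<<grid16, block16>>>(a, b, c, N);

    cudaDeviceSynchronize();
    return 0;
}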
We get the same total number of threads in both cases; the threads per block are divisible by 32 in both cases, so they fit perfectly into warps (1024 threads = 32 warps vs. 256 threads = 8 warps); and both warp counts are divisors of 64, so either block size can exactly fill an SM that holds 64 warps. So if I understand this post Question about threads per block and warps per SM correctly, they should be equal in terms of scheduling.
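(To double-check that assumption I'd query the occupancy API, something along these lines; reportOccupancy is just a hypothetical helper name, and I'm assuming register/shared-memory usage doesn't cap the two block sizes differently:)

#include <cstdio>
#include <cuda_runtime.h>

void reportOccupancy() {
    int blocks1024 = 0, blocks256 = 0;
    // Max resident blocks per SM for 1024-thread blocks (32 warps each)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks1024, naiveMM, 32 * 32, 0);
    // Max resident blocks per SM for 256-thread blocks (8 warps each)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks256, naiveMM, 16 * 16, 0);
    printf("resident warps per SM: %d (32x32) vs %d (16x16)\n",
           blocks1024 * 32, blocks256 * 8);
}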
I get around 1.5 times the performance using the block(16,16,1) approach, though.
When I took a look at how the threads within a block are grouped into warps, I found this section 1. Introduction — CUDA C Programming Guide, which indicates that they are grouped first along the x dimension and then along y.
If I'm understanding this correctly, that means that for the block(32,32,1) version all threads within a warp have the same row and consecutive col values, so the row value can be broadcast and the cols can be accessed in a nicely coalesced way.
For block(16,16,1), each warp looks at 2 rows and 16 columns, with the two row values being somewhat distant from each other in memory.
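A small host-side check of that lane layout (using the linear thread ID tid = threadIdx.y * blockDim.x + threadIdx.x from the guide, so lanes 0..31 of the first warp are the threads with tid 0..31; printWarp0 is just an illustrative helper):

#include <cstdio>

// Print which (row, col) each lane of the first warp of block (0,0) works on.
void printWarp0(int blockDimX) {
    printf("blockDim.x = %d:\n", blockDimX);
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % blockDimX;   // threadIdx.x
        int ty = lane / blockDimX;   // threadIdx.y
        printf("  lane %2d -> row %d, col %2d\n", lane, ty, tx);
    }
}

// printWarp0(32);  // all 32 lanes: row 0, cols 0..31
// printWarp0(16);  // lanes 0..15: row 0, cols 0..15; lanes 16..31: row 1, cols 0..15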
Intuitively that means the block(32,32,1) version should perform better, but across multiple runs and matrix sizes this is not the case.
What effect am I missing here? And when investigating behaviour like this, is there a profiler or profiler metric that could indicate what specifically is better about the block(16,16,1) version?