I am trying to implement a fast matrix multiplication on the GPU using shared memory. I use the common example from the CUDA Best Practices Guide, which tiles both the A and B matrices and loads those tiles into shared memory. Comparing this implementation against cuBLAS, mine is much slower. When I profile the code and look at its SASS, I can see execution dependencies on the FFMA instructions (probably due to bank conflicts); in the cuBLAS SASS this appears to be resolved with the .reuse flag on registers, which I suppose is the register reuse cache. My question is how I can resolve this problem at the C level, and whether there are any other techniques for it. Thanks.
This is my code:
__global__ void gpu_square_matrix_mult(float *d_a, float *d_b, float *d_result, int n)
{
    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float tmp = 0.0f;
    int idx;

    // March the tiles across the K dimension, one BLOCK_SIZE step per iteration.
    for (int sub = 0; sub < gridDim.x; ++sub)
    {
        // Stage one element of A into shared memory, zero-filling out-of-range reads.
        idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
        tile_a[threadIdx.y][threadIdx.x] = (idx >= n * n) ? 0.0f : d_a[idx];

        // Stage one element of B the same way.
        idx = (sub * BLOCK_SIZE + threadIdx.y) * n + col;
        tile_b[threadIdx.y][threadIdx.x] = (idx >= n * n) ? 0.0f : d_b[idx];

        __syncthreads();

        // Partial dot product over this pair of tiles.
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            tmp += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < n && col < n)
    {
        d_result[row * n + col] = tmp;
    }
}
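One C-level technique that typically reduces FFMA dependency stalls is register blocking (thread tiling): each thread computes several output elements, so there are independent accumulator chains, and each staged B value is reused from a register across several FFMAs, which is the kind of operand reuse the .reuse flag expresses in SASS. A minimal sketch, not the cuBLAS implementation: the kernel name, the TM factor, and the assumption that n is a multiple of BLOCK_SIZE * TM (so the boundary checks can be dropped) are all mine.

#define BLOCK_SIZE 16
#define TM 4  // output elements per thread, stacked along the row dimension

__global__ void sgemm_regblock(const float *d_a, const float *d_b,
                               float *d_result, int n)
{
    __shared__ float tile_a[BLOCK_SIZE * TM][BLOCK_SIZE];
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int col     = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int rowBase = blockIdx.y * BLOCK_SIZE * TM + threadIdx.y * TM;

    float acc[TM] = {0.0f};  // TM independent accumulator chains

    for (int sub = 0; sub < n / BLOCK_SIZE; ++sub)
    {
        // Each thread stages TM elements of A and one element of B.
        for (int m = 0; m < TM; ++m)
            tile_a[threadIdx.y * TM + m][threadIdx.x] =
                d_a[(rowBase + m) * n + sub * BLOCK_SIZE + threadIdx.x];
        tile_b[threadIdx.y][threadIdx.x] =
            d_b[(sub * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            float b = tile_b[k][threadIdx.x];  // loaded once, reused TM times
            #pragma unroll
            for (int m = 0; m < TM; ++m)
                acc[m] += tile_a[threadIdx.y * TM + m][k] * b;
        }
        __syncthreads();
    }

    for (int m = 0; m < TM; ++m)
        d_result[(rowBase + m) * n + col] = acc[m];
}

Launch it with dim3 grid(n / BLOCK_SIZE, n / (BLOCK_SIZE * TM)) and dim3 block(BLOCK_SIZE, BLOCK_SIZE). Whether the compiler actually emits .reuse for the staged operand depends on the architecture and ptxas version, but the independent accumulators help on any of them.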
@scottgray I have seen your walkthrough article on SGEMM for the Maxwell architecture. It is very helpful, but unfortunately I am not using the Maxwell arch. Do you have an idea of how to resolve bank conflicts using NVIDIA's assembler, and whether the operand collector takes care of them?
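For what it's worth, the standard C-level fix for shared memory bank conflicts is to pad each tile with one extra column, which skews successive rows across different banks. A minimal sketch of the change, applied to the declarations in the kernel above (everything else stays the same):

    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE + 1]; // +1 column skews rows across banks
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE + 1];

Note, though, that with 32-bit elements the accesses in this kernel (tile_a[threadIdx.y][k] is a broadcast, tile_b[k][threadIdx.x] hits consecutive banks) are typically conflict-free on recent architectures, so it is worth checking the profiler's shared-memory replay counters before assuming conflicts are the bottleneck.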
It’s hard to imagine what you’ve done with the code you posted. The shared tile variable types are int. That would create lots of weirdness at the SASS level, not to mention being functionally incorrect.
By the way, to make your posted code easier to read, you can select it and then click on the </> button at the top of the edit window. This will put your code in a more nicely formatted box.
Not sure what “works” means; it would not give the correct answer for most float inputs.
It’s also hard to imagine that you witnessed an FFMA instruction, but, whatever. I would have expected an integer multiply, not a floating-point multiply.
Hopefully Scott Gray will come by and address your questions. I suspect, however, that if he had found a way to resolve this problem at the C level, or any other technique for it, he would not have resorted to the herculean effort of creating his own assembler and hand-crafting his assembly code.
Also, NVIDIA provides no assembler that I am aware of.
Yes, that’s a typo. It is dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
But I get a lot of execution dependency stalls on the FFMA instructions when I profile it in NVVP, as seen in the attachment.
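Execution dependency stalls on FFMA usually come from the serial chain through the single tmp accumulator: each tmp += ... must wait for the previous one to complete. Splitting the accumulator is the smallest C-level change that attacks this directly. A minimal sketch of that variant of the kernel above (the _ilp2 name is mine; to keep it short it assumes BLOCK_SIZE is even and n is a multiple of BLOCK_SIZE, so the zero-fill branches are dropped):

__global__ void gpu_square_matrix_mult_ilp2(const float *d_a, const float *d_b,
                                            float *d_result, int n)
{
    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float tmp0 = 0.0f, tmp1 = 0.0f;  // two chains instead of one serial tmp

    for (int sub = 0; sub < n / BLOCK_SIZE; ++sub)
    {
        tile_a[threadIdx.y][threadIdx.x] = d_a[row * n + sub * BLOCK_SIZE + threadIdx.x];
        tile_b[threadIdx.y][threadIdx.x] = d_b[(sub * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; k += 2)
        {
            // Even/odd k feed different accumulators, so consecutive FFMAs
            // no longer wait on each other.
            tmp0 += tile_a[threadIdx.y][k]     * tile_b[k][threadIdx.x];
            tmp1 += tile_a[threadIdx.y][k + 1] * tile_b[k + 1][threadIdx.x];
        }
        __syncthreads();
    }
    d_result[row * n + col] = tmp0 + tmp1;  // combine the chains at the end
}

The same idea generalizes to four or more chains; at some point register pressure, not chain length, becomes the limit.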