How to improve this matrix multiplication code in CUDA?

Hi, I’m implementing matrix multiplication code from a book.
But I ran into trouble because the size of the matrix cannot be made large…

I assume that the matrices are square, so A X B = C --> [N*N] X [N*N] = [N*N],
and I also use a tile for the block index.

How can I set the size of the matrix to be 1000 X 1000 or more?
(I use a Titan X, and this GPU supports 1024 threads per block.)

#include "device_launch_parameters.h"
#include "book.h"
#include "cuda_runtime.h"

#define N 128
#define TILE_WIDTH 16

__global__ void matMulkernel(int *a, int *b, int *c, int width){

    // Each thread computes one element of c.
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    int Pvalue = 0;   // int accumulator, matching the int matrices

    for(int k = 0; k < width ; ++k)
        Pvalue += a[Row*width+k] * b[k*width+Col];

    c[Row*width+Col] = Pvalue;
}

int main(){

    int a[N*N], b[N*N], c[N*N];
    int *dev_a, *dev_b, *dev_c;
    int _size = N*N*sizeof(int);

    for(int i = 0; i < N*N; i++){
        a[i] = 2;
        b[i] = 3;
        c[i] = 0;
    }

    HANDLE_ERROR(cudaMalloc((void**)&dev_a, _size));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, _size));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, _size));

    HANDLE_ERROR(cudaMemcpy(dev_a, a, _size, cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, _size, cudaMemcpyHostToDevice));

    dim3 dimGrids(N/TILE_WIDTH, N/TILE_WIDTH);
    dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);

    matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

    HANDLE_ERROR(cudaMemcpy(c, dev_c, _size, cudaMemcpyDeviceToHost));

    for(int i = 0; i< N*N; i++)
        printf("%d \n", c[i]);

    // Release the device buffers.
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_c));

    return 0;
}

“How can I set the size of the matrix to be 1000 X 1000 or more”

well, what is the biggest size you can currently achieve?
and what is constraining it?
i suspect device global memory, since your kernel depends mostly on the tile size, and not on local/shared memory or SM characteristics

Thanks for your reply!

The biggest size I got to work is N -> 280, tile -> 10 or 20 (the tile can be other values).
When I tried setting N to 290 or 300, it didn’t work.

And I couldn’t understand what you meant by "I suspect~".

If you don’t mind, I would like you to explain what I am doing wrong.
Thank you!

i actually expected an N >> 280

“when I tried setting N to 290 or 300, it didn’t work”

in what way? what happened? an output error message perhaps?

This is going to be a problem for large N:

int a[N*N], b[N*N], c[N*N];

Stack-based variables have size limits that are much lower than those of dynamically allocated, heap-based variables.
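
To put rough numbers on this, here is a minimal sketch. It assumes a 1 MiB default stack (typical for MSVC builds on Windows; the thread does not state the platform or its actual limit) and a 4-byte int. The helper names and the ASSUMED_STACK_BYTES constant are hypothetical, introduced only for illustration.

```c
#include <stddef.h>

/* ASSUMPTION: 1 MiB default stack. The real limit depends on the OS
   and on linker settings, and is not stated in this thread. */
#define ASSUMED_STACK_BYTES ((size_t)1024 * 1024)

/* Bytes needed by three n x n int matrices (a, b and c). */
size_t matrices_bytes(size_t n)
{
    return 3 * n * n * sizeof(int);
}

/* Returns 1 if the three matrices alone fit in the assumed stack, else 0.
   Other locals and call frames also need stack space, so a result of 1
   does not guarantee that the program runs. */
int fits_in_assumed_stack(size_t n)
{
    return matrices_bytes(n) <= ASSUMED_STACK_BYTES;
}
```

With a 4-byte int, N = 280 needs 940,800 bytes, N = 290 needs 1,009,200 bytes (leaving under 40 KB of a 1,048,576-byte stack for everything else), and N = 300 needs 1,080,000 bytes. Under the assumed limit, that is consistent with the failures reported between 280 and 300. N = 1000 would need 12,000,000 bytes.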

If you replace that stack declaration of a, b, and c with the following:

int *a, *b, *c;
a = (int *)malloc(N*N*sizeof(int));
b = (int *)malloc(N*N*sizeof(int));
c = (int *)malloc(N*N*sizeof(int));

I think you’ll have better results.
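
Since malloc can fail for very large N, it may be worth checking the result. A minimal sketch of a checked allocation, assuming a hypothetical helper named alloc_matrix that is not part of book.h or the CUDA runtime:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate an n x n int matrix on the heap and set every element to value.
   Returns NULL if n is 0, if n*n*sizeof(int) would overflow size_t,
   or if malloc fails. The caller releases the result with free(). */
int *alloc_matrix(size_t n, int value)
{
    if (n == 0 || n > SIZE_MAX / sizeof(int) / n) {
        return NULL;
    }
    int *m = (int *)malloc(n * n * sizeof(int));
    if (m == NULL) {
        fprintf(stderr, "malloc failed for a %zu x %zu matrix\n", n, n);
        return NULL;
    }
    for (size_t i = 0; i < n * n; i++) {
        m[i] = value;
    }
    return m;
}
```

In main this would stand in for the array declaration and the initialization loop, for example int *a = alloc_matrix(N, 2);, with a matching free(a); once the results have been copied back and printed.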

And as little_jimmy said, “it didn’t work” is not very helpful in a forum like this. Be specific.
Probably best just to paste the actual error output you are getting into your question.
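
One more thing to check for the 1000 X 1000 target: with TILE_WIDTH 16, the integer division N/TILE_WIDTH in the dim3 dimGrids line truncates 62.5 to 62, so the grid would cover only 992 rows and columns. A minimal sketch of rounding the block count up instead. The helper names are hypothetical, and the kernel would additionally need a guard such as if (Row < width && Col < width) so that the surplus threads in the last blocks do not read or write out of bounds.

```c
/* Number of tile-sized blocks needed to cover n elements, rounded up.
   Assumes n >= 0 and tile > 0. */
int blocks_needed(int n, int tile)
{
    return (n + tile - 1) / tile;
}

/* How many of the n rows (or columns) a grid of `blocks` blocks covers. */
int rows_covered(int n, int tile, int blocks)
{
    int covered = blocks * tile;
    return covered < n ? covered : n;
}
```

The launch would then use dim3 dimGrids(blocks_needed(N, TILE_WIDTH), blocks_needed(N, TILE_WIDTH));. When N is a multiple of TILE_WIDTH, as with N = 128 and TILE_WIDTH = 16, both forms give the same grid, so the truncation does not show up at that size.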

Thank you for your reply. I need to be more specific to get more information!
Your answer is very helpful for me!

thanks txbob!
I’ve fixed my problem thanks to your advice!
And, as you mentioned above, I need to be specific to get good feedback.
I will do it next time, thank you!