Maximum matrix size for matrix multiplication operation on GeForce GTX 960M

What is the largest matrix dimension we can handle in a matrix multiplication?
Is there a way to calculate the maximum matrix size from the device properties?

This depends mostly on the memory available on your device.
If you multiply two matrices of 134217728 single-precision (4-byte) elements each, then regardless of their dimensions each one takes 512 MB. If the result is stored in one of these arrays, you need 1 GB of free memory for the arrays; if a third array holds the result, you need memory for that one too (for square matrices another 512 MB, so 1.5 GB in total).
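
As a rough sketch of that arithmetic, the helper below estimates the largest square dimension N for which a given number of N x N matrices fits in a byte budget. The name max_square_dim is hypothetical; on a real device the budget would typically be the free-memory value reported by cudaMemGetInfo, and the estimate ignores any extra workspace a library might allocate.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest N such that `num_matrices` square N x N
// matrices of `elem_size`-byte elements fit in `free_bytes`.
std::size_t max_square_dim(std::size_t free_bytes, std::size_t elem_size,
                           std::size_t num_matrices)
{
    if (elem_size == 0 || num_matrices == 0) return 0;
    // Number of elements each matrix may hold.
    const std::size_t elems = free_bytes / (elem_size * num_matrices);
    // Floating-point square root as a first guess, then fix rounding.
    std::size_t n = static_cast<std::size_t>(
        std::sqrt(static_cast<long double>(elems)));
    while (n * n > elems) --n;               // guess was too large
    while ((n + 1) * (n + 1) <= elems) ++n;  // guess was too small
    return n;
}
```

For example, max_square_dim(1536ull << 20, sizeof(float), 3) gives 11585: three arrays of at most 134217728 floats each, which matches the 512 MB per matrix figure above.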

If you need details about a specific library, such as cuBLAS, consult its documentation.

Would using CUDA managed memory for the matrices make it possible to multiply matrices larger than the device memory can hold (at a hefty performance penalty, obviously)?


CBuchner, when I try to allocate more memory than is free with cudaMallocManaged, on Windows or Linux, the call returns an error before I even call cudaMemset to initialize the memory to 0.

I think you were recently into some obscure experiments attempting to somehow hack the managed memory references? Did you succeed?

Allocating more device memory than what is physically available should be possible on CUDA 9 or 10 with a Pascal or Volta device on Linux (and assuming you do not exceed the amount of host memory available, which is the backing store in this case). Subject to those restrictions, oversubscription of device memory should be possible, and I know of no reason why it should not be possible to use oversubscription in a matrix multiply scenario, but I haven’t specifically tried it myself.

If you get an out-of-memory error when you attempt to oversubscribe the device, it means you have not met one or more of the requirements above. You cannot do it on Windows, and you cannot do it on a Maxwell or earlier device.
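
The requirements above can be written down as a small check. This is only a sketch: can_oversubscribe and its parameters are hypothetical names, and it uses the compute capability major version as a stand-in for the architecture (6 for Pascal, 5 for Maxwell). On a real system, querying cudaDevAttrConcurrentManagedAccess through cudaDeviceGetAttribute is, as far as I know, the more direct test.

```cpp
#include <cstddef>

// Hypothetical helper encoding the conditions stated above: managed-memory
// oversubscription needs a Pascal-or-newer GPU (compute capability 6.x+),
// Linux, and a request that still fits in host memory (the backing store).
bool can_oversubscribe(int cc_major, bool is_linux,
                       std::size_t request_bytes, std::size_t host_bytes)
{
    const bool pascal_or_newer = cc_major >= 6;
    const bool fits_in_host = request_bytes <= host_bytes;
    return is_linux && pascal_or_newer && fits_in_host;
}
```

A Maxwell device (compute capability 5.x) fails this check on any operating system.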

The GTX 960M is a Maxwell part.


I was trying to answer your question in a general way.

With respect to the specific case indicated by the OP, there is no benefit (e.g. size-wise) from using managed memory. Oversubscription is not possible on a GTX 960M.

After looking at the comments above, I tried a couple of experiments and ran into another issue, this time with copying device memory back to the host. With an integer matrix of size 1504x1504, the result matrix shows no output when I print it on the host after the kernel call, but when I print the values from inside the kernel they are correct. My guess is that cudaMemcpy() is failing to copy the device memory to the host after the kernel runs. Is there a limit on how much cudaMemcpy() can transfer from device to host?
The device used for this operation is the same GTX 960M.

There are no limits of the kind you are imagining.

Run the failing case with cuda-memcheck. Or use ordinary debugging techniques.

Thanks for the reply.
I’m adding my matrix multiplication code here. All I get is 0 as the output for matrix sizes like 512, 1024, 2912, etc., but the code gives the correct output for matrix sizes like 320, 480, 160, 33, etc.
I really have no idea why it works for some matrix sizes and not for others.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t MultiplyWithCuda(const int *a, const int *b, int *c, unsigned int size);

__global__ void MultiplyKernel(const int *a, const int *b, int *c, int size)
{
    int Pvalue = 0;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.x * blockDim.y + threadIdx.y;
    if (i < size && j < size)
    {
        for (int m = 0; m < size; m++)
        {
            Pvalue = 0;
            for (int k = 0; k < size; k++)
                Pvalue += a[m * size + k] * b[k * size + m];
            c[m * size + j] = Pvalue;
            //printf("m = %d\tj = %d [%d * %d + %d] = %d\n", m, j, m, size, j, c[m * size + j]);
        }
    }
}

int main()
{
    const int width = 512;
    int M[width * width], N[width * width], P[width * width];
    int count = 5;
    for (int i = 0; i < width * width; i++)
    {
        M[i] = count;
        N[i] = count;
        P[i] = 0;
    }
    MultiplyWithCuda(M, N, P, width);
    return 0;
}

cudaError_t MultiplyWithCuda(const int *a, const int *b, int *c, unsigned int size)
{
    cudaError_t error;
    int s = size * size * sizeof(int);
    int *Md, *Nd, *Pd;
    error = cudaSetDevice(0);
    printf("%s\n", cudaGetErrorString(error));
    cudaMalloc((void **)&Md, s);
    cudaMalloc((void **)&Nd, s);
    cudaMalloc((void **)&Pd, s);
    cudaMemcpy((void *)Md, (void *)a, s, cudaMemcpyHostToDevice);
    cudaMemcpy((void *)Nd, (void *)b, s, cudaMemcpyHostToDevice);
    cudaMemcpy((void *)Pd, (void *)c, s, cudaMemcpyHostToDevice);

    dim3 dimBlock(32, 32, 1);
    dim3 dimGrid(16, 1, 1);
    float timeSpent = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);    // events must be created before they are recorded
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    MultiplyKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // wait for the stop event before reading the timer

    cudaEventElapsedTime(&timeSpent, start, stop);
    cudaMemcpy((void *)c, (void *)Pd, s, cudaMemcpyDeviceToHost);

    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            printf(" %d ", c[i * size + j]);

    printf("\nTime spent for execution : %f\n", timeSpent * 0.001);
    error = cudaGetLastError();
    printf("%d", error);
    printf("Error : %s\n\n", cudaGetErrorString(error));
    return error;
}


Your code seems to run correctly for me. It outputs 12800 (many times).
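
That value is what the arithmetic predicts: with every input element equal to 5 and a size of 512, each output element is 5 * 5 * 512 = 12800. A host-side reference like the sketch below can be used to check a GPU result element by element; cpu_matmul is a hypothetical name, and it assumes row-major square matrices as in the posted code.

```cpp
#include <cstddef>
#include <vector>

// CPU reference: row-major C = A * B for square size x size int matrices.
std::vector<int> cpu_matmul(const std::vector<int>& a,
                            const std::vector<int>& b, std::size_t size)
{
    std::vector<int> c(size * size, 0);
    for (std::size_t i = 0; i < size; ++i) {
        for (std::size_t k = 0; k < size; ++k) {
            const int aik = a[i * size + k];
            // Accumulate row i of A times row k of B into row i of C.
            for (std::size_t j = 0; j < size; ++j)
                c[i * size + j] += aik * b[k * size + j];
        }
    }
    return c;
}
```

Comparing against inputs that are not all the same value is a stronger check, because uniform inputs can hide indexing mistakes.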

You appear to be running on Windows. You may be running into a WDDM TDR timeout. (Please google that.)

After changing the TDR parameter, I’m able to see the output. Thanks for the help.
Now I’m facing a performance drop on the GPU.
I ran the same set of matrix sizes on the CPU and on the GPU, and the CPU beats the GPU: the GPU takes longer than the CPU. Is the TDR change affecting performance?
Earlier the GPU was outperforming the CPU.

The TDR should have no effect on the runtime behavior of a kernel, assuming the TDR does not terminate the kernel.

This is a naive matrix multiply. If you want a fast matrix multiply, use cuBLAS. GPUs will often be faster at floating point operations than at integer operations. The cuBLAS routines will expect you to pass/use floating point data.
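
To put a number on "naive", the host-side sketch below counts the multiply-adds the posted kernel performs for a given launch configuration and compares that with what the product actually needs. It assumes the loop nesting suggested by the kernel's commented-out printf (every thread that passes the bounds check runs the full m and k loops); the function names are hypothetical.

```cpp
#include <cstdint>

// Multiply-adds executed by the posted kernel: each thread that passes
// the bounds check runs size * size inner iterations.
std::uint64_t posted_kernel_macs(std::uint64_t size, std::uint64_t grid_x,
                                 std::uint64_t block_x, std::uint64_t block_y)
{
    std::uint64_t active = 0;
    for (std::uint64_t bx = 0; bx < grid_x; ++bx)
        for (std::uint64_t ty = 0; ty < block_y; ++ty)
            for (std::uint64_t tx = 0; tx < block_x; ++tx) {
                const std::uint64_t i = bx * block_x + tx;
                const std::uint64_t j = bx * block_y + ty; // blockIdx.x, as posted
                if (i < size && j < size) ++active;
            }
    return active * size * size;
}

// Multiply-adds a size x size matrix product needs.
std::uint64_t required_macs(std::uint64_t size)
{
    return size * size * size;
}
```

For size 512 with the posted 16-block grid of 32x32 threads, this gives 4294967296 multiply-adds against 134217728 required, so the kernel does 32 times the necessary work, because the 32 threads that share a threadIdx.y all compute the same column. If the grid stays at 16 blocks, j never exceeds 511, so larger sizes would also leave columns unwritten. Separately, the inner product reads b[k * size + m] where a row-major product would read b[k * size + j]; that goes unnoticed here only because every input element is 5.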