CUDA error "out of memory": how to increase the matrix size in matrix multiplication

Hi all,

I just implemented the matrix multiplication code from the programming guide.

My laptop has a GeForce 9200M GE with 256 MB of memory, and I am running Ubuntu 9.04.

My code is below. With N = 1024 it works fine, but with N = 2048 it gives a “cuda out of memory error”.

Please help me find out how to increase N.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int N = 2048;

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

#define BSIZE 16

// forward declaration of kernel function
__global__ void matkernel(const Matrix, const Matrix, Matrix);

void matmul(const Matrix A, const Matrix B, Matrix C){
    float timeStart, timeEnd;

    // Allocate A in device memory and copy it from the host
    Matrix d_A;
    d_A.width = A.width;
    d_A.height = A.height;
    size_t sizeA = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d_A.elements, sizeA);
    cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);

    // Allocate B in device memory and copy it from the host
    Matrix d_B;
    d_B.width = B.width;
    d_B.height = B.height;
    size_t sizeB = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d_B.elements, sizeB);
    cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width;
    d_C.height = C.height;
    size_t sizeC = C.width * C.height * sizeof(float);
    cudaMalloc((void**)&d_C.elements, sizeC);

    // time to invoke kernel
    // (the grid size assumes the matrix dimensions are multiples of BSIZE)
    dim3 threads(BSIZE, BSIZE);
    dim3 Grid(B.width / threads.x, A.height / threads.y);

    // Kernel launches are asynchronous, so this interval (in clock ticks)
    // covers only the launch itself, not the kernel's run time.
    timeStart = clock();
    matkernel<<<Grid, threads>>>(d_A, d_B, d_C);
    timeEnd = clock();

    // read C from device
    cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);

    cudaError_t erro = cudaGetLastError();
    if(erro != cudaSuccess)
        printf("ERROR PREP launch FAIL!\n");
    printf("error is %s \n", cudaGetErrorString(erro));

    // print the second half of the last row of C
    for(int i = N/2 + (C.width * (N-1)); i < C.width*C.height; i++){
        printf("%f(%d)  ", C.elements[i], i%C.width);
        if(i%C.width == C.width-1)
            printf("\n");
    }

    printf("\nTime = %f ", timeEnd - timeStart);

    // free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

__global__ void matkernel(Matrix A, Matrix B, Matrix C){
    // each thread computes one element of C
    // here we accumulate results into Cv
    float Cv = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // (row,col) is the element of the matrix C as well as
    // a thread from our pool of threads
    for (int e = 0; e < A.width; ++e){
        Cv += A.elements[row * A.width + e]
            * B.elements[e * B.width + col];
    }

    // Matrix A is traversed along row (row), with e as the column index.
    // Matrix B is traversed vertically along the constant column (col).
    // The loop runs to A.width because that is the shared inner dimension
    // (A.width = B.height). The resultant matrix C has C.width = B.width
    // and C.height = A.height, because each row of A is multiplied and
    // accumulated with each column of B.
    C.elements[row * C.width + col] = Cv;
}

int main (){
    // A x B
    Matrix A;
    Matrix B;
    Matrix C;

    A.width  = N;
    A.height = N;
    B.width  = N;
    B.height = N;
    C.width  = B.width;
    C.height = A.height;

    //Host_M1 = (int*) malloc(N*N*sizeof(int));
    A.elements = (float*) malloc(A.width * A.height * sizeof(float));
    B.elements = (float*) malloc(B.width * B.height * sizeof(float));
    C.elements = (float*) malloc(C.width * C.height * sizeof(float));

    for(int i = 0; i < A.width*A.height; i++){
        A.elements[i] = (float) 1;
    }

    for(int i = 0; i < B.width*B.height; i++){
        B.elements[i] = (float) 1;
    }

    matmul(A, B, C);

    return 0;
}
[/codebox]

Hi, can you also post the output of ‘deviceQuery’?

There is 1 device supporting CUDA

Device 0: “GeForce 9200M GE”
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 267714560 bytes
Number of multiprocessors: 1
Number of cores: 8
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
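
As a rough cross-check against the deviceQuery figures above, the device memory that the posted code requests can be estimated by hand: it allocates three N x N float matrices (A, B and C). The sketch below does that arithmetic in plain C. The function names are made up for illustration, and it assumes 4-byte floats and that nothing else is allocated by the program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bytes requested on the device for the three n x n float matrices A, B, C. */
size_t matmul_device_bytes(size_t n)
{
    assert(n > 0);
    return 3 * n * n * sizeof(float);
}

/* Prints the requested amount next to the total reported by deviceQuery. */
void report_footprint(size_t n, size_t total_global_bytes)
{
    size_t need = matmul_device_bytes(n);
    printf("N = %zu: need %.1f MB of %.1f MB total\n",
           n,
           need / (1024.0 * 1024.0),
           total_global_bytes / (1024.0 * 1024.0));
}
```

Calling `report_footprint(2048, 267714560)` gives 48 MB requested against roughly 255 MB total, so the three arrays alone should fit. On an integrated GPU that also drives the display, the free amount is lower than the total, so this is only an upper bound on what is available.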

Your code is O.K.

Please use cuMemGetInfo() to check the memory usage on the device.

The code is:

[codebox]#include <cuda_runtime_api.h>
#include <cutil.h>
#include <cutil_inline.h>

void matmul(const Matrix A, const Matrix B, Matrix C)
{
    // ...

    size_t sizeC = C.width * C.height * sizeof(float);
    cudaMalloc((void**)&d_C.elements, sizeC);

    //[debug]
    // unsigned int matches the cuMemGetInfo() signature in older toolkits;
    // newer toolkits take size_t pointers instead.
    unsigned int free_mem, total_mem, used_mem;
    cuMemGetInfo( &free_mem, &total_mem );
    used_mem = total_mem - free_mem;
    printf("total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",
        ((double)total_mem)/1024.0/1024.0,
        ((double)free_mem )/1024.0/1024.0,
        ((double)used_mem )/1024.0/1024.0 );
    //[end debug]

    dim3 threads(BSIZE, BSIZE);
    dim3 Grid(B.width / threads.x, A.height / threads.y);

    // ...
}
[/codebox]

Link with the library C:/CUDA/lib64/cuda.lib (64-bit) or C:/CUDA/lib/cuda.lib (32-bit).

Then try values of N < 2048 and watch the memory usage on the device.

Sorry for the delayed response…
I think I know why you were getting those nasty ‘out of memory’ errors. It’s your innocent looking statement ‘#define BSIZE 16’. This statement says that you are going to launch only (and only) 16 threads per block! Try increasing it to 32 or 64 or 128 (I would suggest 128). Everything will work properly for you.
I’ve attached a file which contains this modification made to the code you pasted above.

@teju

But I am using a 2D block, so I think BSIZE 16 gives 256 threads per block.
With compute capability 1.1 you can have up to 512 threads per block (768 is the limit per multiprocessor).

My block dimension is
dim3 threads(BSIZE, BSIZE);

                        16 x 16 = 256 threads

By the way, I ran your modified code, and it gives the error “invalid configuration argument”.
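
The “invalid configuration argument” error is consistent with the block-size limits in the deviceQuery output earlier in the thread: with `dim3 threads(BSIZE, BSIZE)` a block holds BSIZE x BSIZE threads, and the device reports a maximum of 512 threads per block. The check below is a minimal sketch in plain C; the helper name is hypothetical and the limits are copied from that deviceQuery output, so they apply to this device only.

```c
#include <assert.h>
#include <stdbool.h>

/* Limits copied from the deviceQuery output posted in this thread. */
enum {
    MAX_THREADS_PER_BLOCK = 512,
    MAX_BLOCK_DIM_X = 512,
    MAX_BLOCK_DIM_Y = 512
};

/* Hypothetical helper: true if a bsize x bsize thread block fits the limits. */
bool block_config_is_valid(int bsize)
{
    assert(bsize > 0);
    return bsize <= MAX_BLOCK_DIM_X
        && bsize <= MAX_BLOCK_DIM_Y
        && bsize * bsize <= MAX_THREADS_PER_BLOCK;
}
```

Under these limits BSIZE 16 (256 threads) is accepted, while BSIZE 32, 64 or 128 would ask for 1024, 4096 or 16384 threads per block and be rejected at launch.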

Ah… sorry I missed that.

But here’s the command line I used to compile and run the program I posted before.

/cygdrive/c/CUDA/bin/nvcc -I C:\CUDA\include -I C:\CUDA\NVIDIA_CUDA_SDK\common\inc --compiler-bindir "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -D _CRT_SECURE_NO_DEPRECATE -D _CRT_NONSTDC_NO_DEPRECATE


I tried setting BSIZE to 64 as well as 128. On my system, it just works fine :)

Here’s the deviceQuery equivalent for my system:

What is the Linux equivalent of
“link with library C:/CUDA/lib64/cuda.lib (64-bit) or C:/CUDA/lib/cuda.lib (32-bit)”?

undefined reference to `cuMemGetInfo’
collect2: ld returned 1 exit status
make: *** […/…/bin/linux/release/multi] Error 1

This is the error I get.

I have modified the code a bit. cudaSafeCall() now reports:

bibrak@biebo-laptop:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./multi
error is no error
cudaSafeCall() Runtime API error in file <>, line 78 : the launch timed out and was terminated.

This happens when I set N = 2048, or any N > 1024.

What do you guys think the problem is?
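
A note on the “launch timed out” message: it usually means a single kernel ran longer than the display watchdog allows, and the deviceQuery output earlier in the thread does report “Run time limit on kernels: Yes”. Going from N = 1024 to N = 2048 multiplies the number of multiply-adds by 8, which is a lot for a device with 8 cores. One common workaround is to split the work into several shorter launches, each covering a band of rows of C. The plain C sketch below only does the bookkeeping for that idea. The helper names and the band size are assumptions, and the posted kernel would additionally need a row-offset parameter that it does not have.

```c
#include <assert.h>
#include <stdio.h>

/* Multiply-add count for an n x n times n x n product: one per (row, col, e). */
unsigned long long matmul_madds(unsigned long long n)
{
    return n * n * n;
}

/* Hypothetical helper: number of launches needed if each launch handles at
   most rows_per_launch rows of C (rounded up). */
int launches_needed(int n, int rows_per_launch)
{
    assert(n > 0 && rows_per_launch > 0);
    return (n + rows_per_launch - 1) / rows_per_launch;
}

/* Prints the half-open row range [first, last) that each launch would cover. */
void print_row_bands(int n, int rows_per_launch)
{
    assert(n > 0 && rows_per_launch > 0);
    for (int first = 0; first < n; first += rows_per_launch) {
        int left = n - first;
        int count = left < rows_per_launch ? left : rows_per_launch;
        printf("launch rows [%d, %d)\n", first, first + count);
    }
}
```

With N = 2048 and 256 rows per launch this gives 8 launches, each doing the same number of multiply-adds as a whole N = 1024 run. Whether that is short enough for the watchdog on this GPU would have to be measured.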

Oops, sorry, I forgot the attachments. Here they are.