CUDA "out of memory" error: how to increase the matrix size in multiplication

Hi all,

I just implemented a matrix multiplication code from the programming guide.

I have a 9200M GE in my laptop; it has 256 MB of memory, and I am using Ubuntu 9.04.

I am providing my code below. If I put N = 1024 it works fine; if N = 2048 it gives a "CUDA out of memory" error.

Please help me find out how to increase N.
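For what it's worth, three 2048 x 2048 float matrices should only need about 48 MB (3 * 2048 * 2048 * 4 bytes), so I don't understand why 256 MB isn't enough. A quick sanity check of my own arithmetic (this snippet is mine, not from the guide):

[codebox]// rough device-memory footprint of the three N x N float matrices
// N = 2048 -> 3 * 2048 * 2048 * 4 bytes = 48 MB (well under 256 MB)
size_t footprint = 3ULL * (size_t)N * N * sizeof(float);
printf("device memory needed: %.1f MB\n", footprint / 1024.0 / 1024.0);[/codebox]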

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int N = 2048;

typedef struct {
	int width;
	int height;
	float* elements;
} Matrix;

#define BSIZE 16

// forward declaration of the kernel function
__global__ void matkernel(const Matrix, const Matrix, Matrix);

void matmul(const Matrix A, const Matrix B, Matrix C){
	clock_t timeStart, timeEnd;

	// allocate A in device memory and copy it over
	Matrix d_A;
	d_A.width = A.width;
	d_A.height = A.height;
	size_t sizeA = A.width * A.height * sizeof(float);
	cudaMalloc((void**)&d_A.elements, sizeA);
	cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);

	// allocate B in device memory and copy it over
	Matrix d_B;
	d_B.width = B.width;
	d_B.height = B.height;
	size_t sizeB = B.width * B.height * sizeof(float);
	cudaMalloc((void**)&d_B.elements, sizeB);
	cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

	// allocate C in device memory
	Matrix d_C;
	d_C.width = C.width;
	d_C.height = C.height;
	size_t sizeC = C.width * C.height * sizeof(float);
	cudaMalloc((void**)&d_C.elements, sizeC);

	// time the kernel invocation; the launch is asynchronous, so we
	// synchronize before reading the clock again
	dim3 threads(BSIZE, BSIZE);
	dim3 Grid(B.width / threads.x, A.height / threads.y);
	timeStart = clock();
	matkernel<<<Grid, threads>>>(d_A, d_B, d_C);
	cudaThreadSynchronize();
	timeEnd = clock();

	// check for launch errors
	cudaError_t erro = cudaGetLastError();
	if (erro != cudaSuccess) {
		printf("ERROR: kernel launch failed!\n");
	}
	printf("error is %s\n", cudaGetErrorString(erro));

	// read C back from the device
	cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);

	// print the second half of the last row of C
	for (int i = N/2 + (C.width * (N-1)); i < C.width * C.height; i++){
		printf("%f(%d)  ", C.elements[i], i % C.width);
		if (i % C.width == C.width - 1)
			printf("\n");
	}
	printf("\nTime = %f s\n", (double)(timeEnd - timeStart) / CLOCKS_PER_SEC);

	// free device memory
	cudaFree(d_A.elements);
	cudaFree(d_B.elements);
	cudaFree(d_C.elements);
}

__global__ void matkernel(Matrix A, Matrix B, Matrix C){
	// each thread computes one element of C, accumulating the result in Cv
	float Cv = 0;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;

	// (row, col) identifies both an element of C and a thread in the grid.
	// A is traversed along row (row) and B is traversed vertically down a
	// constant column (col). The loop runs over A.width (= B.height), the
	// shared inner dimension; C.width = B.width and C.height = A.height.
	for (int e = 0; e < A.width; ++e){
		Cv += A.elements[row * A.width + e]
		    * B.elements[e * B.width + col];
	}
	C.elements[row * C.width + col] = Cv;
}

int main(){
	// C = A x B
	Matrix A;
	Matrix B;
	Matrix C;

	A.width  = N;
	A.height = N;
	B.width  = N;
	B.height = N;
	C.width  = B.width;
	C.height = A.height;

	A.elements = (float*) malloc(A.width * A.height * sizeof(float));
	B.elements = (float*) malloc(B.width * B.height * sizeof(float));
	C.elements = (float*) malloc(C.width * C.height * sizeof(float));

	for (int i = 0; i < A.width * A.height; i++){
		A.elements[i] = 1.0f;
	}
	for (int i = 0; i < B.width * B.height; i++){
		B.elements[i] = 1.0f;
	}

	matmul(A, B, C);

	free(A.elements);
	free(B.elements);
	free(C.elements);
	return 0;
}[/codebox]
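One thing I notice now: I never check the return values of the cudaMalloc calls. A small wrapper I could add (my own sketch, not from the programming guide) so that an out-of-memory allocation is reported exactly where it happens:

[codebox]// minimal error-check wrapper (my own sketch, not from the programming guide)
#define CUDA_CHECK(call)                                          \
	do {                                                      \
		cudaError_t err = (call);                         \
		if (err != cudaSuccess) {                         \
			printf("CUDA error \"%s\" at %s:%d\n",    \
			       cudaGetErrorString(err),           \
			       __FILE__, __LINE__);               \
		}                                                 \
	} while (0)

// usage: CUDA_CHECK(cudaMalloc((void**)&d_A.elements, sizeA));[/codebox]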

Hi, can you also post the output of ‘deviceQuery’?

There is 1 device supporting CUDA

Device 0: “GeForce 9200M GE”
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 267714560 bytes
Number of multiprocessors: 1
Number of cores: 8
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Your code is O.K.

Please use cuMemGetInfo() to check the memory usage on the device.

The code is:

[codebox]#include <cuda.h>             // driver API header, for cuMemGetInfo()
#include <cuda_runtime_api.h>
#include <cutil.h>
#include <cutil_inline.h>

void matmul(const Matrix A, const Matrix B, Matrix C)
{
	...
	size_t sizeC = C.width * C.height * sizeof(float);
	cudaMalloc((void**)&d_C.elements, sizeC);

	//[debug]
	unsigned int free_mem, total_mem, used_mem;
	cuMemGetInfo(&free_mem, &total_mem);
	used_mem = total_mem - free_mem;
	printf("total mem: %0.3f MB, free: %0.3f MB, used: %0.3f MB\n",
		((double)total_mem)/1024.0/1024.0,
		((double)free_mem )/1024.0/1024.0,
		((double)used_mem )/1024.0/1024.0);
	//[end debug]

	dim3 threads(BSIZE, BSIZE);
	dim3 Grid(B.width / threads.x, A.height / threads.y);
	...[/codebox]

Link with the library C:/CUDA/lib64/cuda.lib (64-bit) or C:/CUDA/lib/cuda.lib (32-bit).

Check N < 2048 and watch the memory usage on the device.
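If linking the driver library is inconvenient, newer toolkits expose the same numbers through the runtime API; a sketch, assuming your toolkit version already has cudaMemGetInfo():

[codebox]// runtime-API alternative (assumes the toolkit provides cudaMemGetInfo);
// no extra library needed beyond the usual cudart
size_t free_mem, total_mem;
cudaMemGetInfo(&free_mem, &total_mem);
printf("free: %.3f MB of %.3f MB\n",
	free_mem /1024.0/1024.0,
	total_mem/1024.0/1024.0);[/codebox]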

Sorry for the delayed response…
I think I know why you were getting those nasty ‘out of memory’ errors. It’s your innocent-looking statement ‘#define BSIZE 16’. This statement says that you are going to launch only 16 threads per block! Try increasing it to 32, 64, or 128 (I would suggest 128). Everything should work properly for you.
I’ve attached a file which contains this modification made to the code you pasted above.
ff.cu (3.32 KB)

@teju

But I am using a 2D block dimension, so I think BSIZE 16 already gives 256 threads per block.
(With compute capability 1.1 you can have at most 512 threads per block; 768 is the limit on resident threads per multiprocessor.)

My dim is

dim3 threads(BSIZE, BSIZE);

16 x 16 = 256 threads

By the way, I ran your modified code, and it gives the error "error is invalid configuration argument".
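I suspect that error comes straight from the 512-threads-per-block limit: with BSIZE 128, as suggested, the launch would ask for 128 x 128 = 16384 threads per block. A guard one could put before the launch (just a sketch):

[codebox]// sketch: guard the launch configuration against the per-block limit
// (512 threads per block on compute capability 1.x)
dim3 threads(BSIZE, BSIZE);
if (threads.x * threads.y > 512) {
	printf("BSIZE = %d gives %d threads per block, over the 512 limit\n",
	       BSIZE, BSIZE * BSIZE);
	return;
}[/codebox]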

Ah… sorry I missed that.

But here’s the command line I used to compile and run the program I posted before.

/cygdrive/c/CUDA/bin/nvcc -I C:\CUDA\include -I C:\CUDA\NVIDIA_CUDA_SDK\common\inc --compiler-bindir "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -D _CRT_SECURE_NO_DEPRECATE -D _CRT_NONSTDC_NO_DEPRECATE ff.cu

./a.exe

I tried with BSIZE set to 64 as well as 128. On my system it works just fine :)

Here’s the deviceQuery equivalent for my system:

What is the Linux equivalent of "link with library C:/CUDA/lib64/cuda.lib (64-bit) or C:/CUDA/lib/cuda.lib (32-bit)"?

This is the error I get:

undefined reference to `cuMemGetInfo’
collect2: ld returned 1 exit status
make: *** […/…/bin/linux/release/multi] Error 1
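Edit: got past the link error. On Linux the driver API lives in libcuda.so (it ships with the display driver), so the equivalent is adding -lcuda to the nvcc/link line, e.g. (with whatever include paths you already use):

nvcc multiguide.cu -o multi -lcuda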

I have modified the code a bit. The cudaSafeCall() returns:

bibrak@biebo-laptop:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./multi
error is no error
cudaSafeCall() Runtime API error in file <multiguide.cu>, line 78 : the launch timed out and was terminated.

This happens when I put N = 2048, or any N > 1024.

What do you guys think the problem is?

Oops, sorry, I forgot the attachments; there they are:
multiguidekernel.cu (1022 Bytes)
multiguide.cu (3.06 KB)
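Given "Run time limit on kernels: Yes" in the deviceQuery output above, "the launch timed out and was terminated" looks like the display watchdog rather than a memory problem: on a GPU that also drives the display, a single kernel launch is killed after a few seconds, and a naive 2048 x 2048 multiply on one multiprocessor can easily take longer than that. One workaround, sketched below, is to split the work into several shorter launches; this assumes a modified kernel (here called matkernel_strip, hypothetical) that takes a starting row:

[codebox]// sketch: split C into horizontal strips so each launch stays under the
// watchdog limit; matkernel_strip is a hypothetical variant of matkernel
// that offsets its row index by row0
int strip = 256;  // rows per launch, small enough to finish quickly
dim3 threads(BSIZE, BSIZE);
for (int row0 = 0; row0 < A.height; row0 += strip) {
	dim3 grid(B.width / threads.x, strip / threads.y);
	matkernel_strip<<<grid, threads>>>(d_A, d_B, d_C, row0);
	cudaThreadSynchronize();  // let each strip finish before the next
}[/codebox]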