Allocating large arrays.


I have matrix-multiplication-like code that allocates arrays using cudaMalloc(). As long as the array sizes are small, everything is fine. But for large array sizes the kernel fails, not the cudaMalloc! Does anyone have any idea? I need to run the code for large array sizes, like 4K*8K.


  • eka

You are probably using excess resources. Also check your kernel launch (calling) configuration, or describe your situation in more detail.

An update: it also fails at cudaMalloc:

100: float* d_B;

101: unsigned int mem_size_B = sizeof(float) * 33554432; /* 32M words */

102: cutilSafeCall(cudaMalloc((void**) &d_B, mem_size_B));

cudaSafeCall() Runtime API error in file <>, line 101 : out of memory.

How much memory does your card have available? Also, do you have a display running off it? Are you allocating any other memory anywhere? Are you not freeing previously allocated memory?

Good questions!

I am running this on a GT 8800, which has 512 MB of global memory, so the card has more than the 128 MB that I want to allocate. All previously allocated buffers are being freed.

Maybe the rest of the memory is occupied by the display, as you pointed out.

I’d like to know if anyone else has been able to allocate that much memory on boards with more memory, or on boards that are not driving a display.
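One way to test the "display is using the memory" theory is to ask the runtime how much device memory is actually free just before the allocation. The following is only a sketch: it assumes a toolkit recent enough to expose cudaMemGetInfo in the runtime API, and the helper names toMiB and tryAllocate are made up for illustration. Free memory can also be fragmented, so a request smaller than the reported free amount may still fail.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Converts a byte count to whole mebibytes (rounds down), for printing.
inline size_t toMiB(size_t bytes) { return bytes / (1024 * 1024); }

// Hypothetical helper: prints free/total device memory, then attempts the
// allocation. Returns the status of the first call that fails, or cudaSuccess.
cudaError_t tryAllocate(size_t requestBytes) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) return err;

    printf("free %zu MiB of %zu MiB, requesting %zu MiB\n",
           toMiB(freeBytes), toMiB(totalBytes), toMiB(requestBytes));

    float* d_B = nullptr;
    err = cudaMalloc((void**)&d_B, requestBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    return cudaFree(d_B);  // release the test allocation again
}
```

Calling tryAllocate(sizeof(float) * 33554432) from main reproduces the 128 MB request from the post above and shows how much of the card's memory was free at that moment.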


  • eka

I have a similar problem with a Tesla C1060 card that has 4 GB of memory.
cudaMalloc does not fail, but when the kernel is invoked and I try to access the memory, the kernel crashes.

The code block is as below:
unsigned short* d_ImagePointer;
cutilSafeCall(cudaMalloc((void**) &d_ImagePointer, sizeof(unsigned short)*1024*1024*8)); //16 MB
cutilSafeCall(cudaMemset(d_ImagePointer,0,sizeof(unsigned short)*1024*1024*8));
cudaMemcpy(d_lpwSaveNpoint,h_ImagePointer,sizeof(unsigned short)*1024*1024*8, cudaMemcpyHostToDevice); //h_ImagePointer above holds a valid pointer to 16 MB of host-side memory

__global__ void kernel(unsigned short* d_ImagePointer)
{
int i;
int bufferlen = 1024*1024*8; //2 bytes per element
if (*(d_ImagePointer+i) == 0)
//do something
//do something else
}

With this, the kernel crashes.
Does cudaMalloc have any limitation in terms of memory size?
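A crash inside the kernel, as opposed to a failed cudaMalloc, often comes from an index that runs outside the buffer rather than from an allocation limit. Below is a minimal sketch of a bounds-checked kernel with error checks after the launch. It is not the poster's actual kernel: countZeros, blocksFor and launchCountZeros are hypothetical names, the zero-counting body is only a stand-in for "do something", and n is assumed to be greater than zero.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread inspects one element. The index is derived
// from the block/thread IDs and guarded so no thread reads past the buffer.
__global__ void countZeros(const unsigned short* image, size_t n,
                           unsigned int* zeroCount) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && image[i] == 0) {
        atomicAdd(zeroCount, 1u);  // stand-in for "do something"
    }
}

// Number of blocks needed to cover n elements (rounds up).
inline unsigned int blocksFor(size_t n, unsigned int threadsPerBlock) {
    return (unsigned int)((n + threadsPerBlock - 1) / threadsPerBlock);
}

// Launches the kernel and returns the first error seen, if any.
cudaError_t launchCountZeros(const unsigned short* d_image, size_t n,
                             unsigned int* d_zeroCount) {
    const unsigned int threads = 256;
    countZeros<<<blocksFor(n, threads), threads>>>(d_image, n, d_zeroCount);

    cudaError_t err = cudaGetLastError();  // e.g. invalid launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors raised while the kernel ran
}
```

For the 1024*1024*8 elements in the post this launches 32768 blocks of 256 threads. Checking the result of cudaDeviceSynchronize is what separates a bad launch configuration from an out-of-bounds access during execution.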

I have tried to test the limits of my matrix multiplication, and the failure seems to come down to one of the following:

  1. Time limit: the run exceeds a watchdog timer in Windows, so Windows reports that the driver has stopped responding. The matrix multiplication simply takes too long.
  2. Memory limit: allocating up to approximately 200 MB for each matrix is no problem, but allocating all three of them fails (I have an 8600M with 512 MB of RAM).

To investigate this further, I ran the sample on my desktop, which has an 8800 GTX card. On that one I could run with larger arrays, simply due to more processing power.

I guess the best way to handle this problem is to write some host code that divides the matrix into submatrices of a certain size and then runs those on the GPU one at a time.
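The submatrix idea could look roughly like the sketch below. It is one possible arrangement, not a tuned implementation: it assumes row-major storage, assumes B (K x N) fits on the device as a whole, and streams A and C through in panels of panelRows rows. The names matMulPanel and tiledMatMul are invented for the example, and the kernel is deliberately naive. Since each launch only covers one panel, each one also runs for a shorter time, which may help with the watchdog limit.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Naive kernel: C_panel (rows x N) = A_panel (rows x K) * B (K x N), row-major.
__global__ void matMulPanel(const float* A, const float* B, float* C,
                            int rows, int K, int N) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || c >= N) return;  // guard threads outside the panel
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += A[(size_t)r * K + k] * B[(size_t)k * N + c];
    C[(size_t)r * N + c] = sum;
}

// Hypothetical host driver: multiplies h_A (M x K) by h_B (K x N) into
// h_C (M x N), processing panelRows rows of A and C per kernel launch.
cudaError_t tiledMatMul(const float* h_A, const float* h_B, float* h_C,
                        int M, int K, int N, int panelRows) {
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    const size_t bytesB = sizeof(float) * (size_t)K * N;

    cudaError_t err = cudaMalloc((void**)&d_B, bytesB);
    if (err == cudaSuccess)
        err = cudaMalloc((void**)&d_A, sizeof(float) * (size_t)panelRows * K);
    if (err == cudaSuccess)
        err = cudaMalloc((void**)&d_C, sizeof(float) * (size_t)panelRows * N);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);

    for (int row0 = 0; err == cudaSuccess && row0 < M; row0 += panelRows) {
        const int rows = std::min(panelRows, M - row0);  // last panel may be short
        err = cudaMemcpy(d_A, h_A + (size_t)row0 * K,
                         sizeof(float) * (size_t)rows * K, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;

        dim3 block(16, 16);
        dim3 grid((N + 15) / 16, (rows + 15) / 16);
        matMulPanel<<<grid, block>>>(d_A, d_B, d_C, rows, K, N);
        err = cudaGetLastError();
        if (err != cudaSuccess) break;

        // This blocking copy waits for the kernel and reports its errors too.
        err = cudaMemcpy(h_C + (size_t)row0 * N, d_C,
                         sizeof(float) * (size_t)rows * N, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_A);  // cudaFree on a null pointer is a no-op
    cudaFree(d_B);
    cudaFree(d_C);
    return err;
}
```

With this layout the device only ever holds B plus one panel each of A and C, so panelRows can be lowered until the allocations fit. If B itself is too large, the same splitting would have to be applied along its columns as well.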

Just my experience so far :)

Regards, Tainruk