Cannot allocate the total amount of global memory benchmark

Hi !

I’m currently trying to figure out why I cannot allocate the total amount of global memory available on my Tesla C1060. I wouldn’t mind if it were just a couple of KB, or even a few MB, but in my case more than a hundred MB seem to be unavailable. Does that sound normal to you?

Here is the code I’m using: maximumAllocatableMemory.cu

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    cudaError_t L_cuda_return;
    int count = 0;
    int *p_dev;
    int nb = 1000;

    /* Allocate nb ints over and over until cudaMalloc fails. */
    do {
        L_cuda_return = cudaMalloc((void **)&p_dev, nb * sizeof(int));
        count++;
    } while (L_cuda_return == cudaSuccess);

    /* The last call failed, so count - 1 allocations actually succeeded. */
    size_t total_allocated = (size_t)(count - 1) * nb * sizeof(int);

    printf("\n %d allocations really done so that a total amount of %lu B has been allocated with %lu B in a row",
           count - 1, (unsigned long)total_allocated,
           (unsigned long)(nb * sizeof(int)));

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);

    printf("\n The total amount of global memory available is %lu B in theory, then it appears that %lu B were not actually used\n",
           (unsigned long)deviceProp.totalGlobalMem,
           (unsigned long)(deviceProp.totalGlobalMem - total_allocated));

    return 0;
}[/codebox]

The result is:

1038592 allocations really done so that a total amount of 4154368000 B has been allocated with 4000 B in a row

The total amount of global memory available is 4294770688 B in theory, then it appears that 140402688 B were not actually used

Am I missing something obvious here?

Please try this simple code (which I compile with nvcc 3.0), since I’d really like to know how it turns out for you.

Just for the record, I’m not using this card for graphics purposes.

Thanks for your time

How far do you get with nb=1024? How far with larger sizes like nb=1048576?
As cudaMalloc() returns an address aligned to 256 bytes, with nb=1000 you lose 96 bytes per allocation, or about 2.3% of the total memory size.
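
To make that concrete, here is a back-of-the-envelope sketch of the rounding (the round-up-to-256 formula is my assumption about the allocator’s granularity, not documented behaviour):

[codebox]#include <stdio.h>

int main(void)
{
    /* Assumption: every request is rounded up to the next 256 B boundary. */
    size_t request = 1000 * sizeof(int);              /* 4000 B asked for */
    size_t rounded = (request + 255) & ~(size_t)255;  /* 4096 B consumed  */

    printf("lost per allocation: %lu B (%.1f%% of each block)\n",
           (unsigned long)(rounded - request),
           100.0 * (rounded - request) / rounded);    /* 96 B, about 2.3% */
    return 0;
}[/codebox]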

Well, first of all, thank you for your fast reply!

I tried with nb=1024 as you suggested, and the results are:

[i]1029120 allocations really done so that a total amount of 4215275520 B has been allocated with 4096 B in a row

The total amount of global memory available is 4294770688 B in theory, then it appears that 79495168 B were not actually used[/i]

Now, and I suppose it won’t really surprise you, with nb=1048576 I get:

[i]1005 allocations really done so that a total amount of 4215275520 B has been allocated with 4194304 B in a row

The total amount of global memory available is 4294770688 B in theory, then it appears that 79495168 B were not actually used[/i]

(starting to find it interesting…)

So you’re saying that I was not actually allocating 4000 bytes every time I called cudaMalloc, but 4096. That would mean that in my first example I actually consumed 99704832 B more than I thought. Still, the question remains for the 40697856 B that are left unused.
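
Just to double-check myself, here is a tiny sanity check of that arithmetic (plain host code, constants copied from the runs above):

[codebox]#include <stdio.h>

int main(void)
{
    /* Constants copied from the nb = 1000 run above. */
    unsigned long allocs    = 1038592UL;      /* successful cudaMalloc calls  */
    unsigned long requested = 4000UL;         /* bytes requested per call     */
    unsigned long padded    = 4096UL;         /* bytes consumed (256-aligned) */
    unsigned long total_mem = 4294770688UL;   /* deviceProp.totalGlobalMem    */

    printf("alignment overhead: %lu B\n", allocs * (padded - requested)); /* 99704832 */
    printf("still unaccounted : %lu B\n", total_mem - allocs * padded);   /* 40697856 */
    return 0;
}[/codebox]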

Now, back to the two examples where we know exactly how much memory we allocated: is there a chance you have an explanation for the 79495168 B that I could not allocate? It still sounds like a lot to me… (Anyway, thank you for your reply; I was not aware of how cudaMalloc aligns its allocations.)

I’ve just tried cudaMemGetInfo() and the result kind of surprises me. It reports that exactly 40628480 B are simply not allocatable, even though I called cudaMemGetInfo right at the beginning, before any allocation.
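
For reference, a minimal version of how I query it looks something like this (the cudaFree(0) is only there to make sure the context exists before measuring):

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;

    cudaFree(0);                        /* force CUDA context creation first */
    cudaMemGetInfo(&free_b, &total_b);  /* bytes free / total on the device  */

    printf("free: %lu B, total: %lu B, unavailable: %lu B\n",
           (unsigned long)free_b, (unsigned long)total_b,
           (unsigned long)(total_b - free_b));
    return 0;
}[/codebox]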

Using this method I can certainly avoid an “out of memory” error. However, I wonder why the total amount of global memory advertised by NVIDIA for the Tesla C1060 is about 40 MB more than what is really available on the card.

I’ve tested it on 3 different Tesla C1060 cards. The result remains unchanged; it seems like a constant loss of memory.

I know it only represents about 1% of the advertised global memory, but I suppose I’m curious…!

Some of the memory is reserved for the kernel driver and the CUDA context.

Well thank you for answering me. I should have thought about it…

However, what I truly would like to know is whether this reserved amount of memory is somehow quantifiable.

Not really. Beyond that reserved area (which can change from driver to driver for various reasons), you can also lose some due to memory manager fragmentation and other basic things like that.
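
If you need a concrete number anyway, the only reliable way I know of is to probe empirically. Here is a sketch (not an official API call) that binary-searches for the largest single cudaMalloc that succeeds:

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;

    cudaFree(0);                          /* force context creation        */
    cudaMemGetInfo(&free_b, &total_b);    /* free memory is an upper bound */

    /* Binary search for the largest single allocation that succeeds;
       fragmentation can make it smaller than the reported free memory. */
    size_t lo = 0, hi = free_b;
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                     /* mid bytes fit, try larger     */
        } else {
            cudaGetLastError();           /* clear the allocation error    */
            hi = mid - 1;                 /* mid bytes failed, try smaller */
        }
    }
    printf("largest single allocation: %lu B\n", (unsigned long)lo);
    return 0;
}[/codebox]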

Interesting problem. Is there any documented API function to estimate the maximal amount of linear memory available?

Regards
