Limit on cublasAlloc?

Hi,

Until recently I was under the impression that I should be able to allocate all of the GPU memory.

The GTX 285 has 16373 x 64 KiB of total global memory as shown by deviceQuery. That's a little over 1023 MiB. When allocating memory in a simple loop (cf. code below) I can allocate at most 984 MiB, so I'm missing 629 x 64 KiB (about 39 MiB). On a GTX 480, which has 24569 x 64 KiB of total global memory (about 1535 MiB), I can allocate 1429 MiB, thus missing 1705 x 64 KiB (about 107 MiB).

Has anybody else observed this behaviour? Does somebody know what this is like on the Tesla C2050 or C2070? Can somebody give an explanation for this?

I tested this with CuBLAS 3.2 RC, the latest 64-bit Linux driver, and the following Fortran program using the standard Fortran wrapper.

Cheers,

Eric

[codebox]

      PROGRAM TEST_MEM
      IMPLICIT NONE
      INTEGER GPURAM, MIBYTE
      PARAMETER (GPURAM=1023, MIBYTE=1024*1024)
      INTEGER IX, STAT
      EXTERNAL  CUBLAS_INIT, CUBLAS_SHUTDOWN, CUBLAS_ALLOC, CUBLAS_FREE
      INTEGER*4 CUBLAS_INIT, CUBLAS_SHUTDOWN, CUBLAS_ALLOC, CUBLAS_FREE
      INTEGER*8 DEVLOCB

! >   Initialize CuBLAS
      STAT = CUBLAS_INIT()
      IF (STAT .NE. 0)  WRITE(*,*) 'cublas init failed'

! >   Allocate 1 MiB per iteration until the allocation fails
      IX = 1
  100 STAT = CUBLAS_ALLOC(1, MIBYTE, DEVLOCB)
      IF (STAT .NE. 0) THEN
         WRITE(*,*) 'allocation failed for ix =', IX
      ELSE
         IX = IX + 1
         GOTO 100
      END IF

! >   Shutdown CuBLAS
      STAT = CUBLAS_SHUTDOWN()
      IF (STAT .NE. 0)  WRITE(*,*) 'cublas shutdown failed'
      END

[/codebox]

The CUDA context uses something on the order of 40-50 MB of memory. That is where the “missing” memory is going.


I can live with that, but I do not understand why it increases in the case of the GTX 480.

Another thing I just tried is to query the available memory with cudaMemGetInfo. The problem I ran into was that I could not allocate the total free memory: I had to reduce the requested size by 4 MiB before the allocation succeeded. A subsequent call to cudaMemGetInfo then showed those 4 MiB as free memory, yet I could not allocate another byte from it.

IMHO the function MUST give me a reliable size that can actually be allocated.
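In CUDA C terms, the probe is roughly the following (a reconstructed sketch, not the exact code I ran, which was Fortran): query cudaMemGetInfo, allocate 4 MiB less than the reported free amount, then try to allocate what is still reported as free.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const size_t MIB = 1024 * 1024;
    size_t freeMem, totalMem;
    void *big = NULL, *rest = NULL;

    if (cublasInit() != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublas init failed\n");
        return 1;
    }

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    /* Allocating the full reported free size fails; shrinking the
       request by 4 MiB makes it succeed. */
    if (cudaMalloc(&big, freeMem - 4 * MIB) != cudaSuccess)
        printf("big allocation failed\n");

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    /* Trying to grab what cudaMemGetInfo still reports as free fails. */
    if (cudaMalloc(&rest, freeMem) != cudaSuccess)
        printf("allocation failed\n");

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    cudaFree(big);
    cublasShutdown();
    return 0;
}
[/codebox]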

[codebox]

heinemey@gpu-1:~/tmp/nvidia> a.out
FREEMEM,TOTALMEM= 1032392448 1073020928
FREEMEM,TOTALMEM= 4096 1073020928
allocation failed 0
FREEMEM,TOTALMEM= 4096 1073020928

[/codebox]

Cheers,

Eric


Well, the GTX 480 is 64-bit internally and supports features like concurrent kernel execution, so it might be expected that the state required to make CUDA work is larger. Also, there are per-thread resources like local memory which are reserved from SDRAM, and Fermi is designed to support a lot more threads in flight than the GT200 was. I would imagine that is where a lot of the extra memory goes.

From what I can tell, there are page-size limits which come into play, such that the remaining free memory can be smaller than the page size your allocation uses (CUDA isn't unique in this respect). If you allocate one huge chunk, the driver uses a large page size, and the available memory isn't an even multiple of that page size. If you make many smaller allocations, the driver chooses a smaller page size for each and you can allocate all of the memory. In my linear algebra codes I usually do the following when I want to allocate every last drop of available memory in a single allocation to manage myself (pilfered from V. Volkov):

const size_t MB = 1<<20;

while( cudaMalloc( (void**)&pool, reserved ) != cudaSuccess )
{
    reserved -= MB;
    if( reserved < MB )
    {
        free( cpu_buffer );
        cublasAssert( cublasShutdown( ) );
        return false;
    }
}
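For completeness, here is a self-contained version of the same idea (the function and variable names are just illustrative, not taken from my actual code): keep shrinking the request by 1 MiB until cudaMalloc accepts it.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

/* Grab (nearly) all free device memory in one block, backing off 1 MiB
   at a time until cudaMalloc succeeds.  Returns NULL if nothing fits. */
static void *grab_pool(size_t *reserved)
{
    const size_t MB = 1 << 20;
    size_t freeMem, totalMem;
    void *pool = NULL;

    cudaMemGetInfo(&freeMem, &totalMem);
    *reserved = freeMem;

    while (cudaMalloc(&pool, *reserved) != cudaSuccess) {
        *reserved -= MB;
        if (*reserved < MB)
            return NULL;
    }
    return pool;
}

int main(void)
{
    size_t got = 0;
    void *pool = grab_pool(&got);

    if (pool) {
        printf("reserved %zu MiB in a single allocation\n", got >> 20);
        cudaFree(pool);
    } else {
        printf("could not reserve any memory\n");
    }
    return 0;
}
[/codebox]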


CUDA offers some additional functionality on the GTX 480 (e.g., device-side printf), so it seems reasonable that more memory is needed for that.

The memory may also be fragmented, so that it cannot be allocated in one piece.


The device was initialized with cublasInit(). The next thing I did was to query the amount of available and total memory through a call to cudaMemGetInfo. Then I tried to allocate exactly the amount that cudaMemGetInfo claimed was free, assuming that must be safe to allocate. If this fails, it simply is a bug that must be fixed. If there is a fragmentation problem, the memory manager MUST know about it and MUST give me a safe limit for my request. Besides, fragmentation of the device memory would then be a consequence of the cublasInit call.

Another thing that occurred to me: after initializing the device I would like to grab all the memory there is and manage it myself. Do I need to reserve some of that memory for the binaries of the kernel calls, or do they reside in a different memory that is safely hidden from the user?

Allocating the memory in smaller chunks does not solve the problem either. The sum of all chunks that can be allocated is still smaller than the free memory reported by cudaMemGetInfo.
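To make the chunked case concrete, here is a small sketch (again a reconstruction, not the exact code I used): it allocates 1 MiB pieces until cudaMalloc fails and compares the total with what cudaMemGetInfo reported as free beforehand.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t MIB = 1024 * 1024;
    size_t freeMem, totalMem, allocated = 0;
    void *p;

    cudaMemGetInfo(&freeMem, &totalMem);

    /* Allocate 1 MiB chunks until the driver refuses; the chunks are
       deliberately leaked, this is only a probe and the context is
       torn down when the process exits. */
    while (cudaMalloc(&p, MIB) == cudaSuccess)
        allocated += MIB;

    printf("reported free: %zu MiB, actually allocatable: %zu MiB\n",
           freeMem >> 20, allocated >> 20);
    return 0;
}
[/codebox]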

Cheers,

Eric


When you call cublasInit, all the kernels of the CuBLAS library that correspond to your architecture are loaded into GPU memory.

Unless you use other libraries or your own kernels, or also use the GPU for your display, the rest of the available GPU memory can be used for your data.
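If you want to see the footprint for yourself, you can compare the free memory before and after cublasInit (a rough sketch; how much of the cost shows up immediately may depend on when the driver actually loads the modules):

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    size_t beforeFree, afterFree, total;

    cudaFree(0);                        /* force plain context creation */
    cudaMemGetInfo(&beforeFree, &total);

    cublasInit();                       /* loads the CuBLAS kernels */
    cudaMemGetInfo(&afterFree, &total);

    printf("CuBLAS footprint: roughly %zu MiB of device memory\n",
           (beforeFree - afterFree) >> 20);

    cublasShutdown();
    return 0;
}
[/codebox]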


This makes perfect sense. It would also explain the almost doubled amount of “missing memory” in the case of the GTX 480, as I expect the CuBLAS routines to be quite a bit more complex for the Fermi architecture. :-)

So what happens if, at some later point during my application, I call a kernel function that has so far never been used? Let us assume that I have used up all of the available memory I could get hold of after the CuBLAS library was initialized. Is there a swapping mechanism inside the CUDA driver, or do I get some kernel-launch error? Is there a way I can explicitly handle the loading and unloading of the kernel binaries (cubin code)?
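To make the last question concrete, what I have in mind is something along the lines of the driver API's explicit module handling (a rough sketch on my part, assuming cuModuleLoad/cuModuleUnload are the relevant calls; the file and kernel names are placeholders):

[codebox]
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* "mykernels.cubin" and "mykernel" are placeholder names. */
    if (cuModuleLoad(&mod, "mykernels.cubin") == CUDA_SUCCESS) {
        cuModuleGetFunction(&fn, mod, "mykernel");
        /* ... set up arguments and launch ... */
        cuModuleUnload(mod);   /* releases the device memory backing the code */
    }

    cuCtxDestroy(ctx);
    return 0;
}
[/codebox]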
