Limit on cublasAlloc?

Hi,

Until recently I was under the impression that I should be able to allocate all of the GPU memory.

The GTX 285 has 16373 x 64 KiB of total global memory as shown by deviceQuery. That's a little over 1023 MiB. When allocating memory in a simple loop (cf. code below) I can allocate at most 984 MiB, so I'm missing 629 x 64 KiB (about 39 MiB). On a GTX 480, which has 24569 x 64 KiB of total global memory (about 1535 MiB), I can allocate 1429 MiB, thus missing 1705 x 64 KiB (about 107 MiB).

Has anybody else observed this behaviour? Does somebody know what this is like on the Tesla C2050 or C2070? Can somebody give an explanation for this?

I tested this with CuBLAS 3.2 RC, the latest 64-bit Linux driver, and the following Fortran program using the standard Fortran wrapper.

Cheers,

Eric

[codebox]

      PROGRAM TEST_MEM
      IMPLICIT NONE
      INTEGER GPURAM, MIBYTE
      PARAMETER (GPURAM=1023, MIBYTE=1024*1024)
      INTEGER IX, STAT
      EXTERNAL  CUBLAS_INIT, CUBLAS_SHUTDOWN, CUBLAS_ALLOC, CUBLAS_FREE
      INTEGER*4 CUBLAS_INIT, CUBLAS_SHUTDOWN, CUBLAS_ALLOC, CUBLAS_FREE
      INTEGER*8 DEVLOCB

! >   Initialize CuBLAS
      STAT = CUBLAS_INIT()
      IF (STAT .NE. 0)  WRITE(*,*) 'cublas init failed'

! >   Allocate 1 MiB per iteration until the allocation fails
      IX = 1
  100 STAT = CUBLAS_ALLOC(1, MIBYTE, DEVLOCB)
      IF (STAT .NE. 0) THEN
         WRITE(*,*) 'allocation failed for ix =', IX
      ELSE
         IX = IX + 1
         GOTO 100
      END IF

! >   Shutdown CuBLAS
      STAT = CUBLAS_SHUTDOWN()
      IF (STAT .NE. 0)  WRITE(*,*) 'cublas shutdown failed'
      END

[/codebox]

The CUDA context uses something on the order of 40-50 MB of memory. That is where the “missing” memory is going.


I can live with that, but I do not understand why it increases in the case of the GTX 480.

Another thing I just tried is to query the available memory with cudaMemGetInfo. The problem I ran into was that I could not allocate the total free memory: I had to reduce the requested size by 4 MiB before the allocation succeeded. A subsequent call to cudaMemGetInfo then showed those 4 MiB as free memory, yet I could not allocate another byte from it.

IMHO the function MUST give me a reliable size that can actually be allocated.
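In CUDA C terms, the probe is roughly the following (a reconstructed sketch, not the exact code I ran, which was Fortran): query cudaMemGetInfo, allocate 4 MiB less than the reported free amount, then try to allocate what is still reported as free.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const size_t MIB = 1024 * 1024;
    size_t freeMem, totalMem;
    void *big = NULL, *rest = NULL;

    if (cublasInit() != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublas init failed\n");
        return 1;
    }

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    /* Allocating the full reported free size fails; shrinking the
       request by 4 MiB makes it succeed. */
    if (cudaMalloc(&big, freeMem - 4 * MIB) != cudaSuccess)
        printf("big allocation failed\n");

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    /* Trying to grab what cudaMemGetInfo still reports as free fails. */
    if (cudaMalloc(&rest, freeMem) != cudaSuccess)
        printf("allocation failed\n");

    cudaMemGetInfo(&freeMem, &totalMem);
    printf("FREEMEM,TOTALMEM= %zu %zu\n", freeMem, totalMem);

    cudaFree(big);
    cublasShutdown();
    return 0;
}
[/codebox]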

[codebox]

heinemey@gpu-1:~/tmp/nvidia> a.out
FREEMEM,TOTALMEM= 1032392448 1073020928
FREEMEM,TOTALMEM= 4096 1073020928
allocation failed 0
FREEMEM,TOTALMEM= 4096 1073020928

[/codebox]

Cheers,

Eric


Well, the GTX 480 is 64-bit internally and supports features like concurrent kernel execution, so it might be expected that the state required to make CUDA work is larger. Also, there are per-thread resources like local memory which are reserved from SDRAM, and Fermi is designed to support a lot more threads in flight than the GT200 was. I would imagine that is where a lot of the extra memory goes.

From what I can tell, there are page-size limits which come into play, such that the remaining free memory can be smaller than the page size your allocation uses (CUDA isn't unique in this respect). If you allocate one huge chunk, the driver uses a large page size, and the available memory isn't an even multiple of that page size. If you make many smaller allocations, the driver chooses a smaller page size for each and you can allocate all of the memory. In my linear algebra codes I usually do the following when I want to allocate every last drop of available memory in a single allocation to manage myself (pilfered from V. Volkov):

const size_t MB = 1<<20;

while( cudaMalloc( (void**)&pool, reserved ) != cudaSuccess )
{
    reserved -= MB;
    if( reserved < MB )
    {
        free( cpu_buffer );
        cublasAssert( cublasShutdown( ) );
        return false;
    }
}
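For completeness, here is a self-contained version of the same idea (the function and variable names are just illustrative, not taken from my actual code): keep shrinking the request by 1 MiB until cudaMalloc accepts it.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

/* Grab (nearly) all free device memory in one block, backing off 1 MiB
   at a time until cudaMalloc succeeds.  Returns NULL if nothing fits. */
static void *grab_pool(size_t *reserved)
{
    const size_t MB = 1 << 20;
    size_t freeMem, totalMem;
    void *pool = NULL;

    cudaMemGetInfo(&freeMem, &totalMem);
    *reserved = freeMem;

    while (cudaMalloc(&pool, *reserved) != cudaSuccess) {
        *reserved -= MB;
        if (*reserved < MB)
            return NULL;
    }
    return pool;
}

int main(void)
{
    size_t got = 0;
    void *pool = grab_pool(&got);

    if (pool) {
        printf("reserved %zu MiB in a single allocation\n", got >> 20);
        cudaFree(pool);
    } else {
        printf("could not reserve any memory\n");
    }
    return 0;
}
[/codebox]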


CUDA offers some additional functionality on the GTX 480 (e.g., device-side printf), so it seems reasonable that more memory is needed for that.

The memory may also be fragmented, so that it cannot be allocated in one piece.


The device was initialized with cublasInit(). The next thing I did was to query the amount of available and total memory through a call to cudaMemGetInfo. Then I tried to allocate exactly the amount that cudaMemGetInfo claimed was free, assuming that must be safe to allocate. If this fails, it simply is a bug that must be fixed. If there is a fragmentation problem, the memory manager MUST know about it and MUST give me a safe limit for my request. Besides, fragmentation of the device memory would then be a consequence of the cublasInit call.

Another thing that occurred to me: after initializing the device I would like to grab all the memory there is and manage it myself. Do I need to reserve some of that memory for the binaries of the kernel calls, or do they reside in a different memory that is safely hidden from the user?

Allocating the memory in smaller chunks does not solve the problem either. The sum of all chunks that can be allocated is still smaller than the free memory reported by cudaMemGetInfo.
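To make the chunked case concrete, here is a small sketch (again a reconstruction, not the exact code I used): it allocates 1 MiB pieces until cudaMalloc fails and compares the total with what cudaMemGetInfo reported as free beforehand.

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t MIB = 1024 * 1024;
    size_t freeMem, totalMem, allocated = 0;
    void *p;

    cudaMemGetInfo(&freeMem, &totalMem);

    /* Allocate 1 MiB chunks until the driver refuses; the chunks are
       deliberately leaked, this is only a probe and the context is
       torn down when the process exits. */
    while (cudaMalloc(&p, MIB) == cudaSuccess)
        allocated += MIB;

    printf("reported free: %zu MiB, actually allocatable: %zu MiB\n",
           freeMem >> 20, allocated >> 20);
    return 0;
}
[/codebox]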

Cheers,

Eric


When you call cublasInit, all the kernels of the CuBLAS library that correspond to your architecture are loaded into GPU memory.

Unless you use other libraries or your own kernels, or also use the GPU for your display, the rest of the available GPU memory can be used for your data.
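If you want to see the footprint for yourself, you can compare the free memory before and after cublasInit (a rough sketch; how much of the cost shows up immediately may depend on when the driver actually loads the modules):

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    size_t beforeFree, afterFree, total;

    cudaFree(0);                        /* force plain context creation */
    cudaMemGetInfo(&beforeFree, &total);

    cublasInit();                       /* loads the CuBLAS kernels */
    cudaMemGetInfo(&afterFree, &total);

    printf("CuBLAS footprint: roughly %zu MiB of device memory\n",
           (beforeFree - afterFree) >> 20);

    cublasShutdown();
    return 0;
}
[/codebox]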


This makes perfect sense. It would also explain the almost doubled amount of “missing memory” in the case of the GTX 480, as I expect the CuBLAS routines to be quite a bit more complex for the Fermi architecture. :-)

So what happens if, at some later point during my application, I call a kernel function that has so far never been used? Let us assume that I have used up all of the available memory I could get hold of after the CuBLAS library was initialized. Is there a swapping mechanism inside the CUDA driver, or do I get some kernel-launch error? Is there a way I can explicitly handle the loading and unloading of the kernel binaries (cubin code)?
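To make the last question concrete, what I have in mind is something along the lines of the driver API's explicit module handling (a rough sketch on my part, assuming cuModuleLoad/cuModuleUnload are the relevant calls; the file and kernel names are placeholders):

[codebox]
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* "mykernels.cubin" and "mykernel" are placeholder names. */
    if (cuModuleLoad(&mod, "mykernels.cubin") == CUDA_SUCCESS) {
        cuModuleGetFunction(&fn, mod, "mykernel");
        /* ... set up arguments and launch ... */
        cuModuleUnload(mod);   /* releases the device memory backing the code */
    }

    cuCtxDestroy(ctx);
    return 0;
}
[/codebox]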
