cudaMalloc3DArray out of memory: cannot allocate the available amount of memory

Hi,

At the moment I am having some trouble allocating memory for a 3D texture.

System: Win 7 x64

System Memory: 12 GB

CPU: Core i7 920 @ 2.67 GHz

Chipset: x58

CUDA Devices: Quadro FX 5800 and Tesla C1060, each with 4 GB of memory

Toolkit Version: 3.2

SDK Version: 3.2

C++: MS Visual Studio 2008 SP1

Both of the cards I have tested have compute capability 1.3. I also made sure that the nvcc flag is set to compile for 1.3.
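(For reference, the compiler invocation is essentially this, with file names shortened:)

nvcc -arch=sm_13 -o volumetest volumetest.cu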

The programming guide states that the maximum allowed size of a texture bound to a 3D array is 2048 x 2048 x 2048.

However, when trying to allocate more than 1216 x 1216 x 1216, my program terminates with an out of memory error, although I have 4 GB of video memory on each of the devices.
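To rule out a per-device texture limit, the limits can also be queried at runtime; a minimal sketch, assuming a toolkit version that exposes the maxTexture3D field in cudaDeviceProp:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // device 0; adjust for the Tesla
    printf("max 3D texture size: %d x %d x %d\n",
           prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
    return 0;
}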

This is the memory information from within the program:

Total mem: 4261085184 free mem: 4122873856

What I basically did to test the maximum size I can allocate is:

#include <cstdio>
#include <cuda_runtime.h>
#include <cutil_inline.h>                                 // cutilSafeCall, from the SDK

typedef unsigned char uchar;

texture<uchar, 3, cudaReadModeNormalizedFloat> tex;       // 3D texture for the volume
cudaArray *d_volumeArray = 0;                             // memory for the volume data

int main()
{
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>();
    size_t total, free, temp, used;

    cutilSafeCall( cudaMemGetInfo(&free, &total) );
    printf("Total mem: %Iu \t free mem: %Iu\n", total, free);   // %Iu: MSVC format for size_t (%lu truncates on x64)

    for (int i = 1; i < 1024; ++i)
    {
        cudaExtent volSize = make_cudaExtent(i * 2, i * 2, i * 2);   // 2, 4, 6, 8, ...

        cutilSafeCall( cudaMemGetInfo(&temp, &total) );              // memory information before the malloc
        printf("Total mem: %Iu \t free mem before malloc: %Iu\n", total, temp);

        cutilSafeCall( cudaMalloc3DArray(&d_volumeArray, &channelDesc, volSize) );   // allocate the array on the device

        cutilSafeCall( cudaMemGetInfo(&free, &total) );              // memory information after the malloc
        used = temp - free;                                          // bytes actually consumed by the allocation
        printf("used %Iu amount of mem\n", used);

        cutilSafeCall( cudaFreeArray(d_volumeArray) );               // free the array again
        cutilSafeCall( cudaMemGetInfo(&free, &total) );
        printf("Total mem: %Iu \t free mem after free: %Iu\n", total, free);
    }
    return 0;
}

Using this little test I got the following values for the memory allocated:

from (994, 994, 994)    to (1024, 1024, 1024): cudaMalloc3DArray uses 1073741824 bytes, which is 1024^3
from (1026, 1026, 1026) to (1056, 1056, 1056): cudaMalloc3DArray uses 1213267968 bytes, which is a strange value of about 1066.56^3
from (1058, 1058, 1058) to (1088, 1088, 1088): cudaMalloc3DArray uses 1287913472 bytes, which is 1088^3
from (1090, 1090, 1090) to (1120, 1120, 1120): cudaMalloc3DArray uses 1445068800 bytes, which is a strange value of about 1130.57^3
from (1122, 1122, 1122) to (1152, 1152, 1152): cudaMalloc3DArray uses 1528823808 bytes, which is 1152^3
from (1154, 1154, 1154) to (1184, 1184, 1184): cudaMalloc3DArray uses 1704656896 bytes, which is a strange value of about 1194.57^3
from (1186, 1186, 1186) to (1216, 1216, 1216): cudaMalloc3DArray uses 1798045696 bytes, which is 1216^3

When trying to allocate memory with an extent of (1218, 1218, 1218), the out of memory error is returned by cudaMalloc3DArray(), but 1218^3 is just 1806932232 bytes. Even if I assume that the over-allocation pattern listed above continues, the next strange number would be about 1258.58; rounding it up to 1260 would only require 2000376000 bytes of memory.

Remember that the program tells me that I have free mem: 4122873856 bytes…
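For what it's worth, the "strange" values above fit a simple pattern: every measured size comes out right if the row width is padded up to a multiple of 64 bytes and the height and depth up to multiples of 32. That is purely an assumption inferred from my numbers, not documented behavior, but this little sketch reproduces every value:

#include <cstdio>

typedef unsigned long long u64;

// Round x up to the next multiple of m.
static u64 roundUp(u64 x, u64 m) { return ((x + m - 1) / m) * m; }

// Assumed padding rule, inferred from the measured allocations above
// (NOT documented anywhere I know of): width -> multiple of 64 bytes
// for uchar elements, height and depth -> multiples of 32.
static u64 paddedBytes(u64 n)
{
    return roundUp(n, 64) * roundUp(n, 32) * roundUp(n, 32);
}

int main()
{
    const u64 probes[] = { 994, 1026, 1058, 1090, 1122, 1154, 1186, 1218 };
    for (int i = 0; i < 8; ++i)
        printf("%4I64u^3 -> %I64u bytes\n", probes[i], paddedBytes(probes[i]));
    return 0;
}

Under this rule the failing (1218, 1218, 1218) case pads to 1280 x 1248 x 1248 = 1993605120 bytes, which matches my ~1260^3 estimate above and is still less than half of the reported free memory.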

Any hints as to whether this is something I just misunderstood, or whether it's a bug that will be fixed in future releases of the toolkit or the driver itself, would be very helpful.

Thanks

Tobi

This is a known limitation of the Windows Vista and 7 WDDM subsystem. Windows runs its own GPU memory manager, and that manager limits the maximum memory allocation that can be made in a single call to much less than the capacity of the card. NVIDIA offers a compute-only driver for your C1060 that bypasses all the WDDM limits. I don't use Windows, so I can't offer more details, but the TCC driver is probably what you want to fix this.

Hi, and thanks for this fast reply!

Well, that's kind of the situation I'm facing here, if this is the case.

I had already heard of some limitations due to the WDDM of Win 7, but I was thinking more of things like Windows services not having access to the GPU, etc. I wouldn't have thought that the limitations were that big.

I just dipped into the release notes of the TCC driver you mentioned and am somewhat frustrated now. Among other restrictions, they say that graphics interoperability is not supported and that the driver is not suitable for systems that need hardware-accelerated display output.

Well, the problem is, I have a 3-GPU system here with one Quadro FX 5800 and two Tesla C1060s. So I have a system with at least one GPU that requires hardware acceleration for displays, which collides with the latter restriction. Furthermore, I need OpenGL interoperability in my application, which collides with the former…

Is there any other way to bypass the limitations of the WDDM subsystem (maybe I should ask this in a Windows forum, I guess), or do I have to switch to Linux in order to get access to all GPUs without limitations?

Thanks again

Tobi

[EDIT]

  • I googled around a bit for limitations of WDDM and CUDA and found that one can change the timeout value for the Timeout Detection and Recovery (TDR) mechanism of WDDM in the Windows registry; a sketch of that tweak follows after this list. I wonder if there is a similar trick for the amount of memory that can be allocated at once (I highly doubt it, but, well…).

  • Secondly, I now know that it works to allocate more memory in separate malloc calls. Is there any performant way to allocate memory in smaller chunks, tie it all together into one big memory block so that it would then perform as if it had been allocated in one piece in the first place (e.g. bind one texture reference to one memory handle), and then finally free the whole memory again? A slab-based workaround along those lines is sketched below as well.
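For reference, the TDR tweak from the first point lives under the GraphicsDrivers key. A sketch, using the TdrDelay value documented by Microsoft; note that this only affects the watchdog timeout, and as far as I can tell there is no analogous key for the per-allocation size limit:

Windows Registry Editor Version 5.00

; WDDM Timeout Detection and Recovery tweak (values documented by Microsoft).
; Raises the watchdog timeout from its 2 second default to 60 seconds.
; It does NOT change the maximum size of a single GPU allocation.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c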
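And for the second point, the workaround I keep reading about (sketched here untested, so treat it as a guess) is to split the volume into slabs along z, give each slab its own cudaArray and texture reference, and select the slab inside the kernel. It is not one contiguous allocation, and linear filtering across the seam would need an overlap slice, but each slab stays below the single-allocation limit:

#include <cuda_runtime.h>

typedef unsigned char uchar;

// One texture reference per slab. Texture references are static and cannot
// be indexed dynamically, so the kernel branches on the z coordinate.
texture<uchar, 3, cudaReadModeNormalizedFloat> texLo;   // slices [0, d/2)
texture<uchar, 3, cudaReadModeNormalizedFloat> texHi;   // slices [d/2, d)

__global__ void sampleVolume(float *out, int w, int h, int d)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    for (int z = 0; z < d; ++z)
    {
        float v = (z < d / 2)
                ? tex3D(texLo, x + 0.5f, y + 0.5f, z + 0.5f)
                : tex3D(texHi, x + 0.5f, y + 0.5f, (z - d / 2) + 0.5f);
        out[((size_t)z * h + y) * w + x] = v;
    }
}

// Host side: allocate and bind the two half-volumes (d assumed even here).
void setupSlabs(cudaArray **loArr, cudaArray **hiArr, int w, int h, int d)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar>();
    cudaExtent half = make_cudaExtent(w, h, d / 2);
    cudaMalloc3DArray(loArr, &desc, half);   // each half stays below the limit
    cudaMalloc3DArray(hiArr, &desc, half);
    cudaBindTextureToArray(texLo, *loArr, desc);
    cudaBindTextureToArray(texHi, *hiArr, desc);
}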

Thanks

Tobi

I read through the driver PDF, and it sounds to me like the following:

-Installing the TCC driver will override any other NVIDIA driver for every NVIDIA-based GPU.

-Devices that do not support TCC will show up as generic VGA devices that you can use for your displays (though 3D graphics support is disabled).

-Those same non-TCC devices can still support CUDA (assuming the GPU is CUDA-enabled).