cudaMalloc3DArray out of memory: cannot allocate the available amount of memory


At the moment I have some trouble allocating memory for a 3D texture.

System: Win 7 x64

System Memory: 12 GB

CPU: Core i7 920 @ 2.67 GHz

Chipset: x58

Cuda Device: Quadro FX 5800 and Tesla C1060 each with 4 GB of Memory

Toolkit Version: 3.2

SDK Version: 3.2

C++: MS Visual Studio 2008 SP1

Both of the cards I have tested have a compute capability of 1.3. I also made sure

that the nvcc flag is set accordingly to compile for 1.3.

The Programming Guide states that the maximum allowed size of a texture

bound to a 3D array is 2048 x 2048 x 2048.

However, when trying to allocate more than 1216 x 1216 x 1216, my program terminates with

an out-of-memory exception, although I have 4 GB of video memory on each of the devices.

This is the memory information from within the program:

Total mem: 4261085184 free mem: 4122873856

What I basically did to test the maximum size I can allocate is:

typedef unsigned char uchar;

texture<uchar, 3, cudaReadModeNormalizedFloat> tex;    // 3D texture for the volume
cudaArray *d_volumeArray = 0;                          // memory for the volume data
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>();

size_t total, free, temp, used;

cutilSafeCall( cudaMemGetInfo(&free, &total) );
printf("Total mem: %Iu \t free mem: %Iu\n", total, free);   // %Iu is MSVC's format specifier for size_t

for (int i = 1; i < 1024; ++i)
{
	cudaExtent volSize = make_cudaExtent(i * 2, i * 2, i * 2);                 // 2, 4, 6, 8, ...

	cutilSafeCall( cudaMemGetInfo(&temp, &total) );                            // memory information before the malloc
	printf("Total mem: %Iu \t free mem before malloc: %Iu\n", total, temp);

	cutilSafeCall( cudaMalloc3DArray(&d_volumeArray, &channelDesc, volSize) ); // allocate the array on the device

	cutilSafeCall( cudaMemGetInfo(&free, &total) );                            // memory information after the malloc
	used = temp - free;                                                        // bytes consumed by this allocation
	printf("used %Iu bytes of mem\n", used);

	cutilSafeCall( cudaFreeArray(d_volumeArray) );                             // free the array again

	cutilSafeCall( cudaMemGetInfo(&free, &total) );                            // update memory info
	printf("Total mem: %Iu \t free mem after free: %Iu\n", total, free);
}


Using this little test, I got the following values for the memory allocated:

from (994, 994, 994) to (1024, 1024, 1024), cudaMalloc3DArray uses 1073741824 bytes, which is 1024^3

from (1026, 1026, 1026) to (1056, 1056, 1056), cudaMalloc3DArray uses 1213267968 bytes, which is the strange value 1066.560700...^3 (= 1056 x 1056 x 1088)

from (1058, 1058, 1058) to (1088, 1088, 1088), cudaMalloc3DArray uses 1287913472 bytes, which is 1088^3

from (1090, 1090, 1090) to (1120, 1120, 1120), cudaMalloc3DArray uses 1445068800 bytes, which is the strange value 1130.566661...^3 (= 1120 x 1120 x 1152)

from (1122, 1122, 1122) to (1152, 1152, 1152), cudaMalloc3DArray uses 1528823808 bytes, which is 1152^3

from (1154, 1154, 1154) to (1184, 1184, 1184), cudaMalloc3DArray uses 1704656896 bytes, which is the strange value 1194.571987...^3 (= 1184 x 1184 x 1216)

from (1186, 1186, 1186) to (1216, 1216, 1216), cudaMalloc3DArray uses 1798045696 bytes, which is 1216^3
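For what it's worth, the "strange" byte counts above are all consistent with a simple padding model: width and height rounded up to the next multiple of 32 elements, and depth rounded up to the next multiple of 64. That model is just my guess from the numbers, not anything documented, but it reproduces every measured value. The helper names below are mine:

```cpp
#include <cstddef>

// Round x up to the next multiple of m.
inline size_t roundUp(size_t x, size_t m) { return ((x + m - 1) / m) * m; }

// Guessed model of the observed cudaMalloc3DArray padding for uchar elements:
// width and height padded to a multiple of 32, depth to a multiple of 64.
inline size_t paddedArrayBytes(size_t w, size_t h, size_t d) {
    return roundUp(w, 32) * roundUp(h, 32) * roundUp(d, 64);
}
```

For example, paddedArrayBytes(1026, 1026, 1026) gives 1056 * 1056 * 1088 = 1213267968, exactly the measured value for that range.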

When trying to allocate memory with an extent of (1218, 1218, 1218), cudaMalloc3DArray() returns the out-of-memory error,

but 1218^3 is just 1806932232 bytes. Even if I assume that the over-allocation listed above continues, the

next strange number should be somewhere around 1258.577..., and rounding it up to 1260 would still only require 2000376000 bytes of memory.

Remember that the program tells me I have free mem: 4122873856 bytes…

Any hints as to whether this is something I just misunderstood, or whether it's a bug that will be fixed in a future release of the toolkit or

the driver itself, would be very helpful.



This is a known limitation of the Windows Vista and 7 WDDM subsystem. Windows runs its own GPU memory manager, and that manager limits the maximum memory allocation that can be made in a single call to much less than the capacity of the card. NVIDIA offers a compute-only driver for your C1060 that bypasses all the WDDM limits. I don't use Windows, so I can't offer more details, but the TCC driver is probably what you want to fix this.

Hi, and thanks for this fast reply!

Well, that's quite a situation I'm facing here, if this is the case.

I had already heard of some limitations due to the WDDM of Win 7, but I was thinking more of things like Windows services

that do not have access to the GPU, etc. I wouldn't have thought that the limitations were that big.

I just dipped into the release notes of the TCC driver you mentioned and am somewhat frustrated now. It says that:

Well, the problem is, I have a 3-GPU system here with one Quadro FX 5800 and two Tesla C1060s.

So I have a system with at least one GPU that requires hardware acceleration for displays, which collides with point 3 of

the quoted list. Furthermore, I need OpenGL interoperability in my application, which collides with point 2…

Is there any other way to bypass the limitations of the WDDM subsystem (maybe I should ask this in a Windows forum, I guess), or

do I have to switch to Linux in order to get access to all GPUs without limitations?

Thanks again



  • I googled around a bit for limitations of WDDM and CUDA and found that one can change the timeout value for the

    TIMEOUT DETECTION and RECOVERY mode of WDDM in the Windows registry. I wonder if there is a similar trick for the amount

    of memory that can be allocated at once (I highly doubt it, but, well…).

  • Secondly, I now know that it works to allocate more memory in separate malloc calls. Is there any performant way to

    allocate memory in smaller chunks and tie it all together into one big memory block, so that it would then perform

    as if it had been allocated in one piece in the first place (e.g. bind one texture reference to one memory handle),

    and then finally free the whole memory again?
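To make the second point concrete, here is the kind of chunking I have in mind: split the volume into slabs along z, allocate each slab with its own cudaMalloc3DArray call (each small enough for WDDM), bind each slab to its own texture reference, and have the kernel pick the slab from the global z coordinate. The bookkeeping itself is plain host code; the names below (numSlabs, slabForZ) are made up for illustration:

```cpp
#include <cstddef>

// Which slab a global z coordinate falls into, and the z inside that slab.
struct SlabIndex {
    int slab;    // index of the slab along the z axis
    int localZ;  // z coordinate within that slab
};

// Number of slabs needed to cover `depth` slices with at most
// `maxSlabDepth` slices per slab (ceiling division).
inline int numSlabs(int depth, int maxSlabDepth) {
    return (depth + maxSlabDepth - 1) / maxSlabDepth;
}

// Map a global z coordinate to its slab and local z.
inline SlabIndex slabForZ(int z, int maxSlabDepth) {
    SlabIndex s;
    s.slab = z / maxSlabDepth;
    s.localZ = z % maxSlabDepth;
    return s;
}
```

For a 1218-deep volume and slabs of 512 slices, this gives 3 slabs, with z = 1217 landing in slab 2 at local z = 193. The obvious drawback I see: hardware trilinear filtering would not interpolate across slab boundaries, so the slabs would probably have to overlap by one slice, and on compute capability 1.3 each slab needs its own statically declared texture reference.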



I read through the driver PDF, and it sounds to me like the following:

- Installing the TCC driver will override any other NVIDIA driver for every NVIDIA-based GPU.

- Devices that do not support TCC will show up as generic VGA devices that you can use for your displays (though 3D graphics support is disabled).

- Those same non-TCC devices can still support CUDA (assuming the GPU is CUDA-enabled).
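If it helps to verify what you actually end up with after installing: if I read the 3.2 headers correctly, cudaDeviceProp gained a tccDriver field in CUDA 3.2, so you can query per device whether it is running under the TCC driver. A minimal sketch (untested on my side):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.tccDriver is 1 if the device runs the TCC (compute-only) driver
        printf("Device %d: %s, TCC driver: %s\n",
               dev, prop.name, prop.tccDriver ? "yes" : "no");
    }
    return 0;
}
```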