How much GPU memory can cudaMalloc get?

Hello everyone,

I tried to allocate 3 GB of memory with cudaMalloc.

OS : Windows 7 64-bit
GPU (display) : Quadro FX 3800
GPU (GPGPU) : Tesla C2070
CUDA : 3.2 RC
Driver : 261.00
Compiler : Visual Studio 2008

cudaSetDevice( 0 ); // C2070

cudaError_t cu_err;
void* ptr = NULL;
size_t sz = (size_t)3 * 1024 * 1024 * 1024;

cu_err = cudaMalloc( (void**)&ptr, sz );
if( cu_err != cudaSuccess ){
	printf("%s", cudaGetErrorString( cu_err ));
	return;
}

But this code could not get 3 GB, even though the C2070 has 6 GB of memory. I'd like to know what the problem is, and whether there is a maximum allocation size.

By the way, I could allocate 2 GB with the same code.
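For reference, here is a minimal sketch of how I checked what the runtime reports as free before attempting the big allocation (cudaMemGetInfo is a standard runtime API call):

size_t free_b = 0, total_b = 0;
cudaSetDevice( 0 ); // C2070
if( cudaMemGetInfo( &free_b, &total_b ) == cudaSuccess ){
	printf("free: %llu bytes, total: %llu bytes\n",
	       (unsigned long long)free_b, (unsigned long long)total_b);
}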

Best regards,


While I am no expert in working with CUDA on Windows, that sounds like it might well be a WDDM limitation. Windows Vista and later have their own GPU memory manager, which imposes additional limits on how much memory a process can grab in a single allocation call. There is a dedicated compute driver (TCC) for Tesla cards on Windows that might let you bypass these limits, but I stress that this is just a guess as to what might be going on.
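One way to check which driver model a device is running under is the tccDriver field of cudaDeviceProp (a minimal sketch; the field is 1 under the TCC compute driver, 0 under WDDM):

cudaDeviceProp prop;
if( cudaGetDeviceProperties( &prop, 0 ) == cudaSuccess ){
	printf("%s: %s driver\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
}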


I remember there was a limitation… In one of the releases, I think they fixed (or relaxed) it. You may want to check the latest CUDA 3.2 release notes and compare them with the 3.1 release notes.


I found this sentence in the release notes.

I guess this is the answer to my question.

Thank you,


Switch to TCC on the C2070 and you’ll be able to allocate quite a bit of memory.
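If I remember correctly, the switch is done with nvidia-smi (run as administrator, reboot required, and only supported on some GPUs), something like:

nvidia-smi -i 0 -dm 1

where -dm 1 selects TCC and -dm 0 selects WDDM.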


What is the maximum amount of GPU memory cudaMalloc can get, e.g. for a GPU with 8 GB of GDDR6 on board, running under TCC mode on Linux? Thanks a lot.

You’ll have to discover this experimentally. There are no published data and no formulas that can be used.
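A minimal sketch of such an experiment: binary-search for the largest single cudaMalloc that succeeds, using the total reported by cudaMemGetInfo as the upper bound. The result will vary with driver, OS, and whatever else is using the GPU.

size_t lo = 0, hi = 0, free_b = 0;
cudaMemGetInfo( &free_b, &hi );          // hi = total device memory
while( lo < hi ){
	size_t mid = lo + (hi - lo + 1) / 2; // candidate size, always > lo
	void* p = NULL;
	if( cudaMalloc( &p, mid ) == cudaSuccess ){
		cudaFree( p );
		lo = mid;                        // mid bytes worked; try larger
	} else {
		cudaGetLastError();              // clear the allocation error
		hi = mid - 1;                    // mid bytes failed; try smaller
	}
}
printf("largest single allocation: %llu bytes\n", (unsigned long long)lo);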

Note that TCC applies to Windows systems only, it does not apply to Linux systems. As @Robert_Crovella says, the maximum size of a single allocation for a particular system configuration cannot be established a priori. But here are some experimentally determined numbers to give you a rough idea. This is from a system running Windows 10 Professional with 32 GB of system memory, CUDA 11.x, idling with only the desktop running.

Quadro RTX 4000, WDDM driver: 7.25 GB (7.78e9 bytes) out of 8 GB provided by the hardware. About 90%.
Quadro P2000, TCC driver: 4.85 GB (5.20e9 bytes) out of 5 GB provided by the hardware. About 97%.

When using the WDDM driver, GPU memory allocations are serviced by the Windows operating system’s memory allocator. When using the TCC driver (not possible with all GPUs!), the driver provides its own allocation mechanism, i.e. the operating system’s mechanism is bypassed. For reasons unknown to me, the maximum size of GPU memory allocations when using the WDDM driver always seems to be significantly smaller than when using the TCC driver.

For Linux, I would expect the maximum size of a GPU memory allocation to be more in line with the TCC driver scenario, so maybe 95% of the memory provided by the hardware. This is a guesstimate, not a guarantee. Factors such as GPU memory usage by other tasks (including a GUI), total amount of system memory, and internal fragmentation in the allocator, could all play a role in what is available to a CUDA application. You would want to write your software such that it functions with any amount of memory and exits cleanly in the worst case (unable to run with the amount of memory found).

njuffa, thanks.
In fact, before our main GPU app runs, we need to run a short video-memory-test application on the SMs (exploiting the high memory bandwidth of global memory) to guarantee that all memory cells of the on-board GDDR chips are good. In this short stage we can use a simple, clean environment: no other tasks (including a GUI), no use of GPU local memory, and no internal fragmentation in the allocator. The goal is for this small video-memory-test app to cover the entire memory space provided by the on-board GDDR chips. So we could call cudaMalloc several times to obtain several segments, with one goal: together, the segments obtained via cudaMalloc should cover the entire GDDR memory space.
Can our video-memory-test achieve this goal?

No this is not possible. CUDA reserves some device memory for its own use.

I don't have any ideas of how to do that through CUDA, just as I don't know how to do it for the system memory of my computer through C++'s malloc. Once an OS takes control of memory, there is usually no way for a user application to get access to the entire physical memory. This certainly applies to GPUs using the WDDM driver, where the Windows operating system (not the CUDA runtime) has complete control over memory allocations.

The typical way to side-step this limitation is to perform hardware tests prior to OS boot, and for that purpose all my systems have hardware tests accessible via the BIOS setup after a cold start. There are multiple existing GPU memory test apps out there (and have been for many years), but to my knowledge none of them can test the full physical memory.
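The core loop of such a test can be sketched as follows (an illustration, not a definitive implementation): grab shrinking chunks until cudaMalloc yields nothing more, pattern-test the segments obtained, and accept that the runtime's own reservation stays uncovered.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main( void )
{
	std::vector<void*> segs;
	size_t chunk = (size_t)1 << 30;      // start with 1 GiB requests
	size_t covered = 0;
	while( chunk >= ((size_t)1 << 20) ){ // give up below 1 MiB
		void* p = NULL;
		if( cudaMalloc( &p, chunk ) == cudaSuccess ){
			cudaMemset( p, 0xA5, chunk ); // write a test pattern
			segs.push_back( p );
			covered += chunk;
		} else {
			cudaGetLastError();           // clear the allocation error
			chunk /= 2;                   // retry with smaller pieces
		}
	}
	printf("covered %llu bytes\n", (unsigned long long)covered);
	// verify the pattern here (e.g. with a checking kernel), then clean up
	for( size_t i = 0; i < segs.size(); i++ ) cudaFree( segs[i] );
	return 0;
}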

If your use case relies on GPU memory working absolutely flawlessly, consider deploying GPUs with ECC support. With ECC single-bit errors can be fixed on the fly, while double-bit errors are detected which can be used to halt operations.
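Whether ECC is currently enabled can be queried from cudaDeviceProp (a sketch; the ECCEnabled field is 1 when ECC is on). On supported GPUs, ECC can be toggled with nvidia-smi -e 1 / -e 0 (reboot required, if I recall correctly).

cudaDeviceProp prop;
if( cudaGetDeviceProperties( &prop, 0 ) == cudaSuccess ){
	printf("ECC is %s\n", prop.ECCEnabled ? "enabled" : "disabled");
}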

Got it!
ECC is also a suitable option for our app.
Thanks for the quick reply!

Got it! Your info is also key for us.
Thanks for the quick reply!