cuMemAlloc limited to 1/4 total GPU memory?

Mr_Nuke · January 12, 2010, 3:56pm

I’ve noticed that cuMemAlloc sometimes fail, even though I’m requesting less memory than declared available by cuMemGetInfo. When playing around with OpenCL, I found a property that gives the maximum amount of device memory that can be allocated with one Malloc call, and on my 9800GT, it just happens to be 128MB, or 1/4 of the total global memory. After a bit of mangling, I found that the limit in CUDA is also 128MB per cuMemAlloc call.

I found no mention of this in the CUDA documentation, and no way to get information about this limit with CUDA.

Using only 128MB when the driver reports 450+MB available is at the very least frustrating/annoying, and using two or three allocations for the same buffer adds complexity to the kernel I’m unwilling to introduce.

Is there a way to bypass this limitation?

eyalhir74 · January 12, 2010, 7:26pm

I’ve noticed that cuMemAlloc sometimes fail, even though I’m requesting less memory than declared available by cuMemGetInfo. When playing around with OpenCL, I found a property that gives the maximum amount of device memory that can be allocated with one Malloc call, and on my 9800GT, it just happens to be 128MB, or 1/4 of the total global memory. After a bit of mangling, I found that the limit in CUDA is also 128MB per cuMemAlloc call.

I found no mention of this in the CUDA documentation, and no way to get information about this limit with CUDA.

Using only 128MB when the driver reports 450+MB available is at the very least frustrating/annoying, and using two or three allocations for the same buffer adds complexity to the kernel I’m unwilling to introduce.

Is there a way to bypass this limitation?

Hi,

I havent used cuMemAlloc, usually i use the cudaMalloc and never had any problem allocating such big arrays. I get to allocate 1-2GB of array on a

C1060 with no problem. One other issue is memory partitioning.

Can you use the cudaMalloc method?

eyal

tmurray · January 12, 2010, 7:27pm

What OS?

avidday · January 12, 2010, 8:25pm

That doesn’t gel with my experience at all. Most of my linear algebra codes use my own memory manager, and the first thing it does is make a single allocation call to reserved every last free byte from value returned by cuMemGetInfo(). On compute dedicated cards, that means 896Mb, 1Gb, or 1.8Gb in a single call. Never seen anything like that on any CUDA 2.x version on Linux.

Mr_Nuke · January 12, 2010, 8:33pm

Server 2008 R2 Enterprise x64. CUDA 2.3, 195.62 drivers.

@eyalhir74

I can’t use runtime API functions in my such as cudaMalloc in my program, as I’m using the driver API.

tmurray · January 12, 2010, 9:07pm

Welcome to WDDM. There is a limit on the maximum size of a single allocation due to Vista requirements. This is fixed in the upcoming Tesla compute-only drivers for WDDM OSes.

Mr_Nuke · January 12, 2010, 10:28pm

Touche! Never imagined it would be a Windows issue, but doesn’t surprise me at all. This reminds me of the error message in my signature.

But the Tesla drivers would only work for Tesla cards, so basically, there is absolutely no way to overcome this limitation with GeForces under Windows, right?

_Big_Mac · January 12, 2010, 10:32pm

It would be very useful for initial development to have GeForce non WDDM drivers even if that meant that the card couldn’t act as a video adapter.

tmurray · January 12, 2010, 10:32pm

Correct.

Mr_Nuke · January 12, 2010, 10:56pm

I second that, though I believe NVIDIA would rather have us use Tesla cards. Too bad I’m just a student with no life and no money; I’d definitely get some of the Fermi-based Teslas if I could afford them.

_PM · April 1, 2010, 10:21am

Which version of the Tesla drivers was the first one to include this fix?