I’ve noticed that cuMemAlloc sometimes fails, even though I’m requesting less memory than cuMemGetInfo reports as available. When playing around with OpenCL, I found a device property that gives the maximum amount of device memory that can be allocated with a single malloc call, and on my 9800GT it happens to be 128MB, or 1/4 of the total global memory. After a bit of experimentation, I found that the limit in CUDA is also 128MB per cuMemAlloc call.
I found no mention of this in the CUDA documentation, and no way to get information about this limit with CUDA.
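For reference, this is roughly how I queried the limit in OpenCL — a minimal sketch with error checking trimmed, assuming a single GPU platform/device; the property is CL_DEVICE_MAX_MEM_ALLOC_SIZE:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong max_alloc = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* CL_DEVICE_MAX_MEM_ALLOC_SIZE is the largest single buffer the
       device guarantees; the OpenCL spec only requires it to be at
       least 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, which matches the 128MB
       I see on a 512MB card. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    printf("Max single allocation: %llu bytes\n",
           (unsigned long long)max_alloc);
    return 0;
}
```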
Being limited to 128MB when the driver reports 450+MB available is at the very least frustrating, and splitting the same buffer across two or three allocations adds complexity to the kernel that I’m unwilling to introduce.
That doesn’t gel with my experience at all. Most of my linear algebra codes use my own memory manager, and the first thing it does is make a single allocation call to reserve every last free byte reported by cuMemGetInfo(). On compute-dedicated cards, that means 896MB, 1GB, or 1.8GB in a single call. I’ve never seen anything like the behaviour you describe on any CUDA 2.x version on Linux.
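The approach boils down to something like this (driver API sketch, error handling mostly omitted; the 16MB safety margin is my own habit, not anything documented — asking for literally every last byte can still fail because the driver needs some memory for itself):

```c
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr pool;
    size_t free_bytes, total_bytes;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Ask the driver how much is free right now... */
    cuMemGetInfo(&free_bytes, &total_bytes);

    /* ...and grab (nearly) all of it in one cuMemAlloc call. */
    size_t request = free_bytes - (16 << 20);
    CUresult rc = cuMemAlloc(&pool, request);
    printf("Requested %zu bytes: %s\n", request,
           rc == CUDA_SUCCESS ? "OK" : "failed");

    if (rc == CUDA_SUCCESS)
        cuMemFree(pool);
    cuCtxDestroy(ctx);
    return 0;
}
```

The memory manager then hands out sub-ranges of that one big block itself, so the per-call behaviour of cuMemAlloc stops mattering after startup.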
I second that, though I believe NVIDIA would rather have us use Tesla cards. Too bad I’m just a student with no life and no money; I’d definitely get some of the Fermi-based Teslas if I could afford them.