Cannot allocate "all" memory? cudaMalloc fails with 50MB of memory left

I have an 8600GT card with 256MB of memory and for some reason cannot allocate all of its device memory for CUDA purposes. I understand about surface memory, etc., but even at runlevel 3 on Linux (no X server running) things don't work.

The maximum amount I can allocate is about 200MB out of the 256MB; everything above this returns "out of memory". Surface memory would be, for me, 1280×1024×32 bits = 5120KB, not the ~60,000KB the test below suggests…

A very simple example:

#include <cuda_runtime.h>
#include <stdio.h>

#include "cutil.h"

int main() {
    cudaDeviceProp prop;
    CUDA_SAFE_CALL(cudaGetDeviceProperties(&prop, 0));

    size_t size = 197 * (1024 * 1024);
    float *d_dout;  /* ordinary host-side pointer to device memory;
                       taking the address of a __device__ symbol and
                       passing it to cudaMalloc does not work */

    printf("Total: %zuB Allocated: %zuB (difference: %zuKB)\n",
           (size_t)prop.totalGlobalMem, size,
           ((size_t)prop.totalGlobalMem - size) / 1024);

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_dout, size));
    return 0;
}
Running this produces:

Now I do NOT believe that my X server is using 60MB of surface memory on my card, nor that something else is, given that I am not even running an X server. Does anyone have an idea what causes this, and more importantly, how I can solve it?

Edit: I just ran the same test on Windows, and there I could allocate 212MB, so 15MB more than on Linux x86_64. Still far from the ~240MB I would deem reasonable :(

I would imagine a large portion of the space that you can’t allocate is taken by the framebuffer. The bigger the monitor, the bigger the framebuffer. It might be possible to tell the driver not to allocate a framebuffer, thus turning your computer into a headless server. Hopefully that would free up even more space.

In my experience, X takes up a lot of space (pixmaps and backing store for all windows, for example). But seeing as running CUDA applications with an X server running is not a supported configuration, I was forced to set up another computer that runs without X, so I don't have X-taking-up-space problems anymore.

I understand that. The programming guide says:

Now I read that as the GPU dedicating 7.68MB to the primary surface (this matches my resolution as well), leaving ~245MB free for other uses. That would be fine. However, I cannot account for the difference between this theoretical maximum and what I actually see: about 200MB. Turning off the X server on a Linux machine increases this limit by about 4-5MB, but no more.

Ah. I'm sorry, I didn't do the math there. I too have tried the exact same test you've just done and discovered a large amount of 'unallocatable' space on the card that isn't accounted for by the framebuffer or display services. I wasn't able to get a full answer from NVIDIA about the cause either. My suspicion is that the driver allocates it for something, but I'm just guessing.

Well, I've also decided to contribute: my NVS 140M (128MB) can allocate about 78MB on the GPU; 79MB and above fails.

There seems to be about 50MB of unallocatable space. I'm running XP at 1440x900x32bit, so some memory is used by the framebuffer, but that's not 50MB ;)

BTW: I was told that this card can use up to a total of 512MB of system RAM when it needs to. Can I 'force' that in CUDA somehow?

Can someone from NVIDIA shine a light on this?

It also seems that after a few failed runs you are able to allocate less and less memory. After starting XP I could allocate about 200MB, but after a few tests the maximum I got was 150MB…

I’m going to try an older driver to see how that behaves.
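On the question of 'forcing' CUDA to use system RAM: on later toolkits (CUDA 2.2 and up), mapped pinned memory ("zero-copy") lets kernels address host RAM directly on devices that support it. This is not the automatic TurboCache-style spilling the card does for graphics; just a sketch of the relevant API calls, with an arbitrary 64MB buffer size:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    /* Must be set before the CUDA context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("Device cannot map host memory\n");
        return 1;
    }

    /* Pinned host buffer that the GPU can address directly. */
    float *h_buf, *d_alias;
    cudaHostAlloc((void**)&h_buf, 64 << 20, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);
    /* d_alias can now be passed to kernels; every access goes
       over the bus, so it is much slower than device memory. */

    cudaFreeHost(h_buf);
    return 0;
}
```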


One possible reason for something like this could be memory fragmentation, as CUDA itself takes up some memory. Have you tried allocating in small chunks, say 1MB, to see how much you can allocate before you're out of memory?


When I performed this test, I allocated 1MB chunks in linear memory and also texture memory. Not surprisingly I was able to allocate fewer 1MB textures than linear memory allocations (probably texture memory overhead I’m guessing). Anyway, the point is, even with 1MB allocations I got similar results to Ojiisan.
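For anyone who wants to reproduce the chunked test, a minimal probe along these lines should do (the 1MB chunk size and 4096 slot cap are arbitrary choices):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const size_t chunk = 1 << 20;  /* 1MB per allocation */
    void *ptrs[4096];              /* enough slots for 4GB worth */
    int n = 0;

    /* Keep allocating 1MB blocks until cudaMalloc reports failure. */
    while (n < 4096 && cudaMalloc(&ptrs[n], chunk) == cudaSuccess)
        ++n;

    printf("Allocated %d MB in 1MB chunks before failure\n", n);

    /* Free everything so the driver gets its memory back. */
    for (int i = 0; i < n; ++i)
        cudaFree(ptrs[i]);
    return 0;
}
```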

I get the same results even on a Tesla C870 board. That one is not even used for display (it has no graphics output), but if I call cuMemGetInfo at the very beginning of my program, before allocating anything, it reports the total memory minus 50-60MB as free.

With the 1.5GB of memory on the Tesla this is not too bad, but on my 8400M GS (256MB), 50MB is a lot.

So any explanation would be welcome.

And like QmQ mentioned I would also be interested in using the “shared” CPU RAM which should be possible on laptops.

Indeed. cuMemGetInfo() reports a total of 255MB of memory but only 212MB free. Where did 40MB of my card's memory go? :(
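The runtime-API equivalent shows the same gap before any allocation; a minimal check looks like this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    size_t free_b = 0, total_b = 0;
    /* Ask the runtime how much device memory is free vs. present. */
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("free: %zu MB / total: %zu MB (gap: %zu MB)\n",
           free_b >> 20, total_b >> 20, (total_b - free_b) >> 20);
    return 0;
}
```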

Edit: a bit off-topic, but after my CUDA runs I started getting less and less available free memory. First I was about to blame NVIDIA/CUDA for it, but I figured out the cause: it seems iTunes is leaking D3D memory like hell. When you switch the cover view it starts cannibalizing megabyte after megabyte… Thank god I figured this out; now the program stays shut down :)