Device memory size

Dear All

How can we determine how large a global memory block (one that can be accessed from any block and thread) we can allocate?

In our previous FBO-based implementation, we suffered from the limit on texture size.


You are basically limited only by the video memory on the graphics card. There is no limit like “texture size” in CUDA, because the memory is linearly allocated.

As on the CPU, you will have to allocate the size you need, and check the error value returned by cudaMalloc to make sure it succeeds.
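For example, a minimal allocation with an error check might look like this (the 512 MB size is purely illustrative):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 512u * 1024 * 1024;  /* 512 MB -- illustrative size */
    float *d_data = NULL;

    /* cudaMalloc returns an error code rather than throwing; check it. */
    cudaError_t err = cudaMalloc((void **)&d_data, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                bytes, cudaGetErrorString(err));
        return 1;
    }

    /* ... use d_data in kernels ... */

    cudaFree(d_data);
    return 0;
}
```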

There is overhead due to the CUDA runtime and display driver, and if the desktop is extended onto the GPU, due to the desktop / window manager, as well as any graphics apps that are currently running.


This explains why on my GTS I can “only” allocate 582 MB. But I was surprised that a 1600x1200 32-bit 2D desktop and the GDI/OGL/D3D drivers eat up 58 MB (with no other graphical apps running).

When I switched the desktop resolution to 320x200 8-bit (yes, this works with WinXP, but it’s hard even to use the command line on a screen this small), I was able to allocate 589 MB of linear space for CUDA.

Looking forward to March 05. I hope there will be a Quadro FX with 1 GB.



A related question:

I understand that CUDA buffers are statically allocated and that I have to implement memory management myself. OpenGL textures, however, can be swapped in and out by the OGL driver. So if I start allocating CUDA buffers when GPU memory is already nearly exhausted, will it swap out OGL textures (assuming the desktop is on the same card), CUDA textures borrowed from OGL (giving dangling pointers :w00t: ), or OGL FBOs?
Or will the CUDA alloc just fail?


Under low memory conditions, the CUDA malloc should start failing.

Note that CUDA cannot currently directly access OpenGL textures or FBOs, only buffer objects.

Is there a recommended way to determine the global memory that is available for use (i.e., total memory minus the overhead of the drivers, CUDA runtime, etc.)? I haven’t run into any such API, so my best guess right now is to just try allocating larger blocks until allocation fails.

Also, I need to allocate several large buffers that have the same lifetime. Is it better to coalesce these allocations into a single big allocation, or to make them separately?
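On the coalescing question, one common pattern (a sketch, not an official recommendation; sizes are made up) is to make a single cudaMalloc and carve it into sub-buffers with pointer arithmetic, so one allocation and one cudaFree cover all the buffers that share a lifetime:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    /* Illustrative element counts; pad each sub-buffer to a 256-byte
       boundary, matching the alignment cudaMalloc itself guarantees. */
    const size_t nA = 1000, nB = 2000, nC = 3000;
    const size_t align = 256;
    size_t bytesA = ((nA * sizeof(float) + align - 1) / align) * align;
    size_t bytesB = ((nB * sizeof(float) + align - 1) / align) * align;
    size_t bytesC = nC * sizeof(float);

    char *pool = NULL;
    if (cudaMalloc((void **)&pool, bytesA + bytesB + bytesC) != cudaSuccess)
        return 1;

    /* Carve the pool into three aligned sub-buffers. */
    float *d_A = (float *)pool;
    float *d_B = (float *)(pool + bytesA);
    float *d_C = (float *)(pool + bytesA + bytesB);

    /* ... use d_A, d_B, d_C in kernels ... */
    (void)d_A; (void)d_B; (void)d_C;

    cudaFree(pool);  /* releases all three at once */
    return 0;
}
```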


See p. 97 of the Programming Guide.
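The page reference presumably points at the memory-info query. In current CUDA runtimes this is exposed as cudaMemGetInfo (the driver API equivalent is cuMemGetInfo); a minimal sketch:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes = 0, total_bytes = 0;

    /* Reports the free and total device memory visible to the
       current context -- i.e., total minus driver/desktop overhead. */
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("free: %zu MB, total: %zu MB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Note that the reported free amount is a snapshot; fragmentation or concurrent graphics activity can mean a single allocation of that full size still fails.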

Another related question:

Sometimes while debugging a kernel something goes wrong (due to programming errors) and the program exits with an exception. And sometimes, because of the exception, the cudaFree calls are not executed. It seems to me that not freeing used memory reduces the overall available free memory, even though the thread that allocated it is no longer running. Is this a correct observation?
And if so, is there a way to, say, free the memory used by CUDA? My workaround so far is to reboot the machine when this happens (infrequent, but rather annoying when it does).

I have never observed a loss of available GPU memory after repeatedly killing CUDA processes mid-computation. The driver frees used GPU memory after your process terminates. Are you seeing something different? (Note: I’m using CUDA on 64-bit Linux.)


I used to run into this problem with CUDA 0.8, but I reported the problem and it seems to have been fixed in CUDA 1.0, at least for the case that I had been experiencing. I haven’t had this occur in several months now since CUDA 1.0 came out. One big change in our installation here since 0.8 is that we’re using 64-bit kernels on our systems now. When I had been experiencing the problem in the early versions of CUDA, we were using 32-bit kernels/drivers, etc. I don’t know if this has any bearing on the problem, but thought I’d mention it.

John Stone

I can now confirm the described behavior. However, it’s my fault: I just realized that I’m doing this on a WXP x64 machine, which is currently not supported. So the driver I’m using seems unable to free the memory after the thread terminates. It should work fine (as described here) when using a supported OS with the correct driver.

I am new to CUDA. Please help me clarify the following points:

  1. What is the relationship between the device memory and the global memory, constant memory, and texture memory? Is the following formula correct?

Device memory (bytes) = global memory (bytes) + constant memory (bytes) + texture memory (bytes)?

  2. The GeForce 8800 GTX has 64 KB of constant memory. Will a CUDA application share the 64 KB of constant memory with other processes, such as the display driver?

  3. What is the texture memory size? Is it the part of the global memory bound to textures? In other words, does the texture memory size vary with the textures bound to the device memory space allocated by cudaMalloc() or cudaMallocArray()?

  4. After a cudaArray is bound to a texture, I can read the cudaArray data through tex2D(), fetching it from texture memory. If the cudaArray is not bound to a texture, can I still access the cudaArray data (read/write)? If so, please let me know how; if not, please let me know why.