What happens if memory is full?


It's all in the title. I mean, if the shared memory of a block is full, what happens? Does the program stop, or is global memory used instead?

And I guess that when global memory is full we get a memory error, such as 'unspecified launch failure'.

How do you expect shared memory to become "full"? There are no dynamic allocations from shared memory at runtime. The only way shared memory might not be large enough is when the static allocation plus the allocation specified at kernel launch exceeds the shared memory per SM, in which case the kernel does not launch at all.

EDIT: Correct typo that distorted the meaning.

Assuming you aren’t using malloc() in your kernel code, you will get a specific out of memory error code (cudaErrorMemoryAllocation, I believe) from cudaMalloc() if your request exceeds the available global memory. (Remember to check all those return codes, even if you just plan to abort!) An unspecified launch failure indicates something more like a memory access violation.
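To make the error-checking advice concrete, here is a minimal host-side sketch. The oversized request size is hypothetical, chosen only to force the failure; the point is that `cudaMalloc()` reports it cleanly through its return code rather than through a launch failure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *d_buf = NULL;
    // Deliberately absurd request (1 TiB) to illustrate the failure mode.
    size_t huge = (size_t)1 << 40;

    cudaError_t err = cudaMalloc((void **)&d_buf, huge);
    if (err != cudaSuccess) {
        // On virtually any card this prints the out-of-memory error
        // (cudaErrorMemoryAllocation) -- check it, then abort gracefully.
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```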

The only way a kernel can hit a shared memory limit is if it requests more than the total configured shared memory per multiprocessor (16 kB on compute capability < 2.0, 16 or 48 kB on compute capability 2.x, and 16, 32, or 48 kB on compute capability 3.0). In that case, the kernel will fail to launch, and the next CUDA function should return cudaErrorInvalidConfiguration. Because shared memory is not dynamically allocated, it is impossible for a kernel to hit an “out of shared memory” condition while running.
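A short sketch of that failure mode, assuming a made-up kernel: the third launch-configuration parameter requests dynamic shared memory, and asking for far more than any multiprocessor has makes the launch itself fail. The kernel never runs, and the error surfaces on the next CUDA call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *out)
{
    extern __shared__ float smem[];   // sized by the launch configuration
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    float *d_out;
    cudaMalloc((void **)&d_out, 64 * sizeof(float));

    // Request 1 MB of dynamic shared memory -- far beyond the per-SM limit,
    // so the kernel fails to launch at all.
    kernel<<<1, 64, 1024 * 1024>>>(d_out);

    cudaError_t err = cudaGetLastError();
    printf("launch status: %s\n", cudaGetErrorString(err));
    cudaFree(d_out);
    return 0;
}
```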


Yes, I think that's it. I'm not sure what SM means? I still have some trouble with memory (as you can see); I hope I'm being clear enough.

I did some computing, and I can launch a kernel even if the static allocation + dynamic allocation is greater than the total amount of shared memory per block given by deviceQuery (49152 bytes). Indeed, I allocate (with malloc, so dynamically? and with floats, for example) at least 1740 bytes per thread and launch 54 threads per block (XBLOCK = 54), which gives 54 * 1740 = 93960 bytes per block (so in shared memory?) > 49152 bytes.

I may have misunderstood the way shared memory is used.


I tried to do a memcpy after my kernel, and it returns 'unspecified launch failure' if I increase XBLOCK or XGRID (I have YBLOCK and YGRID equal to 1). But with small values for XBLOCK and XGRID (54 and 35, for example) my program works (slowly, but it works).

For each thread I make 2 mallocs of 206*sizeof(float) = 1664 bytes, so maybe that is too big? If I increase XBLOCK, some mallocs fail and I get 'unspecified launch failure' after the kernel (you will say, of course, that the array is not defined because malloc failed).

I really think malloc failed because of a lack of memory.
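For what it's worth, a defensive sketch of the in-kernel allocation being described (the kernel and sizes here are hypothetical): in-kernel `malloc()` can return NULL when the device heap runs out, and dereferencing that NULL pointer is exactly the kind of access violation that later shows up as 'unspecified launch failure'.

```cuda
#include <cstdio>

__global__ void kernel(int n)
{
    // In-kernel malloc() draws from a limited device heap in global memory,
    // so a NULL return must be handled; dereferencing it corrupts the launch.
    float *buf = (float *)malloc(n * sizeof(float));
    if (buf == NULL) {
        printf("thread %d: malloc failed\n", threadIdx.x);
        return;  // bail out instead of writing through a NULL pointer
    }
    for (int i = 0; i < n; ++i)
        buf[i] = (float)i;
    free(buf);   // every in-kernel malloc needs a matching free
}
```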

What do you think? Is there a solution to this problem, or do I just have to reduce my memory usage?

I really appreciate your answers, and I hope I'm being as clear as you are, but I am not sure. Sorry if I am unclear.

Here SM stands for Streaming Multiprocessor, i.e., the unit whose threads share a block of shared memory.

Sorry, I had a meaning-distorting typo in that post. You can have more than 49152 bytes of static memory, but not more than 49152 bytes of shared memory per block.

In-kernel malloc() (or any dynamic allocation for that matter) doesn’t give you shared memory, but global memory.
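Since that global-memory heap for in-kernel malloc() is itself limited (the default is small, around 8 MB, if I remember correctly), one option worth trying is to enlarge it before the first kernel that allocates runs. A sketch, with an assumed 128 MB target:

```cuda
#include <cuda_runtime.h>

int main()
{
    // Enlarge the device heap used by in-kernel malloc().  This must happen
    // before any kernel that calls malloc() is launched; the heap still
    // lives in global memory, not shared memory.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    // heap now reports the configured size (possibly rounded by the driver).
    return 0;
}
```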

Thanks a lot, I will think about all of this to improve my program, and I will ask other questions if I need to.

And what about the “usual allocation”, for example

float a;

Is it allocated in global memory as a static allocation?

That will either give you a register, or “local” memory (basically global memory, with a different layout to improve coalescing).

Once more, thanks a lot, tera. Your answers are really clear.