using cudaMalloc and cudaFree within a loop: unspecified launch failure!


my program does the following:

  • some arrays are allocated on the GPU (cudaMalloc) and values from the CPU are copied into these newly allocated arrays
  • in a kernel function I read these values for calculations; I don’t write new values back to these global GPU arrays, i.e. I don’t change them
  • afterwards, all the GPU memory for the arrays is freed (cudaFree)

These steps are done within a loop, i.e. in every iteration arrays are allocated on the GPU and freed again.
I can’t allocate beforehand because the array sizes differ between iterations, so I would allocate far too much memory most of the time.

Now my problem: the program runs fine for some seconds and then suddenly stops with the error message: unspecified launch failure

I have discovered that this error doesn’t occur if I allocate all arrays before entering the loop, i.e. allocating at the start of the program (with size = worst case, which isn’t a good solution at all) and freeing all memory when the program terminates.
Furthermore, the program also runs fine if the array sizes don’t differ between loop iterations (just allocating and deallocating the same amount of memory in every step of the loop).

Has anybody else seen something like this? It seems to me that allocating and deallocating (varying!) amounts of memory so many times in a loop corrupts the GPU memory after a while!

By the way: in emulation mode everything runs fine. So there can’t be any out-of-bounds array accesses, can there?

I have checked my code so many times now. It’s driving me almost insane!!

Hope, someone has an idea to fix that problem…


I have seen similar behavior in my code, which allocates several arrays, runs through many kernel calls for a few hours, then frees the arrays and reallocates new ones with different sizes. After about 8-10 cycles of this (depending on the size of the arrays), I’ll get a kernel launch failure. My solution has been to avoid freeing and reallocating memory repeatedly, since I don’t have time to chase this bug, and it was easy to do.

My only guess is that this has something to do with memory fragmentation, as it seems to happen sooner when I have lots of large arrays being freed and allocated in a non-regular order.
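The fragmentation guess can be illustrated without any CUDA at all. Below is a toy first-fit allocator over a 100-unit arena (a sketch, not how the CUDA allocator actually works internally): it only tracks offsets and lengths, and shows how 60 free units can still fail a 50-unit request because the free space is split into a 40-unit hole and a 20-unit tail.

```cpp
#include <map>

// Toy first-fit allocator over a 100-unit arena. It tracks used blocks as
// (offset -> length) and models placement only; no real memory is involved.
struct Arena {
    static const int SIZE = 100;
    std::map<int, int> used;  // offset -> length, kept sorted by offset

    // First fit: return the offset of the first gap that can hold `len`,
    // or -1 if no single contiguous gap is large enough.
    int alloc(int len) {
        int cursor = 0;
        for (auto& block : used) {
            if (block.first - cursor >= len) break;  // gap before this block fits
            cursor = block.first + block.second;     // skip past this block
        }
        if (SIZE - cursor < len) return -1;
        used[cursor] = len;
        return cursor;
    }

    void release(int off) { used.erase(off); }

    int freeTotal() const {
        int t = SIZE;
        for (auto& block : used) t -= block.second;
        return t;
    }
};

// Demo: 60 units are free in total, yet a 50-unit request fails because no
// single gap is that large (a 40-unit hole at the front, 20 units at the end).
int fragmentationDemo() {
    Arena a;
    a.alloc(40);        // block at offset 0
    a.alloc(40);        // block at offset 40
    a.release(0);       // free the first block: 40-unit hole at the front
    return a.alloc(50); // -1: fails despite 60 free units in total
}
```

Freeing and reallocating varying sizes in a non-regular order produces exactly this kind of hole pattern, which would match the observation that larger arrays trigger the failure sooner.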

Yes, the larger my arrays are, the sooner my program crashes!
I’ve now tried always using powers of two for the array sizes, but that doesn’t work either! :no:
It’s probably best for GPUs to allocate all memory at the beginning, isn’t it? But that’s such a bad idea in my case! :(

What about something like:

float *hugeArray = 0;

cudaMalloc((void**)&hugeArray, LOTS_OF_RAM);

float *arr_A = 0, *arr_B = 0, *arr_C = 0;

while (loop)
{
	// offsets are counted in floats (elements), not bytes
	arr_A = &hugeArray[0];
	arr_B = &hugeArray[arr_A_size];
	arr_C = &hugeArray[arr_A_size + arr_B_size];

	mangleArrays<<<cudaStuff, cudaThings>>>(arr_A, arr_B, arr_C);
}


If you know any maximum size information at all about the total size of your arrays, you can allocate that as hugeArray and shift the pointers of the arrays around so they’ll fit, and you avoid the unstable and perhaps slow array allocation.
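The offset bookkeeping above can be sketched host-side, just to show the arithmetic (the sizes and helper names here are made up; offsets are counted in elements of one type, so for mixed element types you would additionally round each offset up to a suitable alignment):

```cpp
#include <cstddef>
#include <vector>

// A sub-array is just the big buffer plus an element offset.
float* subArray(float* base, std::size_t offset) { return base + offset; }

// Compute the three offsets the snippet above hard-codes:
// A starts at 0, B right after A, C right after A and B.
std::vector<std::size_t> packOffsets(std::size_t aSize, std::size_t bSize) {
    return { 0, aSize, aSize + bSize };
}
```

Each loop iteration then only recomputes offsets; the single cudaMalloc happens once, so the suspect allocate/free churn disappears entirely.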

Edit: another thought: if the kernel / host code is asynchronous, it may be that the first threads of the functions return before all the other threads are finished, causing the loop code to deallocate the arrays while straggling threads are still running in your kernels. I doubt this is how CUDA works, but it may be worth a look?


thank you for your hint! But this way I have to allocate the maximum possible size (worst case) on the GPU, right?

Why does allocating memory on the GPU within a loop crash the program? I don’t understand! Is a GPU just not designed for such a task??

I don’t think my program crashes because the kernel/host code is asynchronous… I don’t free the arrays immediately after calling the kernel; I do some other things before that… Besides, I think the host code doesn’t continue until the kernel has finished in all threads…?

The only reasons I could think of would be:

  • some buffer overflow in your code overwrites important heap structures
  • somehow the freeing or allocating goes haywire (wrong pointers etc), causing a memory leak

I think Wumpus is probably right, but still: You may not have to allocate the ABSOLUTE worst case scenario as the huge array, only the “cumulative array size” worst case which may be lower than the absolute worst case.

That is, if array A can be up to [500000] but only if array B is < 100000, and array B can be [700000] only if A < 200000, then you can allocate a maximum of … what… [900000]? Does this make sense? I don’t know your code, but I have a hunch that you can do this. Might be faster, might be pointless and fragile. I’m basically speculating on how CUDA works to try to understand it better myself.
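The arithmetic behind that [900000] figure, treating the stated bounds as inclusive for simplicity: the pool only needs the larger of the two combined peaks, not the sum of both absolute maxima (which would be 1200000).

```cpp
#include <algorithm>

// Cumulative worst case for the example above: A can reach 500000 only
// while B stays at/under 100000, and B can reach 700000 only while A
// stays at/under 200000. Size the pool for the larger combined peak.
long cumulativeWorstCase() {
    long peakWhenALarge = 500000L + 100000L;  // A maxed out, B at its bound
    long peakWhenBLarge = 200000L + 700000L;  // B maxed out, A at its bound
    return std::max(peakWhenALarge, peakWhenBLarge);  // 900000
}
```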

Has anybody found a solution to this problem? I have checked my code for one entire week… I’m going totally crazy!!!



I’m running into a very similar problem. The pre-allocation solution won’t work for me and I’m kinda beating my head against the wall :argh: trying to figure out how to attack this. It seems to appear only when I’m allocating and freeing a lot of larger memory chunks. When it fails, it crashes in:

Program received signal SIGSEGV, Segmentation fault.
0x00002aec03620bc2 in cuTexRefSetAddress () from /usr/local/cuda/lib/
(gdb) where
#0 0x00002aec03620bc2 in cuTexRefSetAddress () from /usr/local/cuda/lib/
#1 0x00002aec03616a07 in cuTexRefSetAddress () from /usr/local/cuda/lib/
#2 0x00002aec0398c6e7 in cudaMallocPitch () from /usr/local/cuda/lib/

Any ideas?


I have the same problem. Allocating and freeing large arrays causes an eventual failure. At first I thought I forgot to free one of the arrays but I have checked and re-checked the code. In addition, I did:

unsigned int freeMemory;
unsigned int totalMemory;
CUdevice dev;
CUcontext ctx;
int gpuCount;

cuInit(0);
cuDeviceGetCount(&gpuCount);

for (int i = 0; i < gpuCount; i++)
{
	cuDeviceGet(&dev, i);
	cuCtxCreate(&ctx, 0, dev);
	cuMemGetInfo(&freeMemory, &totalMemory);

	printf("Total GPU memory: %u, free memory: %u\n", totalMemory, freeMemory);

	cuCtxDetach(ctx);
}


Every time I run the function I see that the free memory remains the same, so there is no memory leak. And yet at some point it fails to allocate memory. Could it be a fragmentation issue…? Sometimes nothing helps except restarting the computer… :crying:

I haven’t tried the new Cuda 2.0 (beta?). Maybe it’ll work better with that one…


shooting in the dark here, but in your loop you might insert a

cudaThreadSynchronize();

after the kernel call and before the cudaFree() calls.
That might give your kernels all the time they need to finish
their work and let them ALL settle down before you
deallocate the memory they’re working on?

I would just try to avoid the existing memory manager and use an alternative. Another disadvantage of the built-in memory manager is its slow speed.

You can allocate a large amount of memory at the beginning and then use the lib below to treat it as a memory pool. Fragmentation will still be a problem, however…

Could you try to reduce the problem to a relatively simple fragment of code and post that? I haven’t run into the problem yet; I’m using an nvcc built Tue_Jun_10_05:42:45_PDT_2008. You should try printing out all of the values returned by cudaMalloc and everything you pass to cudaFree; I’ve made mistakes in wrapper classes before… always good to double-check (maybe use a script if the list is really long).

also, I had a section with a lot of variable-length offsets… dunno if this is a good solution, but it might be more sane (less duplicated code) than recalculating all of the offsets

int arr_max_mem = 0;

queue<int> arr_off;

for (...) {
   arr_off.push(arr_max_mem);   // remember where this array starts
   arr_max_mem += <variable_mem_size>;
}

// pass ptrs with offsets from queue
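A runnable version of that offset-queue idea (the function name and the example sizes are made up): accumulate the variable per-array sizes once, remembering each array's starting offset, so nothing has to be recalculated by hand.

```cpp
#include <cstddef>
#include <queue>

// For n arrays with the given element counts, queue up each array's
// starting offset and report the grand total, which is how much you would
// cudaMalloc once, up front.
std::queue<std::size_t> buildOffsets(const std::size_t* sizes, int n,
                                     std::size_t* totalOut) {
    std::queue<std::size_t> offsets;
    std::size_t total = 0;
    for (int i = 0; i < n; ++i) {
        offsets.push(total);   // this array starts where the previous ended
        total += sizes[i];
    }
    *totalOut = total;
    return offsets;
}
```

Popping the queue in order then yields the base offset for each array, in the same order the sizes were supplied.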

if anyone knows how to enable any sort of array bounds checking for device emulation, that would be great.

I think that I am experiencing the same problem as you, however I’m also receiving an exception:

First-chance exception at 0x7c812a5b in mshta.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0997ece8…

I described it in this (rather unpopular) thread:

Since the manual states that a kernel launch can also return errors caused by previous asynchronous calls, I think the kernel does not actually fail; rather, a preceding asynchronous call does, in my case a cudaFree. I check the error result of this cudaFree, but it just returns cudaSuccess. However, the manual says that “the functions that set memory” are asynchronous, so I figure cudaFree might be among those. This would mean that a kernel launch directly following a cudaFree might fail even though there is nothing wrong with the kernel itself. I suppose we can only check this by putting a cudaThreadSynchronize after EVERY (async) CUDA call (slow!).
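That deferred-reporting behavior can be modeled in a few lines. This is a toy model of the mechanism described above, not real CUDA: every asynchronous call reports success immediately, and if it later fails, the error surfaces as the return value of the *next* call.

```cpp
#include <string>

// Toy model of deferred asynchronous error reporting (an assumption about
// the behavior described above, not the actual CUDA runtime).
struct AsyncApi {
    std::string pending;  // error queued by an earlier asynchronous call

    // Report the previously queued error, if any; if this call itself
    // fails asynchronously, queue its error for the NEXT call to report.
    std::string call(bool thisCallFails, const std::string& name) {
        std::string reported = pending.empty() ? std::string("success") : pending;
        pending = thisCallFails ? ("error from " + name) : std::string();
        return reported;
    }
};
```

Under this model, a failing cudaFree itself returns cudaSuccess, and the innocent kernel launch right after it is the call that reports the failure, which matches the symptom in the post.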

As for the actual cause of the error, I’d have to say that some sort of memory fragmentation sounds most appealing.

Count me in as another vote for the urgency of VC2008 support. Our app has been using VC2008 for 6 months now; trying to hack the VC2005 cl.exe isn’t going to cut it if we are really going to support CUDA.

Is there a BETA I’m missing someplace with VC2008 support?


Just a word of caution: from what I read, the kernel invocations are asynchronous, so you’d better use cudaThreadSynchronize() to make sure that all threads have finished before you proceed and free the memory from the host. You don’t need to call cudaThreadSynchronize() explicitly, though, if you already call cudaMemcpy(), for example; cudaMemcpy() does it on your behalf.

Having said that, I am in your shoes. My problem looks very much like a memory leak or memory fragmentation. I also have reproducible failures. The failures occur quite predictably and strongly depend on the allocation sizes. The larger the size of the arrays, the sooner the failure occurs.

I wondered whether I could call cudaThreadExit() periodically and whether it can help, but I couldn’t find any additional information on it. I am still looking for a solution.

Just a general post.

If you have a loop on the CPU in which you allocate memory, call a kernel, and deallocate memory, and you know the maximum size the memory will need, it is always best to allocate that amount beforehand rather than allocate and deallocate inside the loop, even without this bug present.
Not only is it faster, you will also know from the beginning of the calculation whether the GPU has enough memory to finish it.
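When the true worst case is not known up front, a middle ground between pre-allocating everything and freeing every iteration is a grow-only buffer: keep one allocation and only grow it when an iteration needs more than the current capacity. A host-side sketch (the struct is hypothetical and uses plain new[]/delete[]; the same bookkeeping would wrap cudaMalloc/cudaFree):

```cpp
#include <cstddef>

// Grow-only buffer: reserve() reuses the existing allocation whenever it
// is big enough, and only reallocates when the requested size grows past
// the current capacity. Allocator traffic drops from once per iteration
// to once per new size peak.
struct GrowOnlyBuffer {
    float* data;
    std::size_t capacity;
    int reallocations;  // how often we actually hit the allocator

    GrowOnlyBuffer() : data(0), capacity(0), reallocations(0) {}
    ~GrowOnlyBuffer() { delete[] data; }

    float* reserve(std::size_t needed) {
        if (needed > capacity) {
            delete[] data;           // with CUDA: cudaFree(data)
            data = new float[needed]; // with CUDA: cudaMalloc(...)
            capacity = needed;
            ++reallocations;
        }
        return data;
    }
};
```

For workloads where sizes fluctuate around a plateau, this reaches the peak size after a few iterations and then never touches the allocator again, sidestepping the repeated free/malloc pattern that triggers the failure.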

It is a known bug in the driver, that has been fixed internally.
A fixed driver should be available shortly.

I also get the “unspecified launch failure” when all I do in a loop is kernel calls followed by device-to-device cudaMemcpy calls.

Between the kernel calls and the cudaMemcpys I have cudaThreadSynchronize calls.

This fails after 130 iterations with “unspecified launch failure” at the cudaMemcpy.

Strangely enough, when I set the environment variable CUDA_LAUNCH_BLOCKING=1, it instead fails with “unspecified launch failure” at the cudaThreadSynchronize call after the cudaMemcpy.

Would the fixed driver maybe also solve this?

I have been trying to solve this for some time now.

Any ideas on when this fixed driver will be available for linux?

Any updates on this?

I’ve tried drivers 178.24 and 180.42 on Vista SP1 x64, both in vain.