using cudaMalloc and cudaFree within a loop unspecified launch failure!

Mirja · February 27, 2008, 7:59pm

Hello,

my program does the following:

some arrays are allocated on GPU (cudaMalloc) and values from CPU are copied to these new allocated arrays
in a kernel function I read these values to use them in calculations - I don’t write back new values to these global GPU arrays, i.e. I don’t change them.
after all the GPU memory of the arrays is set free (cudaFree)

These steps are done within a loop. I.e. in every loop arrays are allocated on GPU and are set free again.
I can’t do the allocating before because the sizes of these arrays differ and I would allocate too much memory most of the time.

Now my problem: The program runs fine some seconds and then, suddenly, it stops with the error message: unspecified launch failure

I have discovered that this error doesn’t occur, if I allocate all arrays before starting the loop. I.e. allocating at the beginning of the program (with size=worst case which isn’t a good solution at all) and setting free all memory at terminating the program.
Furthermore, the program runs fine, too, if in every step of the loop the array sizes don’t differ. (Just allocating and deallocating the same amount of memory in every step of the loop)

Has anybody else discovered something like that? It seems to me that it is crashing the GPU memory after some time, when allocating and deallocating (varying!) memory so many times in a loop!

By the way: In emu mode everything runs fine. So there cannot be accesses to array locations out of bound, can?

I have checked my code for so many, many times now. It makes me almost insane!!

Hope, someone has an idea to fix that problem…

Greetings,
Mirja

seibert · February 27, 2008, 8:46pm

I have seen similar behavior in my code, which allocates several arrays, runs through many kernel calls for a few hours, then frees the arrays and reallocates new ones with different sizes. After about 8-10 cycles of this (depending on the size of the arrays), I’ll get a kernel launch failure. My solution has been to avoid freeing and reallocating memory repeatedly, since I don’t have time to chase this bug, and it was easy to do.

My only guess is that this has something to do with memory fragmentation, as it seems happen sooner when I have lots of large arrays being freed and allocated in a non-regular order.

Mirja · February 28, 2008, 9:38am

Yes, the larger my arrays are the sooner my program crashes!
I’ve now thought about using always powers of two for the size of the arrays, but this also doesn’t work! External Image
Probably it’s the best for GPUs when allocating all memory at the beginning, isn’t it? But that’s such a bad idea in my case! :(

kristleifur · February 28, 2008, 10:03am

What about something like:

float *hugeArray = 0;

cudaMalloc(hugeArray, LOTS_OF_RAM);

float *arr_A = 0, *arr_B = 0, *arr_C = 0;

while (loop)

{

  arr_A = hugeArray[0];

  arr_B = &(hugeArray[arr_A_size]);

  arr_C = &(hugeArray[arr_A_size + arr_B_size]);

  mangleArrays<<<cudaStuff, cudaThings>>>(arr_A, arr_B, arr_C);

}

If you know any maximum size information at all about the total size of your arrays, you can allocate that as hugeArray and shift the pointers of the arrays around so they’ll fit, and you avoid the unstable and perhaps slow array allocation.

Edit: another thought - if the kernel / host code is asynchronous, it may be that the first threads of the functions return before all the other threads are finished, causing the loop code to deallocate the arrays while still having straggling threads running in your kernels. I doubt this is how CUDA works, but it may be worth a look?

Mirja · February 28, 2008, 10:30am

What about something like:
float *hugeArray = 0;

cudaMalloc(hugeArray, LOTS_OF_RAM);

float *arr_A = 0, *arr_B = 0, *arr_C = 0;

while (loop)

{

  arr_A = hugeArray[0];

  arr_B = &(hugeArray[arr_A_size]);

  arr_C = &(hugeArray[arr_A_size + arr_B_size]);

  mangleArrays<<<cudaStuff, cudaThings>>>(arr_A, arr_B, arr_C);

}
If you know any maximum size information at all about the total size of your arrays, you can allocate that as hugeArray and shift the pointers of the arrays around so they’ll fit, and you avoid the unstable and perhaps slow array allocation.

Edit: another thought - if the kernel / host code is asynchronous, it may be that the first threads of the functions return before all the other threads are finished, causing the loop code to deallocate the arrays while still having straggling threads running in your kernels. I doubt this is how CUDA works, but it may be worth a look?

[snapback]334955[/snapback]

Hi,

thank you for your hint! But by this way I have to allocate the maximum possible size (worst case) on GPU, right?

Why does it crash the program when allocating memory on GPU within a loop? I don’t understand! Is a GPU just not designed for such a task??

I don’t think my program crashes because the kernel/host code is asynchronous… I don’t free the arrays immediately after calling the kernel, I do some other things before that… Besides I think that the host code continues not until the kernel has stopped for all threads…?

wumpus · February 28, 2008, 10:32am

The only reasons I could think of would be:

some buffer overflow in your code overwrites important heap structures
somehow the freeing or allocating goes haywire (wrong pointers etc), causing a memory leak

kristleifur · February 28, 2008, 11:24am

I think Wumpus is probably right, but still: You may not have to allocate the ABSOLUTE worst case scenario as the huge array, only the “cumulative array size” worst case which may be lower than the absolute worst case.

That is, if array A can be up to [500000] but only if array B is < 100000, and array B can be [700000] only if A < 200000, then you can allocate a maximum of … what… [900000]? Does this make sense? I don’t know your code, but I have a hunch that you can do this. Might be faster, might be pointless and fragile. I’m basically speculating on how CUDA works to try to understand it better myself.

garciav · April 4, 2008, 9:00am

Does anybody have found a solution to this problem? I have checked my code for 1 entire week… I’m becoming totally crazy!!!

External Media

Vince

timcon · June 19, 2008, 7:29pm

I’m running into a very similar problem. The pre-allocation solution won’t work for me and I’m kinda beating my head against the wall External Image trying to figure out how to attack this. It seems to only appear when I’m allocating and freeing a lot of larger memory chunks. When I fail, it crashes for me in :

Program received signal SIGSEGV, Segmentation fault.
0x00002aec03620bc2 in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so
(gdb) where
#0 0x00002aec03620bc2 in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so
#1 0x00002aec03616a07 in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so
#2 0x00002aec0398c6e7 in cudaMallocPitch () from /usr/local/cuda/lib/libcudart.so

Any ideas?

Tim

yoavmor · July 24, 2008, 12:23pm

I have the same problem. Allocating and freeing large arrays causes an eventual failure. At first I thought I forgot to free one of the arrays but I have checked and re-checked the code. In addition, I did:

unsigned int freeMemory;

unsigned int totalMemory;

CUdevice dev;

CUcontext ctx;

int gpuCount;

cuInit(0);

cuDeviceGetCount(&gpuCount);

for (int i=0; i<gpuCount; i++)

  {

	cuDeviceGet(&dev,i);

	cuCtxCreate(&ctx, 0, dev);

	cuMemGetInfo (&freeMemory, &totalMemory);

	cuCtxDetach(ctx);

	printf ("Total GPU Memory: %d, free memory: %d\n", totalMemory, freeMemory);

  }

Every time I run the function I see that the free memory remains the same, no memory leak. And yet it fails allocating memory at some point. Could it be an issue with fragmentation…? Sometimes nothing else helps except restarting the computer… :crying:

I haven’t tried the new Cuda 2.0 (beta?). Maybe it’ll work better with that one…

Y.

Indogo2000 · July 27, 2008, 3:10pm

hmm
shooting in the dark here but
in your loop you might insert a

cudaThreadSynchronize();

afterthe kernel call and before the cudaFree() calls.
that might give your kernels all time to finish
their work and let them ALL settle down before you
deallocate the memory they’re trying to work on?

spacerat · July 28, 2008, 3:38am

I would just try to avoid the existing memory manager and use an alternative. A disadvantage of the build-in memory manager is also the slow speed.

You can allocate a large amount of memory at the beginning and then use the lib below to treat it as a memory pool. Problem will be fragmentation however…

[url=“dbestfit: Best-Fit Memory Allocation”]http://daniel.haxx.se/projects/dbestfit/[/url]

gatoatigrado · July 28, 2008, 4:00am

Could you try to reduce the problem to a relatively simple fragment of code, and post that? I haven’t run into the problem yet; I’m using nvcc built Tue_Jun_10_05:42:45_PDT_2008. You should try printing out all of the values from cudaMalloc and what you’re passing to cudaFree; I’ve made mistakes in wrapper classes… always good to double check (maybe use a script if the list is really long).

also, I had a section with a lot of variable length offsets… dunno if this is a good solution, but it might be more sane (less duplicate code) then recalculating all of the offsets

int arr_max_mem = 0;

queue arr_off;

for (...) {

   arr_off.push(arr_max_mem);

   arr_max_mem += <variable_mem_size>

}

cudaMalloc(arr_max_mem);

// pass ptrs with offsets from queue

if anyone knows how to enable any sort of array bounds checking for device emulation, that would be great.

immo · July 31, 2008, 8:17am

I think that I am experiencing the same problem as you, however I’m also receiving an exception:

First-chance exception at 0x7c812a5b in mshta.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0997ece8…

I described it in this (rather unpopular) thread:
[url=“http://forums.nvidia.com/index.php?showtopic=73319”]http://forums.nvidia.com/index.php?showtopic=73319[/url]

Since the manual states that a kernel launch can also return errors caused by previous asynchronous calls, I think the kernel does not actually fail but a preceding asynchronous call, in my case a CudaFree. I check the errorresult of this cudaFree but it just returns cudaSuccess. However the manual says that “The functions that set memory” are asynchronous so I figure cudaFree might be among those. This would mean that a kernel launch directly following a cudaFree might fail while there is nothing wrong in the kernel itself. I suppose we can only check this by putting a cudaThreadSynchronize after EVERY (async) cuda call (slow!).

As for the actual cause of the error, I’d have to say that some sort of memory fragmentation sounds most appealing.

Esperient · September 10, 2008, 9:45pm

Count me in for another vote for the urgency of VC2008 support. Our app has been using 2008 for 6 months now, trying to hack the VC2005 cl.exe isn’t going to cut it if we are really going to support CUDA.

Is there a BETA I’m missing someplace with VC2008 support?

Mike

victor_gorin · September 21, 2008, 4:59am

Just a word of caution: from what I read, the kernel invocations are asynchronous, so you’d better use cudaThreadSynchonize() to make sure that all threads have finished before you proceed and free the memory from the host. You don’t need to call cudaThreadSynchronize explicitly though, if you, for example, already call cudaMemcpy(). cudaMemcpy() does it on your behalf.

Having said that, I am in your shoes. My problem looks very much like a memory leak or memory fragmentation. I also have reproducible failures. The failures occur quite predictably and strongly depend on the allocation sizes. The larger the size of the arrays, the sooner the failure occurs.

I wondered whether I could call cudaThreadExit() periodically and whether it can help, but I couldn’t find any additional information on it. I am still looking for a solution.

E.D_Riedijk · September 21, 2008, 6:01am

Just a general post.

If you are having a loop on CPU in which you allocate memory, call a kernel, deallocate memory. Then when you know the maximum size needed for the memory, it is always best to allocate that amount of memory beforehand, and not do the allocate-deallocate in a loop, even without this bug present.
Not only is it faster, you will also know from the beginning of the calculation if the GPU has enough memory to finish the calculations.

mfatica · September 22, 2008, 2:58am

It is a known bug in the driver, that has been fixed internally.
A fixed driver should be available shortly.

nldudok1 · October 1, 2008, 3:28am

I get the “unspecified launch failure” also when I only do kernel calls followed by cudaMemcpy DeviceToDevice calls in a loop.

between the kernel calls and the cudaMemcpys I have cudaThreadSynchronise calls.

This fails after 130 runs with “unspecified launch failure” at the cudaMemcpy.

Strangely enough. When I set the environment variable CUDA_LAUNCH_BLOCKING=1

Then it fails with “unspecified launch failure” at the cudaThreadSynchronise call after the cudaMemcpy

Would the fixed driver maybe also solve this?

I have been trying to solve this for some time now.

Any ideas on when this fixed driver will be available for linux?

prcs · October 23, 2008, 3:28pm

Any updates on this?

I’ve tried driver 178.24 and 180.42, on Vista SP1 x64, yet both in vain.

Topic		Replies	Views
cudaMalloc error in big loop CUDA Programming and Performance	12	15594	May 21, 2008
Cudamalloc attempting to allocate more memory than it is supposed to CUDA Programming and Performance cuda	15	28	January 13, 2025
CudaMalloc on Vista : strange behaviour Works on XP, Fails on Vista CUDA Programming and Performance	6	12258	July 1, 2009
Using unified memory causes system crash CUDA Programming and Performance	28	5807	February 4, 2019
How to cudaMalloc two-dimensional array ? CUDA Programming and Performance	46	66109	September 7, 2023
cudamalloc struct problems - unspecified launch failure CUDA Programming and Performance	16	3611	April 8, 2011
Unable to Allocate on Memory and Stack Overflow CUDA Programming and Performance	11	2532	September 6, 2011
Pinned memory limit CUDA Programming and Performance	16	13358	May 1, 2016
cudaFree is returning an unrecognised error code CUDA Programming and Performance	10	7921	March 13, 2009
cudaMalloc causes segmentation fault 2 Mo is far from my 1,2 Go card memory limit CUDA Programming and Performance	7	7461	June 28, 2011

using cudaMalloc and cudaFree within a loop unspecified launch failure!

Related topics