I’m having trouble running a simulation using CUDA on my 8800GT. Basically the code simulates waves traveling through a medium at a high time resolution (1x10^-8 seconds) on a 2D mesh of 240x240 ‘cells’.
The simulation runs as expected for around 0.4 seconds of simulation time, i.e. roughly 4x10^7 calls of each kernel, and then one of two things happens:
1. QNaNs appear everywhere in the output.
2. The kernels stop modifying the values passed to them: the code keeps running, but the output is identical at every subsequent time step.
On the surface it looks like I’ve made a maths error, but these effects happen at a random time after the start of the simulation, which suggests to me some sort of memory leak or similar. On some occasions the program runs perfectly to completion at 10^8 calls; on others the output falls into QNaNs as early as 10^7 time steps. This seems to be completely random, although the code exhibits no other random behaviour: for a given set of input parameters the output is almost exactly the same up until the point where it fails.
Two things I thought of were:
Is it OK to read a value from a different thread index inside a kernel, assuming the writes were done by an earlier kernel call separated by a cudaThreadSynchronize(), such as:
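Something along these lines (rough pseudocode with made-up names like d_V, d_temp and nCells rather than my real variables):

```cuda
// Kernel 1: each thread reads its own cell and its neighbour's cell,
// writing the result into a separate temp array.
__global__ void step(const float *V, float *temp, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n - 1)
        temp[idx] = V[idx] + V[idx + 1];   // reads another thread's element
}

// Kernel 2: copy the temp array back into V for the next time step.
__global__ void copyBack(float *V, const float *temp, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        V[idx] = temp[idx];
}

// Host loop:
for (long t = 0; t < nSteps; ++t) {
    step<<<blocks, threads>>>(d_V, d_temp, nCells);
    cudaThreadSynchronize();   // make sure all writes to temp have finished
    copyBack<<<blocks, threads>>>(d_V, d_temp, nCells);
    cudaThreadSynchronize();
}
```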
Is that OK? Excuse the quick pseudocode. There was another topic on this forum about this, but my problem seems different: I think I’ve covered all the bases that allowed his code to work, yet mine still fails after an extended period.
Are there any other easy memory leaks or traps I could have fallen into? In addition, does doing anything else with the GPU affect CUDA in any way? The problem seems more persistent if I use my computer for simple tasks such as web browsing while the code is running, but I’ve done nowhere near enough testing to say that for certain.
Unfortunately I can’t post my code as it’s covered by university research nonsense, but I’ll happily expand on the pseudocode if you think that will help.
Your pseudocode looks correct, even without the cudaThreadSynchronize(). Kernels from the same stream don’t overlap (and kernels don’t overlap at all on the 8800GT).
Are you checking return codes? That might give some hint about the nature of the failure. In particular, it would immediately show whether a failing memory allocation is the cause.
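For example, something along these lines (just a sketch; d_V, h_V and nCells are placeholder names):

```cuda
// Check every CUDA runtime call, not just the allocations.
cudaError_t err = cudaMalloc((void **)&d_V, nCells * sizeof(float));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}

err = cudaMemcpy(d_V, h_V, nCells * sizeof(float), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
```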
The classic one would be accessing uninitialized memory, for example in the V[idx + 1] term you had there.
Have you covered the boundary conditions, i.e. the idx + 1 access when idx gets to the boundary (array size - 2)?
One thing you could try is to memset all allocations to zero before using them and see whether it makes any difference (this would take care of the “using uninitialized memory” case, whereas the previous check catches out-of-bounds errors).
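Something like this is what I mean (sketch only, placeholder names):

```cuda
// Zero the whole allocation before first use, so an accidental read of an
// element that was never written gives 0.0f instead of whatever was left
// in that memory beforehand.
cudaMalloc((void **)&d_temp, nCells * sizeof(float));
cudaMemset(d_temp, 0, nCells * sizeof(float));

// And inside the kernel, guard the idx + 1 access at the right-hand edge:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < nCells - 1)
    temp[idx] = V[idx] + V[idx + 1];
```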
I doubt the memory allocation is failing: the memory is allocated on both the host and the device at the start of the run, and all that happens during runtime are changes to the device memory by the kernels shown in my pseudocode, plus one cudaMemcpy back to the host every 10^5 time steps/kernel calls to check the output. If memory allocation were a problem, wouldn’t the code be likely to fail immediately instead of running fine for 10^7 kernel calls?
Yeah, I’m pretty sure I have all the boundary conditions covered. Again, if any of these were wrong I would expect the computation to fail within the first few kernel calls rather than running for such an extended period.
I’m not sure what you mean about memsetting all allocations to zero before use, could you explain please? Do you just mean giving them an initial value of zero when the memory is first allocated? If so, I do that.
I thought you were allocating and deallocating somewhere in between, since you suspected a memory leak.
But it’s not only the memory allocations; kernels can also fail with an error code. It’s just less obvious with kernels, since the launch is asynchronous and doesn’t return the error directly. You have to check explicitly with cudaGetLastError() (after a cudaThreadSynchronize(), to make sure the kernel has already finished).
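In other words, after each launch (sketch, placeholder names):

```cuda
step<<<blocks, threads>>>(d_V, d_temp, nCells);
cudaThreadSynchronize();              // wait until the kernel has actually run
cudaError_t err = cudaGetLastError(); // now any launch/execution error is visible
if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed at step %ld: %s\n", t, cudaGetErrorString(err));
    exit(1);
}
```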
Not necessarily: I think it would be possible to read garbage values. It might be that the value read is “usually” zero, but something else could write there and the error could keep accumulating over the calls. I suppose this is not the most likely reason, though.
Yep, that’s what I meant. I also just happened to think of one more thing:
You had something like V[idx] = ... + V[idx + 1] there. Are those indices derived from thread indices? Have you made sure there are no race conditions between the threads? That sort of thing could fit your symptoms.
Sorry, rather than a memory leak within my code, I was wondering whether the GPU could have caused its own memory effects on CUDA-allocated memory over an extended period of time, possibly freeing memory itself or something. I thought it unlikely, but figured I’d ask in case anyone had seen the problem before.
I’ll try running everything once more with cudaGetLastError() checks all over the place, good idea.
The [idx]/[idx+1] are thread indices, but that’s why I have one kernel setting temp[idx] = V[idx] + V[idx + 1] etc., then a separate kernel setting V[idx] = temp[idx]. As far as I’m aware, this should remove any race conditions?
OK, just making sure. By the way, you do realise that using one kernel with a __syncthreads() in the middle would give you more performance? That is, unless you really need global synchronization (for example because of the last element; that can also be worked around with even/odd kernel calls, and possibly with atomics).
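Roughly what I mean, as a sketch only (BLOCK and the update expression are placeholders); note the caveat above, since the last thread of each block still reads an element owned by the neighbouring block:

```cuda
#define BLOCK 256   // placeholder block size, launch with blockDim.x == BLOCK

__global__ void stepInOne(float *V, int n)
{
    __shared__ float temp[BLOCK];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n - 1)
        temp[threadIdx.x] = V[idx] + V[idx + 1];   // same update as before
    __syncthreads();       // all threads in this block have written temp
    if (idx < n - 1)
        V[idx] = temp[threadIdx.x];                // write back in place
}
```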
One more idea for debugging this would be to work backwards from where the NaNs first appear, using the debugger or similar (I haven’t really used the on-chip debugger; I used to do everything with the emulator). That is, add code to detect a NaN (x != x), set a breakpoint, and run until it is hit to see where the number that produced the first NaN came from, and why. You can also do this with just printfs: find the first NaN and work backwards. I’m not sure whether CUDA supports exceptions on NaNs, but I’d assume not.
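For example, on the host side after each copy back (placeholder names h_V, nCells, t):

```cuda
// Scan the snapshot for the first NaN and report where it appeared;
// x != x is true only when x is NaN.
int firstNan = -1;
for (int i = 0; i < nCells; ++i) {
    if (h_V[i] != h_V[i]) {
        firstNan = i;
        break;
    }
}
if (firstNan >= 0) {
    printf("first NaN at cell %d after %ld steps\n", firstNan, t);
    // from here, work backwards: dump that cell's neighbours from the
    // previous snapshot to see which input produced it
}
```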