I’m having trouble running a simulation using CUDA on my 8800GT. Basically the code simulates waves traveling through a medium at a high time resolution (1x10^-8 seconds) on a 2D mesh of 240x240 ‘cells’.
The simulation runs as expected for around 0.4 seconds of simulation time, i.e. roughly 4x10^7 calls of each kernel, and then one of two things happens:
1. QNaNs appear everywhere in the output.
2. The kernels stop modifying the values passed to them: the code keeps running, but the output is identical at every subsequent time step.
On the surface it looks like I’ve made a maths error, but these effects happen at a random time after the start of the simulation, which suggests to me some sort of memory leak or similar. On some occasions the program runs perfectly to completion at 10^8 calls; on others the output falls into QNaNs as early as 10^7 time steps. This seems to be completely random, although the code exhibits no other random behaviour: for a given set of input parameters the output is almost exactly the same up until the point where it fails.
Two things I thought of were:
Is it OK to read a value from a different thread index inside a kernel, assuming the writes were done by an earlier kernel call separated by a cudaThreadSynchronize(), such as:
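Something along these lines (rough pseudocode with made-up names like d_V, d_temp and nCells rather than my real variables):

```cuda
// Kernel 1: each thread reads its own cell and its neighbour's cell,
// writing the result into a separate temp array.
__global__ void step(const float *V, float *temp, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n - 1)
        temp[idx] = V[idx] + V[idx + 1];   // reads another thread's element
}

// Kernel 2: copy the temp array back into V for the next time step.
__global__ void copyBack(float *V, const float *temp, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        V[idx] = temp[idx];
}

// Host loop:
for (long t = 0; t < nSteps; ++t) {
    step<<<blocks, threads>>>(d_V, d_temp, nCells);
    cudaThreadSynchronize();   // make sure all writes to temp have finished
    copyBack<<<blocks, threads>>>(d_V, d_temp, nCells);
    cudaThreadSynchronize();
}
```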
Is that OK? Excuse the quick pseudocode. There was another topic on this forum about this, but my problem seems different: I think I’ve covered all the bases that allowed his code to work, yet mine still fails after an extended period.
Are there any other easy memory leaks or traps I could have fallen into? In addition, does doing anything else with the GPU affect CUDA in any way? The problem seems more persistent if I use my computer for simple tasks such as web browsing while the code is running, but I’ve done nowhere near enough testing to say that for certain.
Unfortunately I can’t post my code as it’s covered by university research nonsense, but I’ll happily expand on the pseudocode if you think that will help.
Your pseudocode looks correct, even without the cudaThreadSynchronize(). Kernels from the same stream don’t overlap (and kernels don’t overlap at all on the 8800GT).
Are you checking return codes? That might give some hint about the nature of the failure. In particular, it would immediately show whether a failing memory allocation is the cause.
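For example, something along these lines (just a sketch; d_V, h_V and nCells are placeholder names):

```cuda
// Check every CUDA runtime call, not just the allocations.
cudaError_t err = cudaMalloc((void **)&d_V, nCells * sizeof(float));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}

err = cudaMemcpy(d_V, h_V, nCells * sizeof(float), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
```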
The classic one would be accessing uninitialized memory, for example in the V[idx + 1] term you had there.
Have you covered the boundary conditions, i.e. the idx + 1 access when idx gets to the boundary (array size - 2)?
One thing you could try is to memset all allocations to zero before using them and see whether it makes any difference (this would take care of the “using uninitialized memory” case, whereas the previous check catches out-of-bounds errors).
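Something like this is what I mean (sketch only, placeholder names):

```cuda
// Zero the whole allocation before first use, so an accidental read of an
// element that was never written gives 0.0f instead of whatever was left
// in that memory beforehand.
cudaMalloc((void **)&d_temp, nCells * sizeof(float));
cudaMemset(d_temp, 0, nCells * sizeof(float));

// And inside the kernel, guard the idx + 1 access at the right-hand edge:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < nCells - 1)
    temp[idx] = V[idx] + V[idx + 1];
```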
I doubt the memory allocation is failing: the memory is allocated on both the host and the device at the start of the run, and all that happens during runtime are changes to the device memory by the kernels shown in my pseudocode, plus one cudaMemcpy back to the host every 10^5 time steps/kernel calls to check the output. If memory allocation were a problem, wouldn’t the code be likely to fail immediately instead of running fine for 10^7 kernel calls?
Yeah, I’m pretty sure I have all the boundary conditions covered. Again, if any of these were wrong I would expect the computation to fail within the first few kernel calls rather than running for such an extended period.
I’m not sure what you mean about memsetting all allocations to zero before use, could you explain please? Do you just mean giving them an initial value of zero when the memory is first allocated? If so, I do that.
I thought you were allocating and deallocating somewhere in between, since you suspected a memory leak.
But it’s not only the memory allocations; kernels can also fail with an error code. It’s just less obvious with kernels, since the launch is asynchronous and doesn’t return the error directly. You have to check explicitly with cudaGetLastError() (after a cudaThreadSynchronize(), to make sure the kernel has already finished).
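In other words, after each launch (sketch, placeholder names):

```cuda
step<<<blocks, threads>>>(d_V, d_temp, nCells);
cudaThreadSynchronize();              // wait until the kernel has actually run
cudaError_t err = cudaGetLastError(); // now any launch/execution error is visible
if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed at step %ld: %s\n", t, cudaGetErrorString(err));
    exit(1);
}
```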
Not necessarily: I think it would be possible to read garbage values. It might be that the value read is “usually” zero, but something else could write there and the error could keep accumulating over the calls. I suppose this is not the most likely reason, though.
Yep, that’s what I meant. I also just happened to think of one more thing:
You had something like V[idx] = ... + V[idx + 1] there. Are those indices derived from thread indices? Have you made sure there are no race conditions between the threads? That sort of thing could fit your symptoms.
Sorry, rather than a memory leak within my code, I was wondering whether the GPU could have caused its own memory effects on CUDA-allocated memory over an extended period of time, possibly freeing memory itself or something. I thought it unlikely, but figured I’d ask in case anyone had seen the problem before.
I’ll try running everything once more with cudaGetLastError() checks all over the place, good idea.
The [idx]/[idx+1] are thread indices, but that’s why I have one kernel setting temp[idx] = V[idx] + V[idx + 1] etc., then a separate kernel setting V[idx] = temp[idx]. As far as I’m aware, this should remove any race conditions?
OK, just making sure. By the way, you do realise that using one kernel with a __syncthreads() in the middle would give you more performance? That is, unless you really need global synchronization (for example because of the last element; that can also be worked around with even/odd kernel calls, and possibly with atomics).
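Roughly what I mean, as a sketch only (BLOCK and the update expression are placeholders); note the caveat above, since the last thread of each block still reads an element owned by the neighbouring block:

```cuda
#define BLOCK 256   // placeholder block size, launch with blockDim.x == BLOCK

__global__ void stepInOne(float *V, int n)
{
    __shared__ float temp[BLOCK];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n - 1)
        temp[threadIdx.x] = V[idx] + V[idx + 1];   // same update as before
    __syncthreads();       // all threads in this block have written temp
    if (idx < n - 1)
        V[idx] = temp[threadIdx.x];                // write back in place
}
```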
One more idea for debugging this would be to work backwards from where the NaNs first appear, using the debugger or similar (I haven’t really used the on-chip debugger; I used to do everything with the emulator). That is, add code to detect a NaN (x != x), set a breakpoint, and run until it is hit to see where the number that produced the first NaN came from, and why. You can also do this with just printfs: find the first NaN and work backwards. I’m not sure whether CUDA supports exceptions on NaNs, but I’d assume not.
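For example, on the host side after each copy back (placeholder names h_V, nCells, t):

```cuda
// Scan the snapshot for the first NaN and report where it appeared;
// x != x is true only when x is NaN.
int firstNan = -1;
for (int i = 0; i < nCells; ++i) {
    if (h_V[i] != h_V[i]) {
        firstNan = i;
        break;
    }
}
if (firstNan >= 0) {
    printf("first NaN at cell %d after %ld steps\n", firstNan, t);
    // from here, work backwards: dump that cell's neighbours from the
    // previous snapshot to see which input produced it
}
```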