atomic ops problems

Hi,

I have some problems with atomic instructions: I get incorrect results, timeouts, or a rebooting machine…
after replacing my old kernel

__global__ void houghKernel2(unsigned int* src, TColor* dst, int width, int height, int a, int n, float pia, unsigned int threshold, float n1)
{
const int x = blockDim.x * blockIdx.x + threadIdx.x;
const int y = blockDim.y * blockIdx.y + threadIdx.y;

...some logic...

1) dst[index]++;

}

with 1) => atomicInc((unsigned int*)&dst[index], 1);

It crashed the machine, or timed out, or does nearly nothing.
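For reference, CUDA's atomicInc is not a plain add: per the programming guide it increments modulo (val + 1), so atomicInc(ptr, 1) wraps between 0 and 1. A minimal contrast, assuming a plain unsigned int counter (not the original code):

__global__ void countHits(unsigned int* counter)
{
    // atomicInc: old = *counter; *counter = (old >= 1) ? 0 : old + 1;
    // i.e. the stored value cycles 0 -> 1 -> 0 -> 1 ...
    atomicInc(counter, 1);

    // atomicAdd: an unconditional increment of the counter
    atomicAdd(counter, 1u);
}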

I use XP, beta driver 180.x, VS2008, and CUDA 2.1.

Machine 1: 8600GT - 50% of runs reboot after executing, or time out
Machine 2: 8800GTS - 10% time out, 90% incorrect results (the array is nearly all 0)

My old version, which is not thread-safe with respect to CUDA memory, executes in about 1 sec with slightly varying but stable results.

Can somebody help me pleeeeease?

If this kernel were to act correctly, how many atomic increments would occur (total, over all memory locations)?

Max. 750M, avg ~100M

Can you elaborate on your kernel? From what you wrote I can't make any sense of it. When do you use dst? Why do you need an atomic add?

I'm doing a Hough transform on an image: for each edge pixel I iterate over the buffer (dst) along the corresponding sin/cos line. Because many edge pixels can map to the same Hough-space (buffer) coordinate, I need an atomic increment, or memory gets overwritten by another thread.
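As a rough sketch of that collision (names and parameters here are illustrative, not taken from the posted kernel): several edge pixels can compute the same accumulator bin, so the update has to be a single atomic read-modify-write:

// One vote loop for an edge pixel at (x, y); acc holds numAngles * numRadii
// unsigned int bins, zero-initialized before the kernel runs.
__device__ void voteForPixel(unsigned int* acc, int x, int y,
                             int numAngles, float angleStep, float rOffset)
{
    for (int t = 0; t < numAngles; ++t)
    {
        float theta = t * angleStep;
        float r = x * __cosf(theta) + y * __sinf(theta);
        int bin = t + numAngles * (int)(r + rOffset);

        // A plain acc[bin]++ is a read-modify-write in three steps, so two
        // threads hitting the same bin can each read the same old value and
        // one vote is lost; the atomic version serializes the update.
        atomicAdd(&acc[bin], 1u);
    }
}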

So how did it work before without the atomic operations? Can you post the kernel? Anyway, one thing that can cause the problem is if you try to do global writes based on these atomic operations, since global writes are not actually performed when your code executes but are queued up and flushed once in a while. Hopefully Nvidia will soon expose a global memory flush command.

__global__ void HoughKernel2(unsigned int* src, TColor* dst, int width, int height, int a, int n, float pia, unsigned int threshold, float n1)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;

#ifdef TESTRANGE
    if( y>=0 && y<height && x>=0 && x<width)
#endif
    {
        unsigned int v0 = src[width*y + x];
        if(v0 > threshold)
        {
            for (int t = 0; t < a; t++)
            {
                float o = t * pia;
                float r = x * __cosf(o) + y * __sinf(o);
                int r2 = (int)(r + n1);
                int index = t + a*r2;
#ifdef TESTRANGE
                if(index >= 0 && index < a*n)
#endif
                    //old: dst[index]++;
                    atomicInc((unsigned int*)&dst[index], 1);
            }
        }
    }
}

I see no reason why this should cause problems, unless you are accessing dst out of bounds. Have you tried running the code in emu mode?
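For what it's worth, a launch or out-of-bounds failure can often be caught on the host side; a small sketch using the runtime API of that era (cudaThreadSynchronize has since been superseded by cudaDeviceSynchronize):

#include <cstdio>
#include <cuda_runtime.h>

// Call right after a kernel launch; reports both bad launch configurations
// and errors raised while the kernel was running.
static void checkKernel(const char* name)
{
    cudaError_t err = cudaGetLastError();      // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();         // errors hit during execution
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
}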

No, but with the non-atomic increment it works.

Can somebody give me some advice? Please? :-) I am really stuck on this.

This worries me a tad. "Slightly varying"? What magnitude of errors are you getting? Sounds like it might be an array out-of-bounds error.

In most cases such problems are caused by a faulty kernel (memory bounds, compilation problems, etc.).
My experience tells me that the argument "but it works with the non-atomic version" is rather weak, because a faulty kernel can lead to absolutely unimaginable behavior.

I would recommend the following:

  1. Try to run on a different device (if the problem remains there, then your own device is fine).
  2. Try to exclude as much code as you can (except the faulty part) from the kernel, to localize the problem.
  3. If you end up with a very simple kernel of roughly 100M atomic calls that reproducibly fails, you should post that kernel here; I think in that case the guys from Nvidia will take an interest (see the stripped-down sketch after this list).
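Such a stripped-down reproducer might look roughly like this (a sketch with made-up sizes, not the original kernel):

// Every thread does many atomic increments into a small buffer; if this alone
// misbehaves on the device, the problem is not in the Hough logic.
__global__ void atomicStress(unsigned int* counters, int numCounters, int itersPerThread)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < itersPerThread; ++i)
        atomicAdd(&counters[(tid + i) % numCounters], 1u);
}

// Host side, hypothetical launch: 1024 blocks of 256 threads, 400 iterations
// each = ~100M increments; the sum over all counters should be 1024*256*400.
// atomicStress<<<1024, 256>>>(d_counters, 256, 400);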

How do you allocate dst? And why do you need the cast? Have you tried just using an int*?
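If dst is only ever used as a counter array, it could be allocated as unsigned int directly, which makes the cast unnecessary; a minimal sketch (with a and n as in the kernel signature, and d_dst a hypothetical name):

// Allocate and clear the accumulator as unsigned int so the kernel can take
// unsigned int* dst and call the atomic on it without a cast.
unsigned int* d_dst = 0;
size_t bytes = (size_t)a * n * sizeof(unsigned int);
cudaMalloc((void**)&d_dst, bytes);
cudaMemset(d_dst, 0, bytes);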

And I still recommend trying emu mode; sometimes it does find the problem.
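If I remember the CUDA 2.x toolchain correctly, device emulation is enabled at compile time, for example (file names hypothetical):

nvcc -deviceemu hough.cu -o hough_emu.exe

The kernel then runs on the CPU, so a debugger and printf work inside it.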