Atomic Question

I want to perform the following operation:

    accum += (float) trg[pos];

where pos is calculated separately in each thread.

Is this going to cause corruption of accum?

I can get around it by storing all the per-thread values in an array and adding them together back on the CPU, but that's going to use a ton of memory that I don't think I'm going to have going spare.
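
A rough sketch of what I mean (with trg's element type simplified to int here):

    // Workaround sketch: every thread writes its own value, so there
    // are no conflicting writes, and the CPU sums the array afterwards.
    // The downside is that it needs n floats of extra storage.
    __global__ void storeAll(const int *trg, float *vals, int n)
    {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread index
        if (pos < n)
            vals[pos] = (float) trg[pos];
    }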

On another note, can I pass C++ structures to CUDA?
I have a structure l_dash, and I need access to l_dash->height and l_dash->width in my GPU code.
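
For reference, it's just plain data, something like this (simplified); I'm hoping I can pass it to the kernel by value:

    // Simplified stand-in for my real struct (plain POD data).
    struct LDash { int width; int height; };

    // My guess: kernel arguments are copied to the device, so a POD
    // struct passed by value should be readable inside the kernel.
    __global__ void myKernel(LDash l)
    {
        int area = l.width * l.height;   // the fields I need on the GPU
        // ... use area ...
    }

    // Host side: dereference the host pointer to pass a copy by value
    // myKernel<<<grid, block>>>(*l_dash);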

Cheers,

Chris

    accum += (float) trg[pos];

This is a typical data-collation problem faced at the end of a kernel.
The obvious way is to send the partial results back to the CPU for collation.
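
For example (a rough sketch; d_partials and numBlocks are placeholders for however you arrange one partial result per block):

    // Host-side collation: copy the per-block partial sums back and
    // add them up on the CPU.
    float *h_partials = (float *) malloc(numBlocks * sizeof(float));
    cudaMemcpy(h_partials, d_partials, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    float accum = 0.0f;
    for (int i = 0; i < numBlocks; ++i)
        accum += h_partials[i];       // final reduction on the host
    free(h_partials);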

Alternatively, you could try block synchronization to achieve this.
Refer to my post where block sync seemed to work on my hardware!
http://forums.nvidia.com/index.php?showtopic=44144

I would first transfer the collated data to shared memory and then apply block synchronization. The code would typically look like this:

    WAIT(S, G)

    // Master thread only (or staged through a shared memory region,
    // depending on your implementation)
    accum += (float) trg[pos];

    SIGNAL(S)

You don’t need block synchronization to do this. There is a simple and efficient way to perform accumulation (and many related computations) in parallel on the GPU. I suggest you have a look at the “scan” example in the SDK. Read the associated white paper: it is very informative.
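
To give a flavour of it, a per-block tree reduction looks roughly like this (a sketch in the spirit of the SDK samples, not the actual sample code; each block writes one partial sum, which you can then collate on the CPU as shown above, or with a second kernel pass):

    // Each block reduces blockDim.x values of trg (element type assumed
    // int) into one partial sum. blockDim.x must be a power of two.
    // Launch: reducePartial<<<numBlocks, threads, threads * sizeof(float)>>>(...)
    __global__ void reducePartial(const int *trg, float *d_partials, int n)
    {
        extern __shared__ float sdata[];            // one float per thread
        int tid = threadIdx.x;
        int pos = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (pos < n) ? (float) trg[pos] : 0.0f;
        __syncthreads();

        // Halve the number of active threads each step, summing pairs.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)                               // thread 0 writes the block's sum
            d_partials[blockIdx.x] = sdata[0];
    }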