Adding array elements in shared memory

I am using a C870, which does not have atomic operations. To record particle interactions I keep a 0-1 array in shared memory and then add up its contents to record whether an interaction took place between the particles just considered. Currently the thread with the lowest thread ID performs this addition and writes the result to global memory. How else can this addition and write be performed?

So you are essentially doing a stream compaction? Then a scan is what you want. See the SDK example for scan, and search the forums for stream compaction to get the details (Google: site:forums.nvidia.com "stream compaction").
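To make the scan suggestion concrete, here is a minimal sketch of a single-block scan over the 0-1 flags. It is not the SDK code. It is a simple double-buffered (Hillis-Steele style) scan that only uses __syncthreads(), so it needs no atomics. The names SCAN_N, scanFlags and scanFlagsOnDevice are made up for illustration, and the flags are read from global memory here only as a stand-in for the array you already have in shared memory.

```cuda
#include <cuda_runtime.h>

#define SCAN_N 16  // assumed number of flags = threads in the single block

// Inclusive scan in shared memory, then converted to exclusive offsets.
// offsets[i] = number of set flags before element i (its compacted slot),
// *total     = number of set flags in the whole array.
__global__ void scanFlags(const int *flags, int *offsets, int *total)
{
    __shared__ int buf[2][SCAN_N];  // two buffers avoid read/write races
    int tid = threadIdx.x;
    int in = 0, out = 1;

    buf[in][tid] = flags[tid];
    __syncthreads();

    for (int d = 1; d < SCAN_N; d <<= 1) {
        if (tid >= d)
            buf[out][tid] = buf[in][tid] + buf[in][tid - d];
        else
            buf[out][tid] = buf[in][tid];
        __syncthreads();
        in ^= 1;   // swap roles: the buffer just written becomes the input
        out ^= 1;
    }

    offsets[tid] = buf[in][tid] - flags[tid];  // inclusive -> exclusive
    if (tid == SCAN_N - 1)
        *total = buf[in][tid];
}

// Hypothetical host helper (error checking omitted for brevity).
// hostFlags and hostOffsets must each hold SCAN_N ints. Returns the total.
int scanFlagsOnDevice(const int *hostFlags, int *hostOffsets)
{
    int *dFlags = 0, *dOffsets = 0, *dTotal = 0;
    int total = 0;
    size_t bytes = SCAN_N * sizeof(int);

    cudaMalloc((void **)&dFlags, bytes);
    cudaMalloc((void **)&dOffsets, bytes);
    cudaMalloc((void **)&dTotal, sizeof(int));
    cudaMemcpy(dFlags, hostFlags, bytes, cudaMemcpyHostToDevice);

    scanFlags<<<1, SCAN_N>>>(dFlags, dOffsets, dTotal);

    cudaMemcpy(hostOffsets, dOffsets, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dFlags);
    cudaFree(dOffsets);
    cudaFree(dTotal);
    return total;
}
```

If you only need to know how many interactions occurred, the total alone is enough and the offsets can be dropped. The offsets matter only if you want to write the interacting pairs out compactly.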

What is your block size (blockDim.x), and how many elements do you want to add?

The block size is currently 16, but I intend to increase it to 64 because the occupancy calculator suggests that a block size of 64 gives the best occupancy. So I currently want to add 16 elements (and possibly 64 in later code).
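For sizes like 16 or 64, one alternative to having a single thread do the whole addition is a tree reduction in shared memory, where the number of active threads halves at every step. The following is only a sketch under some assumptions: BLOCK_SIZE is a power of two equal to blockDim.x, one flag per thread, and the names sumFlags and sumFlagsOnDevice are invented for the example. The load from global memory stands in for however the 0-1 array gets filled in the real kernel.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed; must be a power of two (16 works as well)

// Each block adds up its BLOCK_SIZE flags in log2(BLOCK_SIZE) steps and
// thread 0 writes the block's total to global memory.
__global__ void sumFlags(const int *flags, int *blockTotals)
{
    __shared__ int s[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    s[tid] = flags[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();  // outside the if, so every thread reaches it
    }

    if (tid == 0)
        blockTotals[blockIdx.x] = s[0];
}

// Hypothetical host helper (error checking omitted for brevity).
// hostFlags must hold numBlocks * BLOCK_SIZE ints. Returns the grand total.
int sumFlagsOnDevice(const int *hostFlags, int numBlocks)
{
    int n = numBlocks * BLOCK_SIZE;
    int *dFlags = 0, *dTotals = 0;

    cudaMalloc((void **)&dFlags, n * sizeof(int));
    cudaMalloc((void **)&dTotals, numBlocks * sizeof(int));
    cudaMemcpy(dFlags, hostFlags, n * sizeof(int), cudaMemcpyHostToDevice);

    sumFlags<<<numBlocks, BLOCK_SIZE>>>(dFlags, dTotals);

    int *hTotals = (int *)malloc(numBlocks * sizeof(int));
    cudaMemcpy(hTotals, dTotals, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);

    int total = 0;
    for (int b = 0; b < numBlocks; ++b)
        total += hTotals[b];  // per-block totals are combined on the host

    free(hTotals);
    cudaFree(dFlags);
    cudaFree(dTotals);
    return total;
}
```

With only 16 elements the gain over a serial loop in one thread is likely small (4 steps instead of 15 additions), but at 64 elements the reduction takes 6 steps instead of 63. The final write is still done by a single thread, which is fine since only one value per block goes to global memory.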