Cost of shared memory write

Hi guys,

I’ve looked around and can’t find much information on shared memory write times. I’ve already optimized my code to be conflict-free when reading (i.e. 0 warp_serializations when reading from shared); however, I seem to be paying a huge price when writing to shared memory.

I’m launching a grid of blocks with 16x16 threads each. Each block reads some values from global memory into shared memory and begins processing. The reads are aligned so that I get 0 serializations. After some amount of processing I need to synchronize down the columns of the matrix, so I have a 16-element array in shared memory. Each thread performs an atomicOr() on one location in that shared memory array.
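To make the setup concrete, here is a minimal sketch of the pattern described above rather than my actual kernel. The names columnOrKernel, colFlags and runColumnOr are made up for this post, and the row-major 16x16 tile layout and the one-slot-per-column indexing are assumptions. It also assumes a device that supports atomics on shared memory (compute capability 1.2 or later).

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // 16x16 threads per block, as in the post

// Hypothetical kernel: every thread ORs one value into the shared slot
// that belongs to its column.
__global__ void columnOrKernel(const unsigned int* flags, unsigned int* colOut)
{
    __shared__ unsigned int colFlags[TILE];  // one slot per column (assumed)

    const int col = threadIdx.x;
    const int row = threadIdx.y;

    // The first row clears the slots before any thread accumulates into them.
    if (row == 0) colFlags[col] = 0u;
    __syncthreads();

    // Assumed input layout: one row-major 16x16 tile per block.
    const unsigned int v = flags[blockIdx.x * TILE * TILE + row * TILE + col];

    // The 16 threads of a column (one per row) all target the same shared word.
    atomicOr(&colFlags[col], v);
    __syncthreads();

    // The first row writes the per-column result back to global memory.
    if (row == 0) colOut[blockIdx.x * TILE + col] = colFlags[col];
}

// Hypothetical host helper: runs the kernel over numBlocks tiles and returns
// the OR of each column, TILE results per block. Error checking is omitted.
std::vector<unsigned int> runColumnOr(const std::vector<unsigned int>& flags,
                                      int numBlocks)
{
    std::vector<unsigned int> out(numBlocks * TILE);
    unsigned int* dFlags = nullptr;
    unsigned int* dOut = nullptr;

    cudaMalloc(&dFlags, flags.size() * sizeof(unsigned int));
    cudaMalloc(&dOut, out.size() * sizeof(unsigned int));
    cudaMemcpy(dFlags, flags.data(), flags.size() * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    columnOrKernel<<<numBlocks, dim3(TILE, TILE)>>>(dFlags, dOut);

    // This copy blocks until the kernel has finished.
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    cudaFree(dFlags);
    cudaFree(dOut);
    return out;
}
```

Calling runColumnOr(flags, 1) with a 256-element input returns 16 values, one OR per column. In my real kernel the atomicOr() sits inside the loop that runs 48 times.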

Simply adding the atomicOr() operation takes my kernel time (obtained via the profiler) from 20ms to 500ms. Since the loop runs 48 times, that works out to (500 - 20) / 48 = 10ms per iteration for the atomicOr() alone, which seems extremely expensive.

Is there a way to estimate the cost of writes to shared memory? Is there a better way to do this? My code writes one value to each bank, so I shouldn’t have any bank conflicts.

Thanks,
Jon