Atomic performance under Fermi: need a solution to scattered write problem

springc · September 1, 2011, 5:00pm

Hi, I have an application that does scattered writes: each thread writes to random memory locations. We tested atomic operations on a GTX 295 card earlier and the performance was very bad. I have heard that Fermi cards have a cache that speeds up atomic operations tremendously. I don't have a card at hand, but I am wondering how fast it can be. Is atomicAdd 10 times slower than a normal write, or better?
Also, does the performance depend on the degree of collision? Suppose 1000 threads write at the same time. With 10 collisions versus 100 collisions, each involving 2 threads writing to the same location, will the performance be the same, or will the case with fewer collisions run faster? And if in one case 10 threads write to the same location, will that run slower?

seibert · September 3, 2011, 3:27pm

atomicAdd is much faster on Fermi thanks to the L2 cache. In cases with minimal collisions, atomicAdd is about 20x faster on Fermi than on the GTX 200 series. Because atomics (by definition) serialize accesses to the same memory location, your performance does depend on the collision rate.
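To see how the collision rate affects a particular card, one option is a small micro-benchmark like the sketch below. It is not taken from anyone's application: the kernel name `scatterAdd`, the helper `timeScatterAdd`, and the integer hash standing in for the "random" index are all assumptions made for illustration. Shrinking `numSlots` forces more threads onto the same address, which raises the collision rate.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds 1 to a pseudo-random slot. Fewer slots means more
// threads hit the same address, so more of the atomics serialize.
// Assumes numSlots > 0.
__global__ void scatterAdd(unsigned int *out, unsigned int numSlots, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // Cheap integer hash, a stand-in for the application's random index.
    unsigned int h = tid * 2654435761u;
    h ^= h >> 16;
    atomicAdd(&out[h % numSlots], 1u);
}

// Hypothetical helper: launches the kernel for n threads, returns the
// elapsed time in milliseconds, and stores the sum over all slots in
// *total (which should equal n if no update was lost).
float timeScatterAdd(unsigned int numSlots, unsigned int n, unsigned long long *total)
{
    unsigned int *dOut = nullptr;
    cudaMalloc(&dOut, numSlots * sizeof(unsigned int));
    cudaMemset(dOut, 0, numSlots * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    scatterAdd<<<blocks, threads>>>(dOut, numSlots, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Copy the slots back and total them on the host as a sanity check.
    std::vector<unsigned int> host(numSlots);
    cudaMemcpy(host.data(), dOut, numSlots * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    *total = 0;
    for (unsigned int v : host) *total += v;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return ms;
}
```

Calling `timeScatterAdd` with the same `n` and a decreasing `numSlots` (for example one slot per thread, then 1024 slots, then a single slot) gives a rough picture of how the timing changes with contention on the card being tested. Error checking on the CUDA calls is omitted to keep the sketch short.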

Justin_Luitjens · September 3, 2011, 4:37pm

The performance is dependent on the number of collisions as it takes time to resolve the conflicts.

Generally it is a good idea to avoid atomics if you are doing a lot of them or expecting a lot of collisions.
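One way to sidestep atomics for a scattered accumulate, offered here only as an illustration and not as something proposed in this thread, is to sort the writes by destination and then sum each run of equal destinations. The sketch below uses Thrust for this; the helper name `scatterAddSorted` and its signature are hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical helper: computes out[idx[i]] += val[i] for all i without
// atomics. Assumes every idx[i] lies in [0, numSlots).
std::vector<float> scatterAddSorted(const std::vector<int> &idx,
                                    const std::vector<float> &val,
                                    int numSlots)
{
    thrust::device_vector<int> keys(idx.begin(), idx.end());
    thrust::device_vector<float> vals(val.begin(), val.end());

    // Bring all writes to the same destination next to each other.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // Sum each run of equal keys into one (destination, total) pair.
    thrust::device_vector<int> uniqueKeys(keys.size());
    thrust::device_vector<float> sums(keys.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      uniqueKeys.begin(), sums.begin());
    size_t numUnique = ends.first - uniqueKeys.begin();

    // Each destination now appears once, so a plain scatter is race-free.
    thrust::device_vector<float> out(numSlots, 0.0f);
    thrust::scatter(sums.begin(), sums.begin() + numUnique,
                    uniqueKeys.begin(), out.begin());

    // Copy the result back to the host.
    std::vector<float> result(numSlots);
    thrust::copy(out.begin(), out.end(), result.begin());
    return result;
}
```

The sort is not free, so whether this beats atomicAdd depends on the problem size and the collision rate; it is worth timing both on the target card.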

seibert · September 3, 2011, 6:10pm

I find the speed of atomics on Fermi now to be sufficient for general purpose histogramming, at least in cases where I have > 50 bins. There are more sophisticated histogram techniques to reduce the usage of atomics, but when the histogram filling operation itself is not the primary bottleneck, a simple atomicAdd is pretty good.
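For reference, here is a minimal sketch of both approaches: the simple one with a global atomicAdd per element, and one common way of reducing global atomics, a per-block histogram in shared memory that is merged at the end. The bin count `NUM_BINS`, the kernel names, the launch configuration, and the `histogram` host helper are assumptions for illustration, and the post above does not say which of the more sophisticated techniques it has in mind.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 64  // assumed bin count, chosen only for illustration

// Simple version: one global atomicAdd per input element.
__global__ void histGlobal(const unsigned char *data, unsigned int n,
                           unsigned int *bins)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}

// Privatized version: each block fills its own histogram in shared
// memory, then adds its non-empty bins to the global histogram.
__global__ void histShared(const unsigned char *data, unsigned int n,
                           unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local[data[i] % NUM_BINS], 1u);
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] != 0) atomicAdd(&bins[b], local[b]);
}

// Hypothetical host helper: histograms the input with either kernel.
std::vector<unsigned int> histogram(const std::vector<unsigned char> &input,
                                    bool useShared)
{
    unsigned char *dData = nullptr;
    unsigned int *dBins = nullptr;
    unsigned int n = static_cast<unsigned int>(input.size());
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));

    if (useShared)
        histShared<<<64, 256>>>(dData, n, dBins);
    else
        histGlobal<<<64, 256>>>(dData, n, dBins);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> out(NUM_BINS);
    cudaMemcpy(out.data(), dBins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);
    return out;
}
```

Both kernels should produce identical bins, so calling `histogram` with `useShared` set to `false` and then `true` on the same input is a quick correctness check before comparing their timings.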
