Problem with execution time of atomic functions: atomicInc() on a 1.3 machine vs. a 1.1 machine

Hi all,

I am running my code on two machines:
GeForce 8800 GT: compute capability 1.1
Quadro FX 5800: compute capability 1.3

I use one atomicInc() to assign a value to a variable. The atomic increment operates on an unsigned int variable in device memory.
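A minimal sketch of this pattern, assuming each thread uses the value returned by atomicInc() as its assigned index. The names `assignSlots`, `runAssignSlots`, `counter` and `slots` are hypothetical and only for illustration; the real kernel does more work than this.

```cuda
#include <cuda_runtime.h>

// Each thread atomically increments one counter in device (global) memory
// and stores the returned old value. atomicInc() wraps to 0 only when the
// old value reaches the limit, so the limit is set to the maximum unsigned int.
__global__ void assignSlots(unsigned int *counter, unsigned int *slots, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        slots[tid] = atomicInc(counter, 0xFFFFFFFFu);
    }
}

// Hypothetical host helper: launches the kernel for n threads (n > 0) and
// returns the final counter value. Error checking is omitted for brevity.
unsigned int runAssignSlots(unsigned int n)
{
    unsigned int *dCounter = 0;
    unsigned int *dSlots = 0;
    unsigned int hCounter = 0;

    cudaMalloc((void **)&dCounter, sizeof(unsigned int));
    cudaMalloc((void **)&dSlots, n * sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    unsigned int threads = 256;
    unsigned int blocks = (n + threads - 1) / threads;
    assignSlots<<<blocks, threads>>>(dCounter, dSlots, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dCounter);
    cudaFree(dSlots);
    return hCounter;
}
```

Since every thread with `tid < n` performs exactly one increment, the counter should end up equal to `n`, and all threads contend for the same address.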

I am seeing a serious timing overhead when doing this on the Quadro machine.

My kernel with atomicInc() takes:

On the Quadro: 5.8 msec
On the 8800 GT: 4.7 msec

Both run exactly the same code.

My kernel without atomicInc() (i.e. I replace the result of atomicInc() with zero) takes:

On the Quadro: 1.9 msec
On the 8800 GT: 3.2 msec
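A minimal sketch of how such a with/without comparison could be timed, assuming CUDA events are used for the measurement. The names `testKernel`, `timeKernel` and the `useAtomic` flag are hypothetical, and the kernel body is a stand-in rather than the actual kernel.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel: with useAtomic set, the stored value comes from atomicInc()
// on a counter in device memory; otherwise it is replaced by zero.
__global__ void testKernel(unsigned int *counter, unsigned int *out,
                           unsigned int n, int useAtomic)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        unsigned int idx = useAtomic ? atomicInc(counter, 0xFFFFFFFFu) : 0u;
        out[tid] = idx;
    }
}

// Returns the elapsed kernel time in milliseconds for n threads (n > 0).
// Error checking is omitted for brevity.
float timeKernel(unsigned int n, int useAtomic)
{
    unsigned int *dCounter = 0;
    unsigned int *dOut = 0;
    cudaMalloc((void **)&dCounter, sizeof(unsigned int));
    cudaMalloc((void **)&dOut, n * sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    unsigned int threads = 256;
    unsigned int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    testKernel<<<blocks, threads>>>(dCounter, dOut, n, useAtomic);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dCounter);
    cudaFree(dOut);
    return ms;
}
```

Calling `timeKernel(n, 1)` and `timeKernel(n, 0)` with the same `n` on each card gives the two numbers to compare. The first launch in a process usually includes one-time setup cost, so an untimed warm-up launch before measuring is a reasonable precaution.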

So the code without the atomic function scales well from the 8800 GT to the Quadro, but does very badly on the Quadro when using atomicInc(). Is there any reason for this behavior?

Thanks to you all for the help!


Is the number of blocks the same in both cases?

Yes, the number of blocks is the same. The whole code is exactly the same. I was just trying to see the speed-up that I would get from a faster 1.3 machine.

Please note that it is just the atomic operation that takes much more time to execute on the Quadro machine than on the 8800 GT.