AtomicAdd faster than coalesced add. What is going on? GTX 275, consistently reproducible

Hey all,

Along with cbuchner1 and Keldor314, I’m taking a stab at an implementation of a fractal flame algorithm. As part of it, I was running some benchmarks, and got some results that were just absolutely weird.

I’ve attached a full copy of my PyCUDA benchmark code, but the relevant bits are:

__global__ void StraightAtomics(unsigned *fb) {
    unsigned offset = 242688 * blockIdx.x + threadIdx.x;
    offset &= 0xfffff;
    for (int i = 0; i < 512; i++) {
        for (int j = 0; j < $ROUNDS; j++) BoxMuller();
        atomicAdd(fb + offset, i);
        offset += 256;
        offset &= 0xfffff;
    }
}

__global__ void StraightNormals(unsigned *fb) {
    unsigned offset = 242688 * blockIdx.x + threadIdx.x;
    offset &= 0xfffff;
    for (int i = 0; i < 512; i++) {
        fb[offset] += i;
        for (int j = 0; j < $ROUNDS; j++) BoxMuller();
        offset += 256;
        offset &= 0xfffff;
    }
}

The crappy pointer arithmetic ensures that a) memory accesses are coalesced and b) memory operations don't start at the same location. $ROUNDS is substituted into the kernel source by the benchmark script and varied from 0 to 29; five runs of each function are performed, and the results are saved.
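In case it helps, here is a rough plain-CUDA sketch of what the timing loop does. The real harness is the attached PyCUDA script; this version is just an untested analogue that fixes the round count at compile time instead of substituting $ROUNDS into the source. The launch geometry (90 blocks of 256 threads) and the 1M-word buffer implied by the 0xfffff mask are the same as in the benchmark; the rest (main(), event timing) is simplified scaffolding.

// Rough analogue of the attached PyCUDA harness. Paste the kernels above
// (plus the BoxMuller definition from the attached file) into the same .cu
// file, replacing $ROUNDS with a ROUNDS macro (e.g. nvcc -DROUNDS=5).
// Error checking omitted for brevity; only one of the two kernels is shown.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void StraightAtomics(unsigned *fb);
__global__ void StraightNormals(unsigned *fb);

int main() {
    const size_t n = 0xfffff + 1;             // offsets are masked to 20 bits -> 1M words
    unsigned *fb;
    cudaMalloc((void **)&fb, n * sizeof(unsigned));
    cudaMemset(fb, 0, n * sizeof(unsigned));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int run = 0; run < 5; run++) {        // five runs per kernel, as in the results below
        cudaEventRecord(start);
        StraightAtomics<<<90, 256>>>(fb);      // 90-block grid, 256-thread blocks
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("StraightAtomics run %d: %f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(fb);
    return 0;
}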

If the logic isn’t there to prevent global memory conflicts, performance is what you’d expect - the non-atomic implementation (inappropriately called “Normals”) beats the pants off the atomic one, as requests have to be serialized. However, when there are no conflicts, these are the results:

Rounds  Normals            Atomics
     0  0.000917±0.00000   0.004624±0.00008
     1  0.027546±0.00135   0.020884±0.00053
     2  0.036111±0.00007   0.027453±0.00127
     3  0.041217±0.00002   0.032355±0.00024
     4  0.045586±0.00018   0.036154±0.00048
     5  0.050101±0.00131   0.038987±0.00021
     6  0.053252±0.00004   0.041376±0.00009
     7  0.056971±0.00007   0.043718±0.00021
     8  0.060746±0.00008   0.045821±0.00013
     9  0.064611±0.00012   0.047931±0.00009
    10  0.068391±0.00009   0.049803±0.00006
    11  0.072300±0.00011   0.051715±0.00008
    12  0.076367±0.00018   0.053554±0.00006
    13  0.080500±0.00007   0.055464±0.00007
    14  0.084474±0.00011   0.057433±0.00007
    15  0.088776±0.00015   0.059370±0.00005
    16  0.092849±0.00017   0.061340±0.00003
    17  0.097094±0.00026   0.063281±0.00008
    18  0.111041±0.01748   0.066393±0.00142
    19  0.105781±0.00035   0.067214±0.00003
    20  0.110084±0.00038   0.069422±0.00037
    21  0.114514±0.00027   0.071344±0.00028
    22  0.119140±0.00026   0.073255±0.00004
    23  0.123246±0.00042   0.075432±0.00054
    24  0.127901±0.00023   0.077364±0.00036
    25  0.131909±0.00014   0.079238±0.00009
    26  0.136760±0.00052   0.081254±0.00002
    27  0.141022±0.00038   0.083338±0.00003
    28  0.145320±0.00038   0.085451±0.00002
    29  0.149576±0.00027   0.087835±0.00050

Note that as the number of rounds increases, both times increase roughly linearly, and the slope of the AtomicAdd version is about half that of the non-atomic one (roughly 0.0024 vs. 0.0044 per round between rounds 1 and 29). That's bizarre, because the number of global memory operations is fixed regardless of the number of rounds of BoxMuller(). If the AtomicAdd version were simply faster, you'd expect the two to increase at the same rate, separated by a constant offset reflecting the cost of the memory operations, but that's not how this is playing out.

I thought it might be something weird with the instruction cache, about which there is very little information, but frankly I have no clue what could be causing this. At one point I also speculated that the integer adds were handled by the memory controller, but other benchmarks showed that wasn't the case, at least on a Compute 1.1 device, and even if it were true it wouldn't explain these results on its own.

Any clues?

Thanks,

Steven
cuda_bench.py.txt (8.5 KB)

What’s your occupancy looking like? It might be that with the atomic, thread execution can continue without waiting for the result, whereas the standard read-modify-write has an implicit stall: the thread has to wait for the load to complete before it can add and write the result back. Good occupancy should hide this, but if you don’t have enough threads, you have to eat the stalls.
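If that's what's happening, one quick way to probe it would be something like this (untested sketch; "PrefetchNormals" is just a name I made up, and $ROUNDS / BoxMuller() are as in your kernels): issue the load at the top of the iteration so its latency can overlap with the BoxMuller work, and only consume the loaded value right before the store.

__global__ void PrefetchNormals(unsigned *fb) {
    unsigned offset = 242688 * blockIdx.x + threadIdx.x;
    offset &= 0xfffff;
    for (int i = 0; i < 512; i++) {
        unsigned cur = fb[offset];                  // load issued here...
        for (int j = 0; j < $ROUNDS; j++) BoxMuller();
        fb[offset] = cur + i;                       // ...but only consumed here
        offset += 256;
        offset &= 0xfffff;
    }
}

If the stall theory is right, this variant should track the atomic version much more closely; if it changes nothing, the compiler was probably already hoisting the load and something else is going on.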

24 warps/SM (256-thread blocks, 90-block grid, GTX 275). I'm a little less mystified after some more benchmarking, though (cf. http://forums.nvidia.com/index.php?showtopic=150856).