Hey all,
Along with cbuchner1 and Keldor314, I’m taking a stab at an implementation of a fractal flame algorithm. As part of it, I was running some benchmarks, and got some results that were just absolutely weird.
I’ve attached a full copy of my PyCUDA benchmark code, but the relevant bits are:
__global__ void StraightAtomics(unsigned *fb) {
    unsigned offset = 242688 * blockIdx.x + threadIdx.x;
    offset &= 0xfffff;
    for (int i = 0; i < 512; i++) {
        for (int j = 0; j < $ROUNDS; j++) BoxMuller();
        atomicAdd(fb + offset, i);
        offset += 256;
        offset &= 0xfffff;
    }
}
__global__ void StraightNormals(unsigned *fb) {
    unsigned offset = 242688 * blockIdx.x + threadIdx.x;
    offset &= 0xfffff;
    for (int i = 0; i < 512; i++) {
        for (int j = 0; j < $ROUNDS; j++) BoxMuller();
        fb[offset] += i;
        offset += 256;
        offset &= 0xfffff;
    }
}
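(BoxMuller() lives in the attached benchmark and isn’t shown here; for anyone following along, the standard Box–Muller transform the kernels are spinning on looks like this quick host-side Python sketch — just the textbook math, not the device code from the attachment.)

```python
import math

def box_muller(u1, u2):
    """Standard Box-Muller transform: maps two uniform samples in
    (0, 1] to two independent standard-normal samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

z0, z1 = box_muller(0.5, 0.25)
# By construction z0^2 + z1^2 == -2 ln(u1), which makes a handy sanity check.
assert abs(z0 * z0 + z1 * z1 + 2.0 * math.log(0.5)) < 1e-12
```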
The crappy pointer arithmetic ensures that a) memory accesses are coalesced and b) memory operations don’t start at the same location. $ROUNDS is varied from 0 to 29, five runs of each function are performed, and the results are saved.
If the logic to prevent global memory conflicts isn’t there, performance is what you’d expect: the non-atomic implementation (inappropriately called “Normals”) beats the pants off the atomic one, since conflicting atomic requests have to be serialized. However, when there are no conflicts, these are the results:
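For the skeptical, here’s a hypothetical host-side replay of that addressing pattern — not part of the benchmark, and the block count is a made-up placeholder — checking that each warp’s 32 accesses land on consecutive words (so they coalesce) and that no two blocks start at the same offset:

```python
# Replays offset = (242688 * blockIdx + threadIdx) & 0xfffff on the host.
NBLOCKS, NTHREADS, MASK = 8, 256, 0xFFFFF  # NBLOCKS is an assumed grid size

starts = set()
for block in range(NBLOCKS):
    base = (242688 * block) & MASK
    starts.add(base)
    for warp in range(NTHREADS // 32):
        lanes = [(base + warp * 32 + lane) & MASK for lane in range(32)]
        # 32 consecutive addresses per warp -> a coalesced transaction
        assert lanes == list(range(lanes[0], lanes[0] + 32))

assert len(starts) == NBLOCKS  # no two blocks begin at the same location
```

(242688 is a multiple of 256, so every block’s base is 256-aligned and the mask never wraps mid-warp.)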
Rounds Normals Atomics
0 0.000917±0.00000 0.004624±0.00008
1 0.027546±0.00135 0.020884±0.00053
2 0.036111±0.00007 0.027453±0.00127
3 0.041217±0.00002 0.032355±0.00024
4 0.045586±0.00018 0.036154±0.00048
5 0.050101±0.00131 0.038987±0.00021
6 0.053252±0.00004 0.041376±0.00009
7 0.056971±0.00007 0.043718±0.00021
8 0.060746±0.00008 0.045821±0.00013
9 0.064611±0.00012 0.047931±0.00009
10 0.068391±0.00009 0.049803±0.00006
11 0.072300±0.00011 0.051715±0.00008
12 0.076367±0.00018 0.053554±0.00006
13 0.080500±0.00007 0.055464±0.00007
14 0.084474±0.00011 0.057433±0.00007
15 0.088776±0.00015 0.059370±0.00005
16 0.092849±0.00017 0.061340±0.00003
17 0.097094±0.00026 0.063281±0.00008
18 0.111041±0.01748 0.066393±0.00142
19 0.105781±0.00035 0.067214±0.00003
20 0.110084±0.00038 0.069422±0.00037
21 0.114514±0.00027 0.071344±0.00028
22 0.119140±0.00026 0.073255±0.00004
23 0.123246±0.00042 0.075432±0.00054
24 0.127901±0.00023 0.077364±0.00036
25 0.131909±0.00014 0.079238±0.00009
26 0.136760±0.00052 0.081254±0.00002
27 0.141022±0.00038 0.083338±0.00003
28 0.145320±0.00038 0.085451±0.00002
29 0.149576±0.00027 0.087835±0.00050
Note that as the number of rounds increases, both times increase more or less linearly, and the slope of the atomicAdd version is about half that of the non-atomic one. That’s absolutely bizarre, because the number of global memory operations is fixed regardless of the number of rounds of BoxMuller(). If the atomicAdd version were simply faster, you’d expect the two to increase at the same rate, separated by a constant reflecting the cost of the memory operations, but that’s not how this is playing out.
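To put a number on the “about half” claim, here’s a back-of-the-envelope endpoint-slope calculation from the table above (rounds 1 and 29; round 0 is excluded since with no inner work the timings collapse to a different regime):

```python
# Times (from the table) at rounds = 1 and rounds = 29, for each kernel.
normals = (0.027546, 0.149576)
atomics = (0.020884, 0.087835)

# Endpoint slope: added time per extra round of BoxMuller().
slope_n = (normals[1] - normals[0]) / 28
slope_a = (atomics[1] - atomics[0]) / 28

ratio = slope_a / slope_n  # comes out around 0.55
```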
I thought it might be something weird with the instruction cache, about which there’s very little public information, but frankly I have no clue what could be causing this. At one point I also speculated that integer adds were handled by the memory controller, but other benchmarks showed that wasn’t the case, at least on a Compute 1.1 device, and even if it were true it wouldn’t explain these results on its own.
Any clues?
Thanks,
Steven
cuda_bench.py.txt (8.5 KB)