Some additional metrics I gathered were:
metric                      CUDA 6.5      CUDA 7.5      ratio
inst_per_warp               89686592.8    1348325546    15x
warp_execution_efficiency   46%           3%            15x
The huge difference in these two metrics led me to look for highly variable thread-to-thread behavior, specifically differences in the number of active threads per warp. The metrics indicate that, in the “slow” case, some warps execute for long periods of time with only one thread of the warp (1 of 32, ~3%) active. In the “fast” case this is less evident: the average number of instructions executed per warp is lower, and the average fraction of active threads per warp is higher (~46%).
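To make the arithmetic behind that reading explicit, here is a small host-side C++ sketch. The function names are mine, chosen for illustration; they are not part of MCX or the profiler. It only assumes the warp size of 32 and the usual reading of warp_execution_efficiency as average active threads per warp divided by the warp size.

```cpp
#include <cassert>

// Warp size on NVIDIA GPUs.
const int kWarpSize = 32;

// Fraction of a warp's lanes doing useful work when `active_threads`
// of the 32 threads are active (hypothetical helper, for illustration).
double lane_utilization(int active_threads) {
    assert(active_threads >= 0 && active_threads <= kWarpSize);
    return static_cast<double>(active_threads) / kWarpSize;
}

// Ratio between a "slow" and a "fast" profiler reading.
double slowdown_ratio(double slow, double fast) {
    assert(fast > 0.0);
    return slow / fast;
}
```

With these, lane_utilization(1) is 0.03125, which matches the ~3% reading, and lane_utilization(15) is about 0.47, close to the 46% reading. slowdown_ratio(1348325546.0, 89686592.8) comes out at roughly 15, in line with the table.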
My analysis so far shows that most of the execution time difference is associated with this section of the mcx_main_loop kernel:
if((mediaid==0 && (!gcfg->doreflect || (gcfg->doreflect && n1==gproperty[mediaid].w))) || f.t>gcfg->twin1){
    GPUDEBUG(("direct relaunch at idx=[%d] mediaid=[%d], ref=[%d]\n",idx1d,mediaid,gcfg->doreflect));
    if(launchnewphoton(&p,v,&f,&rv,&prop,&idx1d,&mediaid,&w0,&Lmove,(mediaidold & DET_MASK),ppath,
           &energyloss,&energylaunched,n_det,detectedphoton,t,tnew,photonseed,media,srcpattern,idx,(RandType*)n_seed,seeddata))
        break;
    continue;
}
Each thread runs roughly 120,000 times through the main while-loop in the kernel. Of those iterations, the if-condition is satisfied and the block is entered about 610 times per thread.
Measured with clock64() in the CUDA 6.5 case, this block of code uses about 50,000,000 clocks per thread in aggregate across the 120,000 iterations, and that figure is fairly constant (+/- 20%) across threads. Most of those clocks are spent in the ~610 iterations in which the if-block is actually entered.
In the CUDA 7.5 case, the number of while-loop iterations is approximately the same (~120,000), and the if-block is entered the same ~610 times per thread, but the aggregate clock measurement is highly variable, from a low of around 20,000,000 clocks per thread to a high of around 12,000,000,000 clocks per thread. These excursions to very high values occur often enough across threads that many, if not most, warps are affected. They are driving the huge difference in the two metrics I referenced earlier.
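As a rough illustration of why per-thread excursions like these could show up as poor warp efficiency, here is a toy host-side C++ model. It rests on a simplifying assumption of mine: every lane of a warp is held until the slowest lane finishes, so the useful fraction of lane-time is the sum of the per-lane clocks divided by (lanes x slowest lane). This is not how the profiler computes warp_execution_efficiency, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Toy model (an assumption, not the profiler's definition): all lanes wait
// for the slowest lane, so utilization = sum(t_i) / (lanes * max(t_i)).
double lane_time_utilization(const std::vector<double>& clocks_per_lane) {
    assert(!clocks_per_lane.empty());
    const double slowest =
        *std::max_element(clocks_per_lane.begin(), clocks_per_lane.end());
    assert(slowest > 0.0);
    const double total =
        std::accumulate(clocks_per_lane.begin(), clocks_per_lane.end(), 0.0);
    return total / (static_cast<double>(clocks_per_lane.size()) * slowest);
}
```

Under this model, 32 lanes with identical timings give a utilization of 1.0, while one lane at 12,000,000,000 clocks next to 31 lanes with negligible time gives 1/32, i.e. about 3%. Real warps sit somewhere in between, so treat this only as a way to reason about the direction of the effect.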
Presumably the main issue is the behavior of launchnewphoton, which is a fairly involved function. This function is also called at one other place, once per kernel launch, just prior to the while-loop. Timing for that initial call in the CUDA 6.5 case also varies across threads, from around 700 clocks at the low end to 4500 clocks for a few threads at the high end. The majority of threads are at ~1000 clocks +/- 30%. If I divide the aggregate time for the function call in the while loop by the number of times it is entered (610), I get about 65,000 to 82,000 clocks per call (for aggregates of 40,000,000 to 50,000,000 clocks), so something is very different even in the CUDA 6.5 case between the cost of this function before the while loop and its cost inside the while loop.
For the CUDA 7.5 case, the timing of the initial call varies across threads from a low of around 300 clocks to a high of almost 3000 clocks. Within the while loop the function is again called 610 times per thread, but with huge variability between threads in the same warp: the worst-case aggregate timing divided by 610 yields an average of about 20,000,000 clocks for a single call of the function.
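The per-call figures above are just the aggregate clock count divided by the number of times the if-block is entered. A minimal C++ sketch of that division, with a helper name of my own choosing:

```cpp
#include <cassert>
#include <cstdint>

// Average clocks per entry of the if-block: aggregate clocks measured
// around the block, divided by the number of times it was entered.
// Integer division, so the result is truncated.
std::int64_t clocks_per_call(std::int64_t aggregate_clocks, std::int64_t calls) {
    assert(calls > 0);
    return aggregate_clocks / calls;
}
```

For example, clocks_per_call(50000000, 610) gives 81967 and clocks_per_call(40000000, 610) gives 65573, versus roughly 1000 clocks for the call made before the while-loop. For the worst CUDA 7.5 threads, clocks_per_call(12000000000LL, 610) gives 19672131, i.e. about 20,000,000 clocks per call.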
Anyway there is significant variability in the launchnewphoton function. I’m not sure yet if it is a data-dependent loop variation (there are various loops in the function), some kind of compiler bug, or something else.
Here is some “raw” data. c1 is the number of times the if-statement body is executed. t1 is the clock64() timing, in clocks, for the first execution of the launchnewphoton function (the one prior to the while-loop). t2 is the aggregate time, in clocks, for the section of code shown above, summed across all iterations of the while loop.
For the CUDA 6.5 case:
thread: 1420, c1: 611, t1: 2859, t2: 49070096
thread: 1421, c1: 611, t1: 2859, t2: 42560849
thread: 1422, c1: 611, t1: 2859, t2: 50754031
thread: 1423, c1: 611, t1: 2859, t2: 44682728
thread: 1424, c1: 611, t1: 2859, t2: 43948369
thread: 1425, c1: 611, t1: 2859, t2: 41965275
thread: 1426, c1: 611, t1: 2859, t2: 44488303
thread: 1427, c1: 611, t1: 2859, t2: 48930834
thread: 1428, c1: 611, t1: 2859, t2: 44219806
thread: 1429, c1: 611, t1: 2859, t2: 52256397
thread: 1430, c1: 611, t1: 2859, t2: 50421419
thread: 1431, c1: 611, t1: 2859, t2: 49058472
thread: 1432, c1: 611, t1: 2859, t2: 50906550
thread: 1433, c1: 611, t1: 2859, t2: 52490246
thread: 1434, c1: 611, t1: 2859, t2: 43482271
thread: 1435, c1: 611, t1: 2859, t2: 50507306
thread: 1436, c1: 611, t1: 2859, t2: 50133962
thread: 1437, c1: 611, t1: 2859, t2: 49184269
thread: 1438, c1: 611, t1: 2859, t2: 44272766
thread: 1439, c1: 611, t1: 2859, t2: 41139068
thread: 1440, c1: 611, t1: 2600, t2: 53743146
thread: 1441, c1: 611, t1: 2600, t2: 53142962
thread: 1442, c1: 611, t1: 2600, t2: 40498249
thread: 1443, c1: 611, t1: 2600, t2: 47598617
thread: 1444, c1: 611, t1: 2600, t2: 48931713
thread: 1445, c1: 611, t1: 2600, t2: 47855590
thread: 1446, c1: 611, t1: 2600, t2: 51535408
thread: 1447, c1: 611, t1: 2600, t2: 50559446
thread: 1448, c1: 611, t1: 2600, t2: 46043539
thread: 1449, c1: 611, t1: 2600, t2: 47411201
thread: 1450, c1: 611, t1: 2600, t2: 46184805
thread: 1451, c1: 611, t1: 2600, t2: 53590791
thread: 1452, c1: 611, t1: 2600, t2: 43928864
thread: 1453, c1: 611, t1: 2600, t2: 51324045
thread: 1454, c1: 611, t1: 2600, t2: 47044995
thread: 1455, c1: 611, t1: 2600, t2: 39032208
thread: 1456, c1: 611, t1: 2600, t2: 39834598
thread: 1457, c1: 611, t1: 2600, t2: 49439931
thread: 1458, c1: 611, t1: 2600, t2: 45755023
thread: 1459, c1: 611, t1: 2600, t2: 45037481
thread: 1460, c1: 611, t1: 2600, t2: 46636916
thread: 1461, c1: 611, t1: 2600, t2: 48531856
thread: 1462, c1: 611, t1: 2600, t2: 47624660
thread: 1463, c1: 611, t1: 2600, t2: 49262769
thread: 1464, c1: 611, t1: 2600, t2: 50782732
For the CUDA 7.5 case:
thread: 1420, c1: 611, t1: 2746, t2: 7185675356
thread: 1421, c1: 611, t1: 2746, t2: 3768470065
thread: 1422, c1: 611, t1: 2746, t2: 5976327405
thread: 1423, c1: 611, t1: 2746, t2: 8327436533
thread: 1424, c1: 611, t1: 2746, t2: 11249497893
thread: 1425, c1: 611, t1: 2746, t2: 8007521379
thread: 1426, c1: 611, t1: 2746, t2: 5319583916
thread: 1427, c1: 611, t1: 2746, t2: 4492991330
thread: 1428, c1: 611, t1: 2746, t2: 10530867654
thread: 1429, c1: 611, t1: 2746, t2: 11564183908
thread: 1430, c1: 611, t1: 2746, t2: 6381597042
thread: 1431, c1: 611, t1: 2746, t2: 1806753661
thread: 1432, c1: 611, t1: 2746, t2: 2967708785
thread: 1433, c1: 611, t1: 2746, t2: 2533653114
thread: 1434, c1: 611, t1: 2746, t2: 10199807708
thread: 1435, c1: 611, t1: 2746, t2: 9802474834
thread: 1436, c1: 611, t1: 2746, t2: 4096053944
thread: 1437, c1: 611, t1: 2746, t2: 10869798193
thread: 1438, c1: 611, t1: 2746, t2: 2189946079
thread: 1439, c1: 611, t1: 2746, t2: 5658245992
thread: 1440, c1: 611, t1: 1892, t2: 7462091381
thread: 1441, c1: 611, t1: 1892, t2: 2882920306
thread: 1442, c1: 611, t1: 1892, t2: 1438788196
thread: 1443, c1: 611, t1: 1892, t2: 8975372635
thread: 1444, c1: 611, t1: 1892, t2: 651138124
thread: 1445, c1: 611, t1: 1892, t2: 5991849309
thread: 1446, c1: 611, t1: 1892, t2: 4781668354
thread: 1447, c1: 611, t1: 1892, t2: 5200938966
thread: 1448, c1: 611, t1: 1892, t2: 8272737768
thread: 1449, c1: 611, t1: 1892, t2: 10302483083
thread: 1450, c1: 611, t1: 1892, t2: 7919937159
thread: 1451, c1: 611, t1: 1892, t2: 11364765771
thread: 1452, c1: 611, t1: 1892, t2: 315029580
thread: 1453, c1: 611, t1: 1892, t2: 2165642996
thread: 1454, c1: 611, t1: 1892, t2: 11004052995
thread: 1455, c1: 611, t1: 1892, t2: 10007071767 ** a really high value
thread: 1456, c1: 611, t1: 1892, t2: 14798880 ** a really low value
thread: 1457, c1: 611, t1: 1892, t2: 9342330561
thread: 1458, c1: 611, t1: 1892, t2: 8624928097
thread: 1459, c1: 611, t1: 1892, t2: 4071838715
thread: 1460, c1: 611, t1: 1892, t2: 6361232441
thread: 1461, c1: 611, t1: 1892, t2: 7079097951
thread: 1462, c1: 611, t1: 1892, t2: 3706127561
thread: 1463, c1: 611, t1: 1892, t2: 5604827582
thread: 1464, c1: 611, t1: 1892, t2: 1035806602
thread: 1465, c1: 611, t1: 1892, t2: 1749129249
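To summarize listings like these without eyeballing them, something along the following lines could be used. This is a sketch in host-side C++; the struct and function names are hypothetical, and it assumes each data line has exactly the "thread: …, c1: …, t1: …, t2: …" layout shown above (trailing annotations such as "** a really high value" are ignored).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// One line of the raw data above.
struct Sample { int thread; int c1; long long t1; long long t2; };

// Parse a single line; returns false for lines that are not data lines
// (for example the "For the CUDA 6.5 case:" headers).
bool parse_sample(const std::string& line, Sample& out) {
    return std::sscanf(line.c_str(), "thread: %d, c1: %d, t1: %lld, t2: %lld",
                       &out.thread, &out.c1, &out.t1, &out.t2) == 4;
}

// Smallest and largest t2 in a set of samples, plus their ratio.
struct Spread { long long min_t2; long long max_t2; double max_over_min; };

Spread t2_spread(const std::vector<Sample>& samples) {
    assert(!samples.empty());
    auto mm = std::minmax_element(
        samples.begin(), samples.end(),
        [](const Sample& a, const Sample& b) { return a.t2 < b.t2; });
    assert(mm.first->t2 > 0);
    return { mm.first->t2, mm.second->t2,
             static_cast<double>(mm.second->t2) /
             static_cast<double>(mm.first->t2) };
}
```

Reading the values off the listings directly, t2 for CUDA 6.5 spans 39,032,208 to 53,743,146 (a max/min ratio of about 1.4), while for CUDA 7.5 it spans 14,798,880 to 11,564,183,908 (a ratio of roughly 780).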