CUDA 7.5 on a Maxwell 980Ti drops performance by 10x versus CUDA 7.0 and 6.5

When running the same binary on the same card, the MCX results are almost always reproducible when the PRNG seed is set (via the second line of the input file, or the -E flag).

The difference you noticed (also the absorption fraction at the bottom) is, as you mentioned, likely related to the relaxed math functions in different versions of CUDA. This is expected.

I also noticed that, when launching the same binary on different GPUs (980Ti vs 590), even if I use the same RNG seed, photon number, grid and block size, the results are not identical (except that launching on either core of the 590 produces identical results). I assume this is related to the separately generated assembly for different architectures, and perhaps to hardware differences.

And yes, by default the PRNG seed (which seeds the CPU RNG, which in turn seeds the GPU RNG) is a fixed number:

https://github.com/fangq/mcx/blob/master/src/mcx_utils.c#L99
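In pseudo-code terms, the seeding chain looks roughly like the sketch below (hypothetical names and seed constant, not the actual mcx_utils.c code): a fixed constant seeds the host RNG, and the host RNG then produces one seed per GPU thread, which is copied to the device before the kernel launch.

/* Rough sketch of the seeding chain (hypothetical names, not the actual MCX code):
   a fixed constant seeds the CPU RNG, which then generates one seed per GPU thread. */
#include <stdlib.h>
#include <cuda_runtime.h>

#define FIXED_SEED 29012392u   /* any fixed default constant */

void init_gpu_seeds(unsigned int *d_seeds, int nthread) {
    unsigned int *h_seeds = (unsigned int *)malloc(nthread * sizeof(unsigned int));
    srand(FIXED_SEED);                      /* seed the CPU RNG once */
    for (int i = 0; i < nthread; i++)
        h_seeds[i] = rand();                /* CPU RNG produces each GPU thread's seed */
    cudaMemcpy(d_seeds, h_seeds, nthread * sizeof(unsigned int), cudaMemcpyHostToDevice);
    free(h_seeds);
}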

Nonetheless, I believe this reproducibility is not guaranteed by the CUDA specs (right?), because floating-point addition, used to accumulate the mcx output, is not associative; so it may disappear in the future?
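To illustrate the non-associativity point with a small, non-MCX example: summing the same three floats in a different order can already change the result, so any change in the order in which partial weights are accumulated can perturb the output in the last bits.

#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    /* (a + b) + c evaluates to 1.0f, but a + (b + c) evaluates to 0.0f,
       because b + c rounds back to b in single precision. */
    printf("%g vs %g\n", (a + b) + c, a + (b + c));
    return 0;
}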

I’m able to reproduce the 10x difference now. Essentially all of the difference is in the execution time of the mcx_main_loop kernel, which is ~1.5 s in the fast case and ~20 s in the slow case.

A quick look at the SASS suggests approximately the same instruction count in each kernel (so, for example, no major difference in instruction generation or loop unrolling that could explain 10x). I’ve been poking around a few metrics that could also explain 10x, and it may take some time to figure out which are indicative of the underlying difference. The first few memory metrics I looked at show no difference between the slow and fast cases (e.g. gld_efficiency). Your global load efficiency of ~3% is abysmal, by the way. You have perfectly uncoalesced read access, on a large scale.

Anyway as time permits I will continue to poke around at it, and may file an internal bug at some point. You’re welcome to file a bug also (did you say you had done that already?)

Terrific! I am glad you are able to reproduce it and help us understand this behavior.

Yes, I know. This is a limitation tied to the nature of the Monte Carlo algorithm itself: all photons perform a random walk in a 3D volume, so the reads/writes of the voxels are completely random and uncorrelated.
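Schematically (this is an illustration with hypothetical names, not the actual mcx_main_loop code), the access pattern looks like this: each thread tracks its own photon, so neighboring lanes of a warp compute unrelated voxel indices and their global loads hit different cache lines.

/* Schematic only (hypothetical names): each thread follows its own photon,
   so the voxel index is uncorrelated between lanes of a warp and the global
   load of media[idx] can touch up to 32 different cache lines per warp. */
__global__ void random_walk_sketch(const unsigned char *media,
                                   const float3 *pos, int nx, int ny, int nz,
                                   unsigned char *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float3 p = pos[tid];                     /* per-thread photon position */
    int ix = (int)p.x, iy = (int)p.y, iz = (int)p.z;
    int idx = iz * ny * nx + iy * nx + ix;   /* effectively random per lane */
    out[tid] = media[idx];                   /* uncoalesced read */
}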

Nonetheless, based on the PC sampling profiler, memory dependency accounts for only 3% of the total latency (see the pie chart in https://devtalk.nvidia.com/default/topic/925630/cuda-programming-and-performance/cuda-7-5-on-maxwell-980ti-drops-performance-by-10x-versus-cuda-7-0-and-6-5/post/4841940/#4841940). So, I believe the global memory latency is well hidden by the large number of threads launched and the high arithmetic density of the kernel (it is a compute-bound kernel after all).

Almost identical code but a 10x difference in execution time? Is it possible a loop is involved here that for some reason (maybe small numerical differences somewhere) iterates for a vastly different number of iterations? I’ll take a look at the source code again. [Later:] Looking at the main loop, it seems to be controlled by the results of the extensive and complicated floating-point computations inside the loop. It therefore seems that vastly different iteration counts based on small differences in those intermediate results are at least theoretically possible.

The only alternative hypothesis I can come up with would be a machine-specific code-generation issue involving the Maxwell control words inserted between groups of actual instructions.

I didn’t mean to suggest that it was almost identical code. I simply looked at the SASS and counted instructions; the counts were approximately the same. I haven’t studied the SASS in detail.

There is a big while-loop in the kernel, and of course that got my attention for exactly the reason you describe - it could be the “same code” with wildly different iteration counts based on some non-obvious factor like a floating-point comparison. Anyway, a quick instrumentation of that revealed that the actual trip counts per thread, while not identical between the two versions, were all in the same ballpark across all 256 (blocks) x 64 (threads per block) - around 120,000 +/- 30,000 across all threads. Not enough by itself to explain 10x.
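The instrumentation amounts to the usual pattern, sketched below with hypothetical names and a stand-in loop (not the exact code used): a private counter incremented inside the data-dependent loop and written to a per-thread slot afterwards, so the host can inspect the distribution.

/* Sketch of per-thread trip-count instrumentation (hypothetical names). */
__global__ void count_trips(const float *work, unsigned long long *tripcount) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long iters = 0;
    float budget = work[tid];            /* stand-in for the photon's time/weight budget */
    while (budget > 0.0f) {              /* stand-in for the mcx_main_loop condition */
        budget -= 1.0f;                  /* stand-in for the loop body */
        iters++;
    }
    tripcount[tid] = iters;              /* copy back and histogram on the host */
}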

Here are some additional metrics I have gathered so far. The ones I have marked 10x may be clues:

metric                                    mcx_65-75      mcx_75         ratio
regs                                      69             68
gld_efficiency                            3%             3%
gst_efficiency                            33%            33%
sm_efficiency                             99%            overflow
inst_executed                             4.50E+10       6.90E+11       10x
inst_replay_overhead                      0              0
dram_read_transactions                    3368778        15973566       5x
dram_write_transactions                   139804676      88625718       0.6x
shared_efficiency                         32%            3.50%          10x
while loop counts                         ~120000        ~120000
shared_load_transactions                  2181605528     1.97E+10       10x
shared_load_transactions_per_request      2.04           1.07

I believe the next thing I want to look at is instruction hotspots in nvvp for the kernel.
mcx_65-75 is my version built using CUDA 6.5 nvcc but linked against CUDA 7.5 CUDART, whereas mcx_75 is the CUDA 7.5 version. I did this oddness just for ease of profiling in my setup.

Kudos, that’s some interesting detective work going on in this thread. Maybe this will uncover some bug in the toolkit - or some obscure bug in the implementation.

Some additional metrics I gathered were:

metric                        mcx_65-75      mcx_75         ratio
inst_per_warp                 89686592.8     1348325546     15x
warp_execution_efficiency     46%            3%             15x

The huge difference in these two metrics led me to look for highly variable thread-to-thread behavior, in terms of the number of active threads per warp. The indication from the above is that, in the “slow” case, some warps are executing for long periods of time with only about one thread out of the warp (~3%) active. In the “fast” case, this is less evident, as indicated by the lower average number of instructions executed per warp, as well as the higher average number of active threads per warp (~50%).
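For reference, one way to sample this from inside a kernel is sketched below (illustrative only, with a hypothetical output buffer; __ballot() is the CUDA 7.x-era intrinsic, replaced by __ballot_sync() in later toolkits):

/* Sketch: count how many lanes of the warp are active at this point
   (assumes a 1D block whose size is a multiple of 32). */
__device__ void sample_active_lanes(unsigned long long *active_sum) {
    unsigned int mask = __ballot(1);            /* bit set for each active lane */
    int nactive = __popc(mask);                 /* 1..32 active threads */
    if ((threadIdx.x & 31) == __ffs(mask) - 1)  /* lowest active lane records it */
        atomicAdd(active_sum, (unsigned long long)nactive);
}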

My analysis so far shows that most of the execution time difference is associated with this section of the mcx_main_loop kernel:

if((mediaid==0 && (!gcfg->doreflect || (gcfg->doreflect && n1==gproperty[mediaid].w))) || f.t>gcfg->twin1){
    GPUDEBUG(("direct relaunch at idx=[%d] mediaid=[%d], ref=[%d]\n",idx1d,mediaid,gcfg->doreflect));
    if(launchnewphoton(&p,v,&f,&rv,&prop,&idx1d,&mediaid,&w0,&Lmove,(mediaidold & DET_MASK),ppath,
            &energyloss,&energylaunched,n_det,detectedphoton,t,tnew,photonseed,media,srcpattern,idx,(RandType*)n_seed,seeddata))
        break;
    continue;
}

Each thread runs somewhere around ~120,000 times through the main while-loop in the kernel. Out of those iterations, this if-statement is entered (i.e. the condition is satisfied) about 610 times per thread.

Measuring using clock64(), this block of code uses about 50,000,000 clocks, fairly constant (+/- 20%) across threads, in aggregate across the 120000 iterations, for the CUDA 6.5 case. Most of that clock usage occurs on the 610 times the if-statement is entered (obviously).

In the CUDA 7.5 case, the while loop iterations are approximately the same (~120,000), and the if-block is entered the same 610 times per thread, but the aggregate clock measurement is highly variable, from a low of around 20,000,000 clocks per thread, to a high of around 12,000,000,000 clocks per thread. These excursions to such a high number occur often enough across threads that many, if not most, warps are affected. These excursions are driving the huge difference in the two metrics I referenced earlier.

Presumably the main issue is the behavior of launchnewphoton, which is a fairly involved function. This function is actually called at one other place, once per kernel launch, just prior to the while-loop. Timing for this particular call for the CUDA 6.5 case also varies across threads, from around 700 clocks at the low end up to about 4500 clocks for a few threads at the high end. The majority of threads are at ~1000 clocks +/- 30%. If I divide the aggregate time for the function call in the while loop by the number of times it is entered (610), I get about 65,000 clocks per call, so something is very different even in the CUDA 6.5 case between the cost of this function before the while loop and the cost inside the while loop.

Looking at these same numbers for the CUDA 7.5 case (again, there is huge variability between threads in the same warp): for the initial call, the timing varies across threads from a low of around 300 clocks to a high of almost 3000 clocks. Again, within the while loop the function is called 610 times per thread, and the worst-case aggregate timing divided by 610 yields an average of about 20,000,000 clocks for a single call of the function.

Anyway there is significant variability in the launchnewphoton function. I’m not sure yet if it is a data-dependent loop variation (there are various loops in the function), some kind of compiler bug, or something else.

Here is some “raw” data. c1 is the number of times the if-statement body is executed. t1 is the clock64() timing for the first execution of the launchnewphoton function (prior to the while-loop). t2 is the aggregate of the total time spent in the section of code shown above, across all iterations of the while-loop.
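The instrumentation follows the usual clock64() bracketing pattern, sketched below with hypothetical names and placeholder loop bounds (not the exact code used):

/* Sketch of the clock64() instrumentation (hypothetical names):
   c1 counts how often the if-body runs, t1 times the initial
   launchnewphoton() call, t2 accumulates time spent in the if-body
   across all while-loop iterations. */
__global__ void timing_sketch(unsigned int *c1, long long *t1, long long *t2) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int count = 0;
    long long total = 0;

    long long start = clock64();
    /* ... initial launchnewphoton() call ... */
    t1[tid] = clock64() - start;

    for (int i = 0; i < 120000; i++) {          /* stand-in for the while-loop */
        if (i % 200 == 0) {                     /* stand-in for the if-condition */
            long long s = clock64();
            /* ... launchnewphoton() and the rest of the if-body ... */
            total += clock64() - s;
            count++;
        }
    }
    c1[tid] = count;
    t2[tid] = total;
}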

For the CUDA 6.5 case:

thread: 1420, c1: 611, t1: 2859, t2: 49070096
thread: 1421, c1: 611, t1: 2859, t2: 42560849
thread: 1422, c1: 611, t1: 2859, t2: 50754031
thread: 1423, c1: 611, t1: 2859, t2: 44682728
thread: 1424, c1: 611, t1: 2859, t2: 43948369
thread: 1425, c1: 611, t1: 2859, t2: 41965275
thread: 1426, c1: 611, t1: 2859, t2: 44488303
thread: 1427, c1: 611, t1: 2859, t2: 48930834
thread: 1428, c1: 611, t1: 2859, t2: 44219806
thread: 1429, c1: 611, t1: 2859, t2: 52256397
thread: 1430, c1: 611, t1: 2859, t2: 50421419
thread: 1431, c1: 611, t1: 2859, t2: 49058472
thread: 1432, c1: 611, t1: 2859, t2: 50906550
thread: 1433, c1: 611, t1: 2859, t2: 52490246
thread: 1434, c1: 611, t1: 2859, t2: 43482271
thread: 1435, c1: 611, t1: 2859, t2: 50507306
thread: 1436, c1: 611, t1: 2859, t2: 50133962
thread: 1437, c1: 611, t1: 2859, t2: 49184269
thread: 1438, c1: 611, t1: 2859, t2: 44272766
thread: 1439, c1: 611, t1: 2859, t2: 41139068
thread: 1440, c1: 611, t1: 2600, t2: 53743146
thread: 1441, c1: 611, t1: 2600, t2: 53142962
thread: 1442, c1: 611, t1: 2600, t2: 40498249
thread: 1443, c1: 611, t1: 2600, t2: 47598617
thread: 1444, c1: 611, t1: 2600, t2: 48931713
thread: 1445, c1: 611, t1: 2600, t2: 47855590
thread: 1446, c1: 611, t1: 2600, t2: 51535408
thread: 1447, c1: 611, t1: 2600, t2: 50559446
thread: 1448, c1: 611, t1: 2600, t2: 46043539
thread: 1449, c1: 611, t1: 2600, t2: 47411201
thread: 1450, c1: 611, t1: 2600, t2: 46184805
thread: 1451, c1: 611, t1: 2600, t2: 53590791
thread: 1452, c1: 611, t1: 2600, t2: 43928864
thread: 1453, c1: 611, t1: 2600, t2: 51324045
thread: 1454, c1: 611, t1: 2600, t2: 47044995
thread: 1455, c1: 611, t1: 2600, t2: 39032208
thread: 1456, c1: 611, t1: 2600, t2: 39834598
thread: 1457, c1: 611, t1: 2600, t2: 49439931
thread: 1458, c1: 611, t1: 2600, t2: 45755023
thread: 1459, c1: 611, t1: 2600, t2: 45037481
thread: 1460, c1: 611, t1: 2600, t2: 46636916
thread: 1461, c1: 611, t1: 2600, t2: 48531856
thread: 1462, c1: 611, t1: 2600, t2: 47624660
thread: 1463, c1: 611, t1: 2600, t2: 49262769
thread: 1464, c1: 611, t1: 2600, t2: 50782732

CUDA 7.5 case:

thread: 1420, c1: 611, t1: 2746, t2: 7185675356
thread: 1421, c1: 611, t1: 2746, t2: 3768470065
thread: 1422, c1: 611, t1: 2746, t2: 5976327405
thread: 1423, c1: 611, t1: 2746, t2: 8327436533
thread: 1424, c1: 611, t1: 2746, t2: 11249497893
thread: 1425, c1: 611, t1: 2746, t2: 8007521379
thread: 1426, c1: 611, t1: 2746, t2: 5319583916
thread: 1427, c1: 611, t1: 2746, t2: 4492991330
thread: 1428, c1: 611, t1: 2746, t2: 10530867654
thread: 1429, c1: 611, t1: 2746, t2: 11564183908
thread: 1430, c1: 611, t1: 2746, t2: 6381597042
thread: 1431, c1: 611, t1: 2746, t2: 1806753661
thread: 1432, c1: 611, t1: 2746, t2: 2967708785
thread: 1433, c1: 611, t1: 2746, t2: 2533653114
thread: 1434, c1: 611, t1: 2746, t2: 10199807708
thread: 1435, c1: 611, t1: 2746, t2: 9802474834
thread: 1436, c1: 611, t1: 2746, t2: 4096053944
thread: 1437, c1: 611, t1: 2746, t2: 10869798193
thread: 1438, c1: 611, t1: 2746, t2: 2189946079
thread: 1439, c1: 611, t1: 2746, t2: 5658245992
thread: 1440, c1: 611, t1: 1892, t2: 7462091381
thread: 1441, c1: 611, t1: 1892, t2: 2882920306
thread: 1442, c1: 611, t1: 1892, t2: 1438788196
thread: 1443, c1: 611, t1: 1892, t2: 8975372635
thread: 1444, c1: 611, t1: 1892, t2: 651138124
thread: 1445, c1: 611, t1: 1892, t2: 5991849309
thread: 1446, c1: 611, t1: 1892, t2: 4781668354
thread: 1447, c1: 611, t1: 1892, t2: 5200938966
thread: 1448, c1: 611, t1: 1892, t2: 8272737768
thread: 1449, c1: 611, t1: 1892, t2: 10302483083
thread: 1450, c1: 611, t1: 1892, t2: 7919937159
thread: 1451, c1: 611, t1: 1892, t2: 11364765771
thread: 1452, c1: 611, t1: 1892, t2: 315029580
thread: 1453, c1: 611, t1: 1892, t2: 2165642996
thread: 1454, c1: 611, t1: 1892, t2: 11004052995
thread: 1455, c1: 611, t1: 1892, t2: 10007071767     ** a really high value
thread: 1456, c1: 611, t1: 1892, t2: 14798880        ** a really low value
thread: 1457, c1: 611, t1: 1892, t2: 9342330561
thread: 1458, c1: 611, t1: 1892, t2: 8624928097
thread: 1459, c1: 611, t1: 1892, t2: 4071838715
thread: 1460, c1: 611, t1: 1892, t2: 6361232441
thread: 1461, c1: 611, t1: 1892, t2: 7079097951
thread: 1462, c1: 611, t1: 1892, t2: 3706127561
thread: 1463, c1: 611, t1: 1892, t2: 5604827582
thread: 1464, c1: 611, t1: 1892, t2: 1035806602
thread: 1465, c1: 611, t1: 1892, t2: 1749129249

Great work! If this ultimately turns out to be a worked example of the butterfly effect (with the compiler’s code generation serving as the butterfly), it would be the most extreme example I have seen to date.

I note that the function in question contains a large loop which in turn contains a fairly complex thicket of branches. I wonder whether the difference in the divergent code flows observed during profiling could be due to differences in the placement of merge points (SSY, .S) by PTXAS. In particular, there may be a merge point missing in the CUDA 7.5 built executable that is present in the CUDA 6.5 built executable. The creation of a near-optimal set of merge points is a hard problem with fairly complex code such as this, so bugs could occur.

Thanks, txbob, for the insightful analysis. This narrows things down quite a bit.

launchnewphoton() does look a bit scary at first glance, but the user's source settings (passed through the constant memory variable gcfg->srctype) are already fixed during the kernel run, so the execution path is actually quite straightforward (and divergence-free): in this benchmark, the entire if/elseif/else block is bypassed because the first condition is matched. The do{}while() loop also executes exactly once, because *mediaid=1 (copied from the constant memory variable gcfg->mediaidorig).

I dug into the launchnewphoton() function a little deeper, and found that the culprit is this if-block:

https://github.com/fangq/mcx/blob/master/src/mcx_core.cu#L436-L442

If I change the if-condition from

if((*mediaid & MED_MASK)==0){

to

if(0 && (*mediaid & MED_MASK)==0){

the speed on CUDA 7.5 immediately returns to normal (18,000 p/s).

To verify that this block should not be invoked, I inserted a printf right below the if() condition:

if((*mediaid & MED_MASK)==0){
     printf("mediaid=%d\n",*mediaid & MED_MASK);
     ...
}

When running with the above modification, the simulation speed returns to the slow value (1200 p/s), yet the message is never printed. This suggests that, even though the condition is never satisfied, it somehow influenced the branch predication decisions of CUDA 7.5’s nvcc, which ended up generating redundant instructions (and executing them).

Nice catch. The if-statement in question guards a call to skipvoid(). This function seems much too large and complex to be inlined and fully predicated. The CUDA compiler uses heuristics to determine when full predication helps, and from looking at lots of generated code, for conditional code sections larger than six or seven instructions it typically retains the branch. In this case it would have to predicate the hundreds of instructions that comprise skipvoid(), which seems very unlikely, if not impossible, due to the nested conditionals. I suspect some other mechanism is at play here.

I would strongly suggest filing a bug report with NVIDIA so the CUDA compiler team can properly root cause this.

Yes, I had figured out that the do-loop executes only once per call (~610 calls per thread), and I had connected the 610 number with 10,000,000 photons divided by 256*64 threads.

I had also discovered that timing the launchnewphoton() function from within the function did not match the timing when wrapping the function call. This is a curious result; function inlining doesn’t help when trying to dissect code this way. I’m not really sure how to connect all the dots yet, but your data point about changing the if-condition certainly suggests the compiler is doing something pretty odd.

Anyway your datapoint probably narrows things down enough that the compiler team should be able to chew on it with reasonable efficiency.

I’ll file a bug at some point soon. If you want continued feedback/communication about status, I suggest you file your own bug as well.

That’s very curious. Something must have been added to account for the 10x more instructions executed.

OK, I will file one from my side as well.

Thanks a lot for looking into this. On the other hand, if you see anything we could optimize to make the code more efficient, we’d love to hear it. This project is funded by the NIH, and I am obligated to make the code as efficient as I can.

As stated above, my guess at this point is a problem with placing merge / convergence points (insertion of SSY and .S). This is not visible at the PTX level, it is a low-level mechanism inserted into SASS by PTXAS, and it is basically an optimization.

While the use of merge/convergence points is not required for functional correctness, it is important for performance: it avoids the potential effect of divergence causing a single thread to run all the way to completion before the thread-mask stack is finally popped and other threads get to run. I recall at least one bug related to SSY/.S placement in the past that affected code with a structure similar to the code considered here (nested loops with plenty of conditionals inside). The performance degradation in that case was also dramatic, although not 10x.

It sure would be interesting to get feedback eventually as to what the underlying reason for this issue was.

The internal bug I filed is 1747451

You may wish to reference that bug number in the bug you file.

I would also like to report a related finding regarding the OpenCL version of the code. Maybe I should start a new thread, but let me briefly describe the problem here first.

I wrote an OpenCL version of mcx, called mcxcl. For Maxwell, it has the same issue as the CUDA version - the running speed is much slower than it was before, and even slower than on the Fermi/Kepler cards in my system.

However, we found out that, by turning on a flag (-d 1) from the command line, the mcxcl simulation speed can be improved by 10x.

This is quite puzzling, because when a user sets “-d 1”, mcxcl appends “-D MCX_SAVE_DETECTORS” to the JIT compiler option, see

https://github.com/fangq/mcxcl/blob/1a499869462b72760163d96975a34d48cc966d6f/src/mcx_host.cpp#L358-L359

If you inspect the CL kernel, you will see that defining the MCX_SAVE_DETECTORS macro enables 5 to 6 additional code blocks:

https://github.com/fangq/mcxcl/blob/1a499869462b72760163d96975a34d48cc966d6f/src/mcx_core.cl

This means the CL kernel is more complex, and more computation is needed for the additional photon-detection calculations/storage. So, I expected the code to be slower, and could not imagine it becoming 10x faster!
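For clarity, the mechanism is simply the standard OpenCL JIT build-option pattern, sketched below with simplified, hypothetical names (not the actual mcx_host.cpp code): the macro is only defined in the build options when "-d 1" is given, so the #ifdef MCX_SAVE_DETECTORS blocks in mcx_core.cl appear or vanish at JIT-compile time.

/* Sketch of the JIT build-option mechanism (hypothetical names). */
#include <CL/cl.h>

cl_int build_mcx_kernel(cl_program program, cl_device_id device, int savedet) {
    const char *opts = savedet ? "-D MCX_SAVE_DETECTORS" : "";  /* "-d 1" case vs default */
    return clBuildProgram(program, 1, &device, opts, NULL, NULL);
}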

If you want to test this, here is the test sequence (you can use any version of CUDA, but you need a Maxwell card to reproduce it):

git clone https://github.com/fangq/mcxcl.git 
cd mcxcl
git checkout 1a499869462b72760163d96975a34d48cc966d6f .
cd src
make clean
make                     # compile mcxcl binary
cd ../example/quicktest
./listgpu.sh             # list all available GPUs
./run_qtest.sh           # run code using the 1st GPU (-G 1), use 01 mask string to select GPU

For my GTX 980Ti, the simulation speed is quite low, similar to the CUDA case: around 1400 p/s.

However, if you append “-d 1” to the command in run_qtest.sh, i.e. run this command instead:

../../bin/mcxcl -t 16384 -T 64 -g 10 -n 1e7 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -k ../../src/mcx_core.cl -d 1

On my 980Ti, the speed increases to ~17,000 photon/ms, again similar to the CUDA case.


Just to provide an additional speed reference: I compiled mcxcl on an Ubuntu box running CUDA 6.5 and ran the same benchmark on a 980 (not Ti); I got 22,000 photon/ms with “-d 0” and 16,000 with “-d 1”. This result makes perfect sense. By comparison, I would expect the 980Ti to outperform the 980 by ~10%. So, the 17,000 p/s with -d 1 on the 980Ti looks proportional to the speed on the 980 (16,000); the broken case is “-d 0” on the 980Ti (1400 p/s, likely dropped ~17x from an expected ~24,000 p/s).

Again, it appears that something inside the “#ifdef MCX_SAVE_DETECTORS … #endif” blocks influenced the branch predication, but this time the more complex code seems to help the heuristics generate better code.

txbob, do you want to include this additional finding in your bug report as well, or do you think a separate bug report is more appropriate?

I don’t see enough data to connect the two. The only thing I can surmise is that there is what I call “fragile” code generation going on (in both cases); essentially a repeat of what njuffa has said a couple of times now. I think njuffa has more experience in this area than I do. I’ve seen maybe one or two examples of “fragile” code generation in my experience, and this might be a third case. But that’s merely conjecture. I cannot connect the two cases based on what you’ve said.

I would suggest filing a separate bug for the OpenCL case, and if you wish, reference your CUDA bug for the issue discussed in this thread (or mine). That will provide enough context to connect the two issues if need be. It should also result in less cluttered reports, as the code bases and exact repro steps are separate (although perhaps similar).

Thanks for your patience with us and with this issue. Thanks for taking the time to help unravel it.

While I do have extensive experience diving into the details of SASS code to pinpoint compiler bugs (or else prove that the issue is with the source code), I no longer enjoy the benefit of discussing such issues at length with the CUDA compiler engineers. I’d say much of my knowledge in this area is “dated” at this point, possibly even “outdated”.

If the issue in the MCX code is one of SSY/.S placement (a mere conjecture at this point), it is probably not an issue of “fragile” code generation, just a very hard problem for the compiler to solve, and it may just so happen that adding one more branch triggers a performance cliff in very rare cases.

The placement issue is hard because once the compiler starts placing convergence points (as tightly as possible, to avoid lengthy divergent flows), it also needs to traverse all the possible call graphs to make sure the thread-mask stack comes out correctly on all possible paths to that convergence point. If the code has instances of ‘break’ and ‘continue’ (really ‘goto’ in disguise), that can make it extra hard.

Yes, this code has break and continue in various places, as well as the use of return statements from various points conditionally within a function.

The two cases I can remember previously where I argued with the compiler engineers were just such cases as well.

I am totally open to the idea of adjusting MCX’s coding style so that the compiler heuristics can easily generate highly efficient instructions (of course, the compiler team may also use MCX as a benchmark to improve the robustness of handling large, monolithic, complex kernels). I am also willing to learn techniques that can ease the predication process for the compiler (if I can’t understand them, I am sure Fanny will).

During the early development cycles of this software (circa 2009), I found that choosing between a while-loop construct and a for-loop construct made a huge difference in speed:

https://github.com/fangq/mcx/blob/fc7963a53c7d918de65e484242ffa54ae358a61f/src/mcx_core.cu#L150-L157

but this difference diminished in newer versions of the toolkit. I was under the impression that the code complexity present in MCX was well handled by the compiler, and “fragile code generation” had not been an issue again until this Maxwell/CUDA 7.5 problem showed up.

The first thing I would like to learn from you guys is this: through what mechanism does inefficient code generation negatively impact speed? Does it do so by consuming more registers? By increasing the code size and instruction-fetch overhead?

My second question: is there a fundamental difference between a for-loop and a while-loop in code generation? What about variable-limit for-loops? Can a while-loop be unrolled? How do break/continue influence loop code generation?

My other question is related to JIT. I understand OpenCL uses JIT compilation, and I have heard that part of the nvcc code-generation path can also use JIT. However, JIT compilation happens before the user initializes the constant variables, which may contain crucial information that enables or disables large code blocks (and can thus make a substantial difference to complexity and code generation). So, in either the OpenCL or the CUDA case, does the compiler use the constant memory values at all to simplify code generation? If not, what is the difficulty? Or is there a way we can hint the heuristics?

In general, I advise against massaging code to be more palatable to any particular compiler (CPU or GPU), because the resulting optimization is very fragile: every new version of the compiler will change some of the interaction between the heaps of heuristics inside the code-generating phases, and a previously favored idiom may become disadvantageous. Instead, I advocate writing code in a clear, straightforward manner and reporting any resulting inefficiencies to the compiler vendor.

This isn’t just a theoretical consideration: For example, to get the best performance for the CUDA math library, I would often massage the source code to be more palatable to the compiler, which required a lot of time for dissecting and studying the generated SASS code to come up with the winning combination, as there is no general recipe. Overall, for investigating functional bugs and performance issues (both for in-house and customer code), I would claim (without boasting) that I probably looked at more SASS code in detail than any particular CUDA compiler engineer.

However, my approach to the math library source code required numerous re-writes over the years and made the code difficult to read in places. If someone is desperate for performance (and cannot wait for the compiler to improve), they should by all means look into massaging source code, using inline PTX assembly, or maybe even writing SASS with Scott Gray’s Maxwell assembler. But it is not sound software engineering in my thinking, as the lack of code readability and increased code maintenance have a definite long-term cost.

For general optimization strategies, the CUDA Best Practices Guide is an excellent starting point. In terms of overall code structure, prefer “single-entry, single-exit” constructs. This means avoiding use of ‘break’, ‘continue’, multiple ‘return’, which are all hidden uses of ‘goto’. Such irregular control flow interferes with many compiler optimizations, independent of the platform. One frequent source of execution inefficiencies (could be major or minor) in CUDA is not taking full advantage of CUDA’s extended set of math functions and device intrinsics (e.g. some programmers do not realize they have rsqrt(), sincos(), rhypot() etc. at their disposal).
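For example (a generic illustration, not MCX code), the following device function shows the kind of substitutions meant here:

/* Generic illustration of the math functions/intrinsics mentioned above. */
__device__ float intrinsic_examples(float x, float y) {
    float r1 = rsqrtf(x);          /* instead of 1.0f / sqrtf(x)                       */
    float s, c;
    sincosf(x, &s, &c);            /* instead of separate sinf(x) and cosf(x)          */
    float r2 = rhypotf(x, y);      /* instead of 1.0f / sqrtf(x*x + y*y), and robust
                                      against intermediate overflow/underflow          */
    return r1 + s + c + r2;
}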

As for loops, I am not aware of any particular pros and cons for the three basic loop types, other than that integer-counted for-loops probably make unrolling easier and more likely. I say “probably” because I have not actually researched this; it has never come up in my CUDA performance work.
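As a small generic illustration of that point: an integer-counted loop with a compile-time trip count is the form most likely to respond to #pragma unroll, whereas a data-dependent while-loop generally cannot be unrolled.

/* Generic illustration: a fixed-count for-loop is the easiest form for the
   compiler to unroll fully. */
__device__ float dot8(const float *a, const float *b) {
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++)     /* compile-time trip count -> fully unrolled */
        sum += a[i] * b[i];
    return sum;
}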

As for JIT compilation: CUDA can now JIT compile from source code, and since the beginning of CUDA it has been able to JIT compile from the PTX representation. I usually advise against using this unless dynamic code generation is a crucial technique for a particular use case. My advice is to use offline compilation in such a way that SASS (machine code) for all architectures of interest is embedded in the object file, plus one copy of PTX for the most recent architecture. The latter serves as an insurance policy for future GPU architectures, on which the PTX code can be JIT compiled when such an architecture first arrives.
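Concretely, a build line following this advice might look like the following (the architectures shown are just examples):

# Example build line (architectures are examples only):
# embed SASS for Kepler (sm_35) and Maxwell (sm_52), plus PTX for the most
# recent architecture (compute_52) as a JIT fallback for future GPUs.
nvcc -o mcx mcx_core.cu \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_52,code=compute_52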