I have encountered an optimization level issue with the CUDA compilation trajectory: only the -O0 and -O1 optimization levels produce correctly running code. Using -O2 or the default optimization level (which I’m guessing is -O3) results in a runtime kernel launch failure, i.e.
[ERROR] Sol_MultigridPressure3DDeviceD_relax(16) - CUDA error "unspecified launch failure"
[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(8) - CUDA error "invalid resource handle"
[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 1
[ERROR] Sol_MultigridPressure3DDevice::relax - failed at level 1
I also have to explicitly pass optimization flags to the ptxas and nvopencc compilation phases, in addition to nvcc itself, in order to get correctly running code.
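For illustration, forcing a given level on all three phases looks something like the line below (my_kernels.cu is just a placeholder file name, -arch=sm_13 matches my GTX 260, and -Xopencc / -Xptxas forward options to nvopencc and ptxas respectively):

nvcc -arch=sm_13 -O1 -Xopencc -O1 -Xptxas -O1 -c my_kernels.cu -o my_kernels.o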
I have looked at “The CUDA Compiler Driver - NVCC” manual supplied with the CUDA toolkit, and there is little to no description of how the different optimization levels affect code generation by nvcc, ptxas, and nvopencc.
I would appreciate any help in understanding this issue better, and in finding out whether getting -O2 and higher optimization levels to work would yield any additional performance (-O0 and -O1 give the same performance).
Thanks,
dpe
PS I have heard reports that these launch failures do not occur when running with a Tesla C1060 vs. my GTX 260 Maxcore. I have not experienced this type of runtime failure for any of the examples supplied with the SDK.
Somebody just pointed me at this. If you could post a self-contained repro case, I can send it to the compiler team. -O2 and -O3 shouldn’t break anything; this seems like a bug. However, getting OpenCurrent to build will probably take me a fair amount of time, so a self-contained repro case would be extremely helpful.
I think you should look at the issue report I have linked to in the posting. It would be difficult to provide you with an entirely self-contained repro case, since that would practically require me to repackage and send you the entire OpenCurrent software package. I do agree that this is a runtime problem and a bug in the compilation system, possibly in how the CUDA trajectory components interact with the hardware texture units.
I have found that getting OpenCurrent to build and running the unit tests is fairly easy; please follow the instructions provided with the project. Some modification was required to get all of the unit tests to run on my GTX 260 Maxcore, since it has less device memory than the Tesla C1060-class system the tests were apparently written for. The OpenCurrent project maintainer may be able to provide additional assistance.
I think the biggest difficulty may be getting it to run on the hardware I have (GTX 260 Maxcore OC).
I also believe that NVIDIA could assist developers by better explaining what the various optimization levels do with regard to parallel code generation and register file usage. These devices are inherently parallel, being SIMT-based systems, and programmers who do not understand the intricacies of parallel programming can inadvertently create race conditions and similar bugs. The next-generation Fermi products will allow multiple kernels to execute concurrently, making it that much harder to reason about the correctness of the code programmers produce. See the Fermi White Paper (pg. 18).
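As a purely hypothetical illustration (not code from OpenCurrent) of how easy it is to get this wrong: in a kernel where each thread writes its own shared-memory element and then reads a neighbour's, leaving out the barrier produces a race whose symptoms depend entirely on timing.

// Hypothetical kernel: each thread stages in[gid] into shared memory,
// then reads its neighbour's element. Omitting the __syncthreads()
// barrier creates a race, because the neighbour may not have written
// its value yet when this thread reads it.
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (gid < n)
        s[tid] = in[gid];

    __syncthreads();  // required: without it, the read below races with the writes above

    if (gid < n)
        out[gid] = s[(tid + 1) % blockDim.x];
}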
I agree: a list of compiler optimizations could be very useful in general, since it would take some of the guesswork out of optimizing. If you know the compiler is going to perform a specific optimization, you can spend your optimization time on other parts of the code. In addition, understanding the optimizations the compiler performs allows you to write cleaner code, relying on the compiler to restructure certain constructs for you.
In short, a good knowledge of the inner workings of the optimizer can help a good deal with productivity, since you don’t have to stab in the dark to figure out what the optimizer is actually doing.
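For example, if you knew from such a document that the compiler will fully unroll a small loop with a compile-time trip count (or can be told to with #pragma unroll), you could write the straightforward loop and let the compiler do the restructuring. A sketch, with a made-up stencil kernel just for illustration:

// Sketch: relying on the compiler to unroll a fixed-trip-count loop.
// RADIUS is a compile-time constant, so the loop can be fully unrolled
// and the coefficient accesses resolved at compile time; there is no
// need to write out the unrolled version by hand.
#define RADIUS 4

__constant__ float coeff[2 * RADIUS + 1];

__global__ void stencil_1d(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < RADIUS || i >= n - RADIUS)
        return;

    float sum = 0.0f;
    #pragma unroll
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += coeff[k + RADIUS] * in[i + k];

    out[i] = sum;
}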