I have encountered an optimization level issue with the CUDA compilation trajectory: only the -O0 and -O1 optimization levels produce correctly running code. Using -O2 or the default optimization level (which I’m guessing is -O3) results in a runtime kernel launch failure, i.e.
[ERROR] Sol_MultigridPressure3DDeviceD_relax(16) - CUDA error "unspecified launch failure"
[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(8) - CUDA error "invalid resource handle"
[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 1
[ERROR] Sol_MultigridPressure3DDevice::relax - failed at level 1
I also have to explicitly pass optimization flags to the ptxas and nvopencc compilation phases, in addition to nvcc itself, in order to get correctly running code.
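For illustration, forcing a given level on all three phases looks something like the line below (my_kernels.cu is just a placeholder file name, -arch=sm_13 matches my GTX 260, and -Xopencc / -Xptxas forward options to nvopencc and ptxas respectively):

nvcc -arch=sm_13 -O1 -Xopencc -O1 -Xptxas -O1 -c my_kernels.cu -o my_kernels.o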
I have looked at “The CUDA Compiler Driver - NVCC” manual supplied with the CUDA toolkit, and there is little to no description of how the different optimization levels affect code generation by nvcc, ptxas, and nvopencc.
I would appreciate any help in understanding this issue better, and in finding out whether getting -O2 and higher optimization levels to work would yield any additional performance (-O0 and -O1 give the same performance).
Thanks,
dpe
PS I have heard reports that these launch failures do not occur when running with a Tesla C1060 vs. my GTX 260 Maxcore. I have not experienced this type of runtime failure for any of the examples supplied with the SDK.
Somebody just pointed me at this. If you could post a self-contained repro case, I can send it to the compiler team. -O2 and -O3 shouldn’t break anything; this seems like a bug. However, getting OpenCurrent to build will probably take me a fair amount of time, so a self-contained repro case would be extremely helpful.
I think you should look at the issue report I have linked to in the posting. It would be difficult to provide you with an entirely self-contained repro case, since that would practically require me to repackage and send you the entire OpenCurrent software package. I do agree that this is a runtime problem and a bug in the compilation system, possibly in how the CUDA trajectory components interact with the hardware texture units.
I have found that getting OpenCurrent to build and running the unit tests is fairly easy; please follow the instructions provided with the project. Some modification was required to get all of the unit tests to run on my GTX 260 Maxcore, since it has less device memory than the Tesla C1060-class system the tests were apparently written for. The OpenCurrent project maintainer may be able to provide additional assistance.
I think the biggest difficulty may be getting it to run on the hardware I have (GTX 260 Maxcore OC).
I also believe that NVIDIA could assist developers by better explaining what the various optimization levels do with regard to parallel code generation and register file usage. These devices are inherently parallel, being SIMT-based systems, and programmers who do not understand the intricacies of parallel programming can inadvertently create race conditions and similar bugs. The next-generation Fermi products will allow multiple kernels to execute concurrently, making it that much harder to reason about the correctness of the code programmers produce. See the Fermi White Paper (pg. 18).
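As a purely hypothetical illustration (not code from OpenCurrent) of how easy it is to get this wrong: in a kernel where each thread writes its own shared-memory element and then reads a neighbour's, leaving out the barrier produces a race whose symptoms depend entirely on timing.

// Hypothetical kernel: each thread stages in[gid] into shared memory,
// then reads its neighbour's element. Omitting the __syncthreads()
// barrier creates a race, because the neighbour may not have written
// its value yet when this thread reads it.
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (gid < n)
        s[tid] = in[gid];

    __syncthreads();  // required: without it, the read below races with the writes above

    if (gid < n)
        out[gid] = s[(tid + 1) % blockDim.x];
}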
I agree: a list of compiler optimizations could be very useful in general, since it would take some of the guesswork out of optimizing. If you know the compiler is going to perform a specific optimization, you can spend your optimization time on other parts of the code. In addition, understanding the optimizations the compiler performs allows you to write cleaner code, relying on the compiler to restructure certain constructs for you.
In short, a good knowledge of the inner workings of the optimizer can help a good deal with productivity, since you don’t have to stab in the dark to figure out what the optimizer is actually doing.
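For example, if you knew from such a document that the compiler will fully unroll a small loop with a compile-time trip count (or can be told to with #pragma unroll), you could write the straightforward loop and let the compiler do the restructuring. A sketch, with a made-up stencil kernel just for illustration:

// Sketch: relying on the compiler to unroll a fixed-trip-count loop.
// RADIUS is a compile-time constant, so the loop can be fully unrolled
// and the coefficient accesses resolved at compile time; there is no
// need to write out the unrolled version by hand.
#define RADIUS 4

__constant__ float coeff[2 * RADIUS + 1];

__global__ void stencil_1d(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < RADIUS || i >= n - RADIUS)
        return;

    float sum = 0.0f;
    #pragma unroll
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += coeff[k + RADIUS] * in[i + k];

    out[i] = sum;
}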