I have encountered an optimization-level issue with the CUDA compilation trajectory: my code only runs correctly when built with -O0 or -O1. Building with -O2 or with the default optimization level (which I'm guessing is -O3) results in a runtime kernel launch failure, i.e.
[ERROR] Sol_MultigridPressure3DDeviceD_relax(16) - CUDA error "unspecified launch failure"
[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(8) - CUDA error "invalid resource handle"
[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 1
[ERROR] Sol_MultigridPressure3DDevice::relax - failed at level 1
I also have to explicitly pass optimization flags to both the ptxas and nvopencc compilation components in addition to nvcc in order to have correctly running code.
I have looked at "The CUDA Compiler Driver - NVCC" manual supplied with the CUDA toolkit, and it gives little to no description of how the different optimization levels affect code generation by nvcc, ptxas, and nvopencc.
I would appreciate any assistance in understanding this issue better, and in judging whether getting -O2 and higher optimization levels to work would give any additional performance improvement (-O0 and -O1 result in the same performance).
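In case it helps with diagnosis, here is a minimal sketch of how the first failing launch could be isolated by checking for errors right after each kernel. It is not taken from OpenCurrent: the CHECK_CUDA macro and the scale kernel are hypothetical stand-ins. It also assumes a toolkit that provides cudaDeviceSynchronize (CUDA 4.0 or later); on older toolkits cudaThreadSynchronize plays the same role. My guess is that the "invalid resource handle" message is a consequence of the earlier "unspecified launch failure" rather than a separate bug, and checking per launch would be one way to test that.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call with its location, then abort.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n",             \
                         __FILE__, __LINE__, #call,                    \
                         cudaGetErrorString(err_));                    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Hypothetical stand-in for a solver kernel: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 16;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_data, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    CHECK_CUDA(cudaGetLastError());       // errors in the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_data));
    std::printf("kernel completed without error\n");
    return 0;
}
```

Synchronizing after every launch slows the program down, so this is only meant as a debugging aid while comparing the -O1 and -O2 builds.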
PS I have heard reports that these launch failures do not occur when running with a Tesla C1060 vs. my GTX 260 Maxcore. I have not experienced this type of runtime failure for any of the examples supplied with the SDK.
PPS I am working with the OpenCurrent package.