My project builds fine, and tests/runs just splendidly in the debugger
However, the only way I can get the release build of the very same project to run properly is to add the -G (generate device debug information) flag. This holds regardless of the optimization level I select, and without the -g (generate host debug information) flag
Why would this be?
Without any information about your code or what you mean by “run properly”, the most likely cause is a bug in your code. Here are some possible scenarios (there are others):
(1) Use of -G changes the generated machine code somewhat, possibly by disabling some code transformations. With those changes to the generated code, your code happens to work (for now).
(2) Like (1), but use of -G leads to different resource usage, such that a kernel launch that was failing due to an out-of-resources condition now is able to run.
(3) The use of debug information causes different data to be stored (or the same data in different places), and that data may be picked up when code has out-of-bounds accesses or operates on uninitialized data. Depending on what data is picked up, the code may appear to work, or may fail.
You would want to first do all the normal sanity checks: (a) are all API calls and all kernel launches error checked? (b) run the application under cuda-memcheck, with both the race checker (the racecheck tool) and the out-of-bounds checker (the memcheck tool).
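For check (a), a minimal sketch of what error checking can look like. The macro name CUDA_CHECK, the kernel scale, and the function run_checked_launch are made up for illustration and are not taken from the poster's project. The point to note is that cudaGetLastError() right after a launch only reports launch-time problems (bad configuration, out of resources); errors that occur while the kernel executes are reported by a later synchronizing call such as cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print any non-success status
// and make the enclosing function, which must return bool, return false.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",               \
                    cudaGetErrorName(err_), __FILE__, __LINE__,           \
                    cudaGetErrorString(err_));                            \
            return false;                                                 \
        }                                                                 \
    } while (0)

// Illustrative kernel: multiply each element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Illustrative driver: every API call and the kernel launch are checked.
// For brevity the sketch does not free the buffer on the error paths.
bool run_checked_launch()
{
    const int n = 1024;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // launch-time errors only
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution

    CUDA_CHECK(cudaFree(d));
    return true;
}
```

Calling run_checked_launch() from main and testing its return value is enough to try this out. In a real application the same two checks would follow every kernel launch, at least in a diagnostic build.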
For future reference, is the utility/ effect of the -G flag documented somewhere?
The NVCC manual, http://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf, contains some basic information on the flag as accepted by NVCC:
--device-debug (-G): Generate debug-able device code.
And then for PTXAS (that’s the backend that translates PTX into machine code):
--device-debug (-g): Semantics same as nvcc option --device-debug.
--sp-bound-check (-sp-bound-check): Generate stack-pointer bounds-checking code sequence. This option is turned on automatically when device-debug (-g) or opt-level (-O) 0 is specified.
So this tells us one specific difference (stack-pointer bounds-checking) when -G is used. I would hazard a guess that there are probably many other differences. The sure way to find out is to inspect the generated machine code with cuobjdump --dump-sass, once for the build with -G and once for the build without it.
Running memcheck and racecheck on the project’s release build reveals no errors
Taking a step back: the project has a clear user interface side, and then the algorithm side
I have split the project code in 2 parts by placing the algorithm side in a shared library (with deliberate intent)
Only the algorithm side - the shared library - utilizes the gpu - that is, contains gpu code and apis
When I leave all code in the project - when I do not use a shared library - the project’s release build works just fine, without the need for the -G flag
When I move the very same algorithm section to a shared library - when I make use of the shared library - the project’s debug build works just fine (in the debugger), but the project’s release build only works fine with the -G flag
Subsequently running memcheck and racecheck on the project when using the shared library reveals no errors
Working just fine means the gpu side algorithm kernel and its functions execute as they should, with evidently proper results/ outcomes
I still need to test what happens when I put the code in a static library instead
Just brainstorming here: Differences between statically linked vs dynamically linked code would be expected in the host (CPU) code, not in the CUDA kernels. In particular, x86 code for DLLs/DSOs is typically built as position-independent code or PIC. If you are on Linux, you would see -fPIC passed to the host compiler.
Since it is not clear how exactly things go wrong in the failing case, it seems entirely possible that the CUDA kernels are being passed “bad” data in the failing case. One thing you could try is run valgrind (or an equivalent tool on Windows) to check whether the host code uses uninitialized data or has out-of-bounds accesses.
My memory is very hazy, but I seem to recall that the use of global variables can cause trouble in the context of DSOs/DLLs. I forgot what the error mechanism was: either one can inadvertently wind up with multiple copies of the global variable, or there are issues with initializing the global variable.
One thing you might want to try is log all data sent to GPU kernels to see whether there are any differences between the passing and the failing case. If differences are found, follow them back to the point of origin.
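A rough sketch of such logging, under the assumption that the kernel inputs live in device buffers that can be copied back to the host just before the launch. The helper names fnv1a and log_device_buffer are made up for this example. It prints one checksum (64-bit FNV-1a) per buffer, so the logs of the passing and the failing build can be compared with a plain diff.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// 64-bit FNV-1a hash over a byte range (host side only).
uint64_t fnv1a(const void *data, size_t bytes)
{
    const unsigned char *p = static_cast<const unsigned char *>(data);
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (size_t i = 0; i < bytes; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            // FNV prime
    }
    return h;
}

// Hypothetical helper: copy a device buffer to the host, hash it, and log
// the result under the given label. Returns false if the copy fails.
bool log_device_buffer(const char *label, const void *d_ptr, size_t bytes,
                       uint64_t *hash_out)
{
    std::vector<unsigned char> host(bytes);
    cudaError_t err = cudaMemcpy(host.data(), d_ptr, bytes,
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: copy failed: %s\n", label,
                cudaGetErrorString(err));
        return false;
    }
    *hash_out = fnv1a(host.data(), bytes);
    fprintf(stderr, "%s: %zu bytes, fnv1a=%016llx\n", label, bytes,
            static_cast<unsigned long long>(*hash_out));
    return true;
}
```

Scalar kernel arguments can simply be printed next to the checksums. If a checksum differs between the two builds, dump that buffer in full and trace it back on the host side.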
To be clear, I am not excluding the possibility that there could be an issue with the CUDA software stack here, but based on the information presented so far, I see no direct evidence of that, and it seems much less likely than an issue inside the application.
It seems that I have been wrong: I did not clean the project properly; thus, there is no real difference between the cuda part moved to a shared/static library, and the cuda part contained within the overall project
For the release build of the project to work, the -G flag is required, regardless of the -g flag and the level of optimization, and regardless of whether the cuda part is in a shared/ static library or not
njuffa, you have now mentioned a number of possible causes:
a) erroneous (out of bounds) memory access
b) the actual (different) values/ data passed to the kernel
c) code translation differences when the -G flag is present/ absent
memcheck and racecheck reveal no errors, when the -G flag is set, and the project actually runs
memcheck and racecheck show no errors, up to the point that the project “fails”, when the -G flag is absent
The nvidia x server shows activity on the gpu, when the -G flag is absent
I can track the program’s execution up to the 1st kernel call, when the -G flag is absent
cudaGetLastError() does not return errors
I believe the program hits the 1st kernel call, launches the kernel successfully, but the kernel itself fails to execute as it should (it seems as if it enters an infinite loop, or even ‘undefined behaviour’ as is characteristic of gpus)
In turn, I would think that this is caused by b) and/or c) - wrong data being passed, and/or different code translation
Yet, b) - wrong data - seems implausible; I have meticulously debugged the project in the debugger, and I would want to believe that the debugger/debugging can at least validate the data being passed
Perhaps when a library is present, wrong data may be a stronger possibility, but this is not the case here
If I am pressed for time, would I be severely punished if I leave the -G flag intact for now, until I figure out what the cause is?
I can set optimization at level 3, and the program works, as long as the -G flag is present
I have now tested the wrong data input hypothesis, and the difference in code hypothesis
By first of all short-circuiting the 1st kernel, I can get the program past the 1st kernel
By generating all data within the kernel, such that the kernel is essentially data independent, I still can not get the program past the 1st kernel
This tells me that the problem is likely not data-related
Comparing assembly code with and without the -G flag, shows significant differences in the actual code (this is an understatement really)
I have opted for a design in which the 1st thread of the kernel or kernel block, rather than the host, determines the execution path; this avoids continuously getting on and off the device to consult the host before executing each kernel function
Thus, in many cases the 1st thread controls the kernel execution
The slightest change in execution by the 1st thread can easily throw the complete kernel in disarray
I have now come to the general conclusion that the compiler introduces execution errors through ‘incorrect’ compilation, when the -G flag is not present to restrict it
I am not surprised the code is very different with -G. If you examine machine code generated at full optimization, you will often find the machine instructions for a single line of source code spread out over many discontinuous chunks, the same variable assigned to N different registers, code from multiple inlined functions all mixed together, etc. In order to make the code debuggable, and to allow tracing of variables and matching of execution to source lines, the compiler probably has to dial down most optimizations.
From the description of the code it is impossible to determine whether it might violate the CUDA programming model in some way. It seems you have become convinced this is a code generation issue. If so, consider filing a bug via the form linked from the registered developer website, attaching the smallest self-contained repro program possible.
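Since the kernel has not been posted, the following is only an illustration of one pattern worth checking, not a diagnosis. When one thread decides something for the rest of its block, the other threads need a barrier before they read that decision. If they instead spin on an ordinary (non-volatile) flag, an optimizing compiler may legitimately keep the flag in a register, so the loop never sees the update, while a -G build reloads it from memory every time. The kernel block_controlled and the driver run_block_controlled below are made-up names, and the sketch shows the barrier-based form.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of each block picks the execution path for
// the whole block and publishes it through shared memory.
__global__ void block_controlled(const int *input, int *output, int n)
{
    __shared__ int path;

    if (threadIdx.x == 0) {
        // The first element of a block is in range as long as the grid is
        // sized as ceil(n / blockDim.x).
        path = (input[blockIdx.x * blockDim.x] > 0) ? 1 : 0;
    }
    // Barrier reached by every thread of the block: after it, the value
    // written by thread 0 is visible to all of them. Replacing this with a
    // spin loop on a plain flag is the fragile variant described above.
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) output[i] = path ? 2 * input[i] : -input[i];
}

// Illustrative driver: even blocks get negative input, odd blocks positive,
// and the result is checked against a host-side reference.
bool run_block_controlled()
{
    const int n = 512, threads = 128;
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i)
        h_in[i] = ((i / threads) % 2) ? (i + 1) : -(i + 1);

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    bool ok = cudaMemcpy(d_in, h_in.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_controlled<<<n / threads, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < n; ++i) {
        int first = h_in[(i / threads) * threads];
        int expected = (first > 0) ? 2 * h_in[i] : -h_in[i];
        if (h_out[i] != expected) ok = false;
    }
    return ok;
}
```

Note that __syncthreads() only covers threads of the same block. If the 1st thread of the whole kernel is meant to steer other blocks, a block-level barrier does not help, and splitting the work into separate kernel launches is the usual safe route. A kernel this small, if it reproduced the failure, would also make a good repro for a bug report.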
njuffa, you have been a tremendous help thus far; thanking you
When I sit down and think about this, I am quite perplexed
One argument may be that it is the compiler’s fault, for clearly over-optimizing
Another argument is that this is a programming fault, as the code does not fully accommodate the compiler, and thus may violate programming models, as you point out
What further complicates the matter, is that it is very difficult to understand ‘where the compiler goes wrong’, in order to attempt to correct the issue - the code is lengthy, making it difficult to follow, and there may be multiple instances of ‘compiler wrong-doing’
What I do know, is that the code works to my expectation, with the -G flag present
Honestly, I do not think there is an easy resolution here
If I can be assured that -O0 implies no optimization, I think the way forward would be to declare the current build a version, and to work towards a minor version update with -O0 and without -G