In the CUDA samples and in my minimal test cases I can now step through and debug CUDA kernels fine, and I correctly see the values of variables in the watch window.
In our main projects (which we cannot legally share right now), with more complicated kernels (and even simple ones in the same project), I cannot view the values of variables properly in the legacy debugger watch window, which makes debugging a nightmare when we are trying to track issues down. All we can see is lots of errors in the watch window like:
“_ZN67_INTERNAL_45_tmpxft_00005044_00000000_7_test_cpp1_ii_0ee0ef6d6thrust12placeholders2_3E Could not resolve name”
That one is interesting, as it pops up when breaking into a kernel that is not even using Thrust, although another kernel in the same file does use Thrust. But basically all attempts to type known variable names into the CUDA warp watch window result in “Could not resolve name ‘XXXX’”.
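For what it’s worth, that mangled symbol appears to demangle to an internal-linkage copy of thrust::placeholders::_3 (the `_INTERNAL_45_tmpxft...` prefix is nvcc’s internal namespace for the translation unit). The placeholders are namespace-scope objects, so any .cu file that includes and uses them carries those symbols in its debug info, even in kernels that never touch Thrust. A minimal sketch of how that happens, with hypothetical code (not ours):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// thrust::placeholders defines _1, _2, _3, ... as namespace-scope
// objects, so using them anywhere in a .cu file puts those symbols
// into the file's debug information.
using namespace thrust::placeholders;

void saxpy(thrust::device_vector<float>& x,
           thrust::device_vector<float>& y, float a)
{
    // y = a * x + y, written with placeholder expressions
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), a * _1 + _2);
}
```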
As far as we can tell, the projects have exactly the same build settings as the simple test projects whose watch windows work fine when debugging.
The only difference I can see between our main projects that have this problem and our tests right now is that we use two GPUs (so CUDA creates two contexts) in the main projects, but only one so far in our tests.
We just checked: making our multi-GPU project use a single GPU still shows the same problem, unfortunately. We can’t see any difference that could be causing this problem now.
Even pasting a kernel that we can properly watch in a simple working test project across into these projects shows the same problem. No idea why the tools are failing in this case.
I’ve tried introducing Thrust and CUB usage into simple test projects that work, and they continue to work so far. I will try remaking the troublesome projects’ solution files from scratch when we have more time (but every setting is the same as in the simple test projects that work). I’ve also tried reducing threads per block, with no joy.
Is there anything that can be done to validate the symbols the CUDA compiler creates, to see if that is the cause?
Had a chance to look at this again today. I tried setting relocatable device code to yes, and it’s still the same issue. We haven’t been able to reliably debug CUDA kernels for a very long time now.
I really don’t know how to help you further, so I shall raise a bug for you. Maybe you can wait for Nsight 6.0 and check again; it will be released in the next few weeks.
Since you can’t currently share your projects with us so we can see the specific issue, we don’t know what else we can do for you, and the Watch view works well on our side.
Hopefully our new release, Nsight 6.0 (planned around 9/20/2018), can help you.
Actually, I just caught a new project transitioning from working properly to exhibiting the problem. I rolled back the source file in our source control and it started working again. All that had changed was one single .cu file, where:
~5 more device and host functions were added
~3 more kernels were added
No header includes were changed
Some of the functions added were template ones used in some of the kernels
More kernel launches were added
No using namespaces were changed
More comments were added
Kernel launch parameters were changed to process much larger arrays (but I tried changing the newer launches to use the older, smaller sizes and still got the same problem)
So it seems it’s nothing to do with project settings, but is specific to the source code in the file. Unfortunately, yet again, this project uses a load of code I am not at liberty to share right now. Previously I had tried to recreate the same problem in a simple test without those libraries, with no joy so far; but since that single source-file change causes the problem, I don’t think it’s specific to those libraries.
Interestingly, when I rename some of the kernels, the errors in the watch window change from:
“Condition(false) in method: Void TypeCheckObjectName(Nvda.CppExpressions.FrontEnd.CppParseArguments)”
to
“Could not resolve name ‘xxxx’”
I renamed some of the kernels as a test just now, as I noticed that some of the device functions they call have, in some cases, the same name but different function signatures (so in C++ terms they are valid overloads).
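To illustrate the pattern (with made-up names, not our actual code), the kernels were calling overloaded device functions like:

```cuda
// Two valid C++ overloads: same name, different signatures.
__device__ float interpolate(float a, float b, float t)
{
    return a + t * (b - a);
}

__device__ float interpolate(const float* samples, int count, float t)
{
    int i = static_cast<int>(t * (count - 1));
    return samples[i];
}

__global__ void resampleKernel(float* out, const float* in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = interpolate(in[idx], in[(idx + 1) % n], 0.5f);  // first overload
}
```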
This is a very simplified version of the first kernel launched in a sequence of kernels in the file change I isolated. If I breakpoint into this kernel on any line, I cannot watch the variables as described, nor in any of the subsequent kernels. As far as I can tell it executes and modifies the array of values correctly, though. Enabling memory checking from Nsight flags no errors. Even if it were stomping over memory, catching it at a breakpoint at the start should allow me to watch variables until it does.
Interestingly, if I comment out the line with the block store, I can then breakpoint into this kernel and subsequent kernels and watch variables without issue.
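For context, the kernel’s shape was roughly as follows: a hand-simplified sketch with hypothetical names, using CUB’s BlockLoad/BlockStore as a stand-in for our actual code. The BlockStore call at the end is the kind of line that, when commented out, lets the watch window work again:

```cuda
#include <cub/cub.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void scaleKernel(float* d_out, const float* d_in)
{
    using BlockLoad  = cub::BlockLoad<float, BLOCK_THREADS, ITEMS_PER_THREAD>;
    using BlockStore = cub::BlockStore<float, BLOCK_THREADS, ITEMS_PER_THREAD>;

    // Shared memory for CUB's collective operations.
    __shared__ union {
        typename BlockLoad::TempStorage  load;
        typename BlockStore::TempStorage store;
    } temp_storage;

    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;
    float items[ITEMS_PER_THREAD];

    BlockLoad(temp_storage.load).Load(d_in + block_offset, items);
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        items[i] *= 2.0f;
    __syncthreads();

    // Commenting out this block store is what restores the watch window.
    BlockStore(temp_storage.store).Store(d_out + block_offset, items);
}
```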
If I manage to find the time to get a full working extracted isolated example that I can share I will, but it’s a question of finding the time to track this down further, put in the work required and luck as I am not totally convinced of the cause yet.
OK, I extracted that exact same kernel into a separate test project, and it now works: I can step through and watch it fine (hence the problem of trying to create an isolated repro for you). So I have become more convinced that:
It’s most likely not a bug in our code, or a library such as CUB
It’s not a memory overwrite error, as catching it in the debugger before any memory overwrites happen shows the same problem in the projects where it occurs, and the memory checker shows no errors
There is some sort of bug in Nsight, or in the process that generates the debug information it uses
It seems that whenever a CUDA project adds multiple kernels or additional device functions to a file, this bug can appear. But what exact arcane magic summons it, I have no idea!
As you have seen, I took one kernel that exhibited the problem; when it is commented out in one project, all the kernels in that project can be watched properly again. That would indicate the kernel itself has some problem, except there seems to be nothing wrong with it, and when it is moved to a separate test project it has, as expected, no problems at all :-/
This is really odd. In my effort to shrink it down to a minimal case that shows the error, I found that deleting one specific device template function that is not even used fixes it. But only if I completely delete that function’s source from the file; if I instead #ifdef the function out, the bug remains.
This seems to indicate a bug in the CUDA build process that parses the source early on I guess?
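Concretely, the situation was this (the function body here is made up; ours differed): deleting the lines entirely fixes the watch window, while preprocessing them away does not:

```cuda
// Unused device template function -- never called by any kernel.
// Deleting these lines from the file fixes the watch window;
// merely compiling them out with the guard below leaves the bug in place.
#ifdef NEVER_DEFINED   // stand-in for our actual #ifdef guard
template <typename T>
__device__ T clampToRange(T v, T lo, T hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
#endif
```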
I can’t devote any more time to this. As far as I can tell, watch functionality is unreliably broken depending on whatever random edits are made to the source code, including editing comments and renaming functions and variables.
Trying to recreate it in simple test projects has failed, as it is completely random when it shows up. Trying to simplify our problem projects down to simple test cases has also failed, as it is completely random when it goes away depending on source-code edits, making the time it would take to isolate months, if not longer.
We will just have to assume watch functionality is not going to work anytime soon.
I tried updating to the new driver I saw today, 411.70, and it’s still the same issue. I didn’t realise the next-gen debugger was now supposed to support WDDM on Pascal (I had tried to run it previously), but when I ran it with the new driver I get:
“Could not initialize driver for debugging. Debugging has been automatically stopped. Please see output window for details.”
then:
“Attaching the Nsight VSE Debugger debugger to process failed. Operation not supported. Unknown error: 0x80004005”
The output window says:
“Could not initialize driver for debugging.
Debugging has been automatically stopped.”
I have begun to wonder if the insanity of text encoding in VS2017 may also be causing problems for Nsight in lots of ways, including the watch. I’ve yet to go through and re-save all our source code back to UTF-8, as I have no idea how long this has been silently happening.
It is an old post, but I’m getting the same issue as the original poster: I cannot properly watch variables in CUDA kernels.
I started having this issue after adding Thrust, CUB and some template code in my CUDA source file.
(Legacy debugger on GeForce 940MX, CUDA 10.1 Update 1, driver 425.25, VS 2017 and 2019.)