Still cannot properly watch variables in CUDA kernels

In the CUDA samples and my minimal test cases I can now step through and debug CUDA kernels fine, and I correctly see the values of variables in the watch window.

In our main projects (which we cannot legally share right now), with more complicated kernels (and even simple ones in the same project), I cannot view the values of variables properly in the legacy debugger watch window, which makes debugging a nightmare when we are trying to track issues down. All we can see is lots of errors in the watch window like:

“_ZN67_INTERNAL_45_tmpxft_00005044_00000000_7_test_cpp1_ii_0ee0ef6d6thrust12placeholders2_3E Could not resolve name”

That one is interesting, as it pops up when breaking into a kernel that is not even using Thrust, although another kernel in the same file does use Thrust. Basically, all attempts to type known variable names into the CUDA warp watch window result in “Could not resolve name ‘XXXX’”.

As far as we can tell, the projects have exactly the same build settings as the simple test projects whose watch windows work fine when debugging.

The only difference I can see between our main projects that have this problem and our tests so far is that we use two GPUs (so CUDA is creating two contexts) in the main projects but only one GPU in the tests.

Any ideas about what could be causing this?

Win7 64bit, GTX1080Ti & GTX1080, VS2017 15.8.1 (And earlier), NSight 5.6, Driver 398.82

Relevant CUDA Debug Build settings (Win64 AVX build):

Generate Relocatable Device Code: No
NVCC Compilation Type: Generate hybrid object file (--compile)
CUDA Runtime: Static CUDA runtime library (-cudart static)
Target Machine Platform: (--machine 64)
Interleave source in PTX: No
Code Generation: compute_61,sm_61
Generate GPU Debug Information: Yes (-G)
Generate Line Number Information: No
Max Used Register: 0
Verbose PTXAS Output: No
Perform Device Link: Yes (-dlink)
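
For reference (my own reconstruction rather than a copy of the build log), I believe those settings boil down to an nvcc invocation roughly like the following, where the file names are just placeholders:

nvcc --compile -G --machine 64 -cudart static -gencode=arch=compute_61,code=sm_61 -o test.cu.obj test.cu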

And we just checked: making our multi-GPU project use only one GPU still gives the same problem, unfortunately. We can’t see any difference that could be causing this problem now.

Even pasting a kernel that we can properly watch in a simple working test project across into these projects shows the same problem. No idea why the tools are failing in this case.

Not sure how to help you; this is probably caused by the CUDA compiler not generating the symbol name correctly.

I’ve tried introducing Thrust and CUB usage into simple test projects that work, and they continue to work so far. I will try to remake the troublesome projects’ solution files from scratch when we have more time (but every setting is the same as in the simple test projects that work). I’ve also tried reducing threads per block, with no joy.

Is there anything that can be done to validate the symbols the CUDA compiler creates, to see if that is the cause?
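
(The closest thing I can think of myself is dumping the device symbols embedded in the binary with cuobjdump, e.g. something like

cuobjdump --dump-elf-symbols OurApp.exe

where OurApp.exe is just a placeholder here, though I don’t know whether that inspects the same information the debugger’s expression engine uses.)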

Can you try to set Relocatable Device Code to Yes and recompile your code?

Had a chance to look at this again today. Tried setting Relocatable Device Code to Yes and still the same issue. We haven’t been able to reliably debug CUDA kernels for a very long time now.

I really don’t know how to help you now. I shall raise a bug for you. Maybe you can wait for Nsight 6.0 and check again; it will be released in the next few weeks.

Hi cybernoid,

Since you can’t currently share your projects with us so that we can see the specific issue, we are not sure what else we can do, and the Watch view works well on our side.
Hope our new release, Nsight 6.0 (planned around 9/20/2018), can help you.

I’ve just tried Nsight 6.0 and tried a build against CUDA 10 with driver 411.63 and I still have the same problem unfortunately.

The only difference is I now see these errors in the watch window:

“Condition(false) in method: Void TypeCheckObjectName(Nvda.CppExpressions.FrontEnd.CppParseArguments)”

Actually I just caught it with a new project transitioning from working properly to exhibiting the problem. I rolled back the source file in our source control and it started working again. All that had changed was one single .cu file, where:

  1. ~5 more device and host functions were added
  2. ~3 more kernels were added
  3. No header includes were changed
  4. Some of the functions added were template ones used in some of the kernels
  5. More kernel launches were added
  6. No using namespaces were changed
  7. More comments were added
  8. Kernel launch parameters were changed to process much larger arrays (but I tried changing the newer launches to use the older, smaller sizes and still got the same problem)

So it seems it’s nothing to do with project settings, but is specific to the source code in the file. Unfortunately, yet again, this project uses a load of code I am not at liberty to share right now. Previously I had tried to recreate the same problem in a simple test without those libraries, but no joy so far; given that this single source file change causes the problem, though, I don’t think it’s specific to those libraries.

Interestingly, when I rename some of the kernels, the errors in the watch window change from:

“Condition(false) in method: Void TypeCheckObjectName(Nvda.CppExpressions.FrontEnd.CppParseArguments)”

to

“Could not resolve name ‘xxxx’”

I renamed some of the kernels as a test just now because I noticed that some of the device functions they call have, in some cases, the same name but different function signatures (so in C++ terms they are valid overloads).

I’ve maybe managed to isolate it a bit more given the following simplified kernel:

#include <cstdint>
#include <cub/cub.cuh>

template<uint32_t _ThreadItemCount, uint32_t _BlockThreadCount>
__global__ void Test(uint32_t* __restrict__ values, uint32_t valueCount)
{
	typedef cub::BlockStore<uint32_t, _BlockThreadCount, _ThreadItemCount, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
	typedef cub::BlockLoad<uint32_t, _BlockThreadCount, _ThreadItemCount, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;


	//	Shared temporary storage reused by the load and store collectives
	__shared__ union
	{
		typename BlockStore::TempStorage store;
		typename BlockLoad::TempStorage load;
	}
	shared;


	uint32_t blockItemCount = _ThreadItemCount * _BlockThreadCount;
	uint32_t priorBlocksItemCount = blockIdx.x * blockItemCount;

	uint32_t items[_ThreadItemCount];


	//	Adjust segment range depending on block
	values += priorBlocksItemCount;
	valueCount = min(valueCount - priorBlocksItemCount, blockItemCount);


	BlockLoad(shared.load).Load(values, items, valueCount);

	//	Increment each item held by this thread
	#pragma unroll
	for (uint32_t itemIndex = 0; itemIndex < _ThreadItemCount; ++itemIndex)
		++items[itemIndex];

	//	The shared union is reused, so synchronize before the store
	__syncthreads();
	BlockStore(shared.store).Store(values, items, valueCount);
}

ThreadItemCount = 7
BlockThreadCount = 128
valueCount = 32 * 1024 * 1024 * 2 = 67 108 864
Launch Test<ThreadItemCount, BlockThreadCount>: thread count 9 586 981, block count 74 899, block thread count 128
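
For clarity, that configuration comes out roughly as follows (a host-side sketch; d_values and the variable names are placeholders, not our real code):

constexpr uint32_t ThreadItemCount  = 7;
constexpr uint32_t BlockThreadCount = 128;
uint32_t valueCount  = 32u * 1024u * 1024u * 2u;                                // 67 108 864
uint32_t threadCount = (valueCount + ThreadItemCount - 1) / ThreadItemCount;    // 9 586 981
uint32_t blockCount  = (threadCount + BlockThreadCount - 1) / BlockThreadCount; // 74 899
Test<ThreadItemCount, BlockThreadCount><<<blockCount, BlockThreadCount>>>(d_values, valueCount);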

This is a very simplified version of the first kernel launched in a sequence of kernels in the file change I isolated. If I breakpoint into the kernel on any line I cannot watch the variables, as described, nor can I in any of the subsequent kernels. As far as I can tell it executes and modifies the array of values correctly, though. Enabling memory checking from Nsight flags no errors. Even if it were stomping over memory, catching it at a breakpoint at the start should allow me to watch variables until it does.

Interestingly if I comment out the line with the block store I can then breakpoint into this kernel and subsequent kernels and watch variables without issue.

If I manage to find the time to produce a fully working, isolated example that I can share, I will, but it’s a question of finding the time to track this down further, putting in the work required, and luck, as I am not totally convinced of the cause yet.

OK, I extracted that exact same kernel stated above into a separate test project and it now works: I can step through and watch it fine (hence the problem of trying to create an isolated repro for you). So I have become more convinced that:

  1. It’s most likely not a bug in our code, or a library such as CUB
  2. It’s not a memory overwrite error as catching it in the debugger before any memory overwrites has the same problem in projects where it happens, and the memory checker shows no errors
  3. You have some sort of bug in Nsight, or in the generation process that creates the information it uses

It seems like whenever a CUDA project adds multiple kernels or additional device functions in a file this bug appears. But what exact arcane magic summons it I have no idea!

As you have seen, I have taken one kernel that exhibited the problem; commenting it out in one project fixes it, so all the kernels in that project can be watched properly. That would indicate that the kernel has some problem - except there seems to be nothing wrong with the kernel, and when it is moved to a separate test project that kernel has, as expected, no problems at all :-/

This is really odd. In my effort to shrink it down to a minimal case that shows the error, I found that deleting one specific device template function that is not even used fixes it. But only if I completely delete that function’s source from the file; if I instead #ifdef the function out, the bug remains.

This seems to indicate a bug in the CUDA build process that parses the source early on I guess?
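
To illustrate what I mean (UnusedHelper and SOME_UNDEFINED_MACRO are made-up names, not our real code), the file contains an unused device template function along these lines:

template<typename T>
__device__ T UnusedHelper(T a, T b)
{
	return a + b;
}

Deleting that definition from the .cu file entirely makes the watch window work again, but leaving the identical lines in the file wrapped like this keeps it broken:

#ifdef SOME_UNDEFINED_MACRO
template<typename T>
__device__ T UnusedHelper(T a, T b)
{
	return a + b;
}
#endif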

I can’t devote any more time to this. As far as I can tell, watch functionality is unreliably broken depending on whatever random edits are made to the source code, including editing comments and renaming functions and variables.

Trying to recreate it in simple test projects has failed, as it is completely random when it shows up. Trying to simplify our problem projects down to simple test cases has failed, as it is completely random when it goes away depending on source code edits, making the time it would take to isolate months if not longer.

We will just have to assume watch functionality is not going to work anytime soon.

The Next-Gen CUDA debugger should also support your GTX 1080 & 1080 Ti (in both WDDM and TCC mode).

Can you try the Next-Gen debugger? It uses a different expression engine and you may have better luck.
If not, please post your results.

Thanks

I tried updating to the new driver I saw today, which is 411.70, and still the same issue. I didn’t realise the Next-Gen debugger was now supposed to support WDDM on Pascal (as I had previously tried to run it), but I just tried to run it with the new driver and I get:

“Could not initialize driver for debugging. Debugging has been automatically stopped. Please see output window for details.”

then:

“Attaching the Nsight VSE Debugger debugger to process failed. Operation not supported. Unknown error: 0x80004005”

The output window says:

"Could not initialize driver for debugging.
Debugging has been automatically stopped.

Please see: https://developer.nvidia.com/nsight-visual-studio-edition-supported-gpus-full-list#SupportedComputeConfigs"

I also tried building with both CUDA 9.2 and 10.0.

Win7 64bit, GTX1080Ti & GTX1080, VS2017 15.8.5, NSight 6.0, Driver 411.70

While playing around with Nsight Compute I came across this problem:

https://devtalk.nvidia.com/default/topic/1042285/nsight-compute-/possible-problem-viewing-source-with-vs2017-text-encoding/

https://visualstudio.uservoice.com/forums/121579-visual-studio-ide/suggestions/33917956-do-not-create-utf16-source-code-files-it-is-not-c

I have begun to wonder if the insanity of text encoding in VS2017 may also be causing problems for Nsight in lots of ways, including the watch. I’ve yet to go through and re-save all our source code back to UTF-8, as I have no idea how long this has been silently happening.

Sorry, you clearly showed that you’re using Win7. The Next-Gen debugger was designed for Win10 RS3 or later.

Is it possible to send us the simplified kernel source file and the executable built from it, so that we can reproduce, debug, and fix this issue?

It is an old post, but I’m getting the same issue as the original poster: cannot properly watch variables in CUDA kernels.
I started having this issue after adding Thrust, CUB and some template code in my CUDA source file.
(Legacy debugger on GeForce 940MX, CUDA 10.1 Update 1, driver 425.25, VS 2017 and 2019.)

Actually I pinned this down to one change:

  • adding: #include <cub/cub.cuh>
  • adding a BlockReduceT call in one kernel,
  • adding very simple supporting code and variables (a for loop, a pointer parameter to the kernel, and an atomicAdd()); a rough sketch of this pattern is below.
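
A minimal sketch of roughly that pattern (reconstructed from memory; SumKernel and the other names here are not my exact code):

#include <cstdint>
#include <cub/cub.cuh>

template<uint32_t _BlockThreadCount>
__global__ void SumKernel(const uint32_t* __restrict__ values, uint32_t valueCount, uint32_t* __restrict__ total)
{
	typedef cub::BlockReduce<uint32_t, _BlockThreadCount> BlockReduceT;
	__shared__ typename BlockReduceT::TempStorage tempStorage;

	//	Each thread accumulates a grid-strided partial sum
	uint32_t partial = 0;
	for (uint32_t i = blockIdx.x * _BlockThreadCount + threadIdx.x; i < valueCount; i += gridDim.x * _BlockThreadCount)
		partial += values[i];

	//	Block-wide reduction; the result is only valid in thread 0
	uint32_t blockSum = BlockReduceT(tempStorage).Sum(partial);
	if (threadIdx.x == 0)
		atomicAdd(total, blockSum);
}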

I will try to generate a small repro case, but as original poster explained - it might not be possible.