I need help debugging a CUDA hard lock

Windows 10, CUDA 10.1, VS2017-19, v140 toolset

It appears to be a race condition.

I have CUDA code that runs for 5-30 seconds in debug mode and then hard locks the PC. Win+Shift+Ctrl+B does not get me out, and neither does anything else. I either do a hard reset or wait for the watchdog, which reboots the machine anyway. (At least I turned off the default "write full memory dump" setting, sheesh.)

When I run under the Nsight extension so I can break in the CUDA code, it runs a little slower and never fails.
If I turn on the memory checker, it locks up much, much sooner.
When I follow the kernel launch with a last-error check and a sync, I can get an illegal address error out of it.
The kernel writes the decoded pixels to a BGRA surface that is connected to a texture and to a persistent cudaArray.
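For anyone trying to reproduce this, the check after the launch is roughly the following. This is a minimal sketch, not my actual code: fillKernel and launchAndCheck are placeholder names, and the kernel writes to a plain device buffer instead of the surface. cudaGetLastError reports launch and configuration errors, while the synchronize call is where an error raised during execution (such as the illegal address) shows up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real pixel-writing kernel.
__global__ void fillKernel(unsigned int *dst, int width, int height, unsigned int value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)  // skip threads in the partial blocks at the edges
        dst[y * width + x] = value;
}

// Launch the kernel, then check the launch error and the execution error separately.
cudaError_t launchAndCheck(unsigned int *dst, int width, int height, cudaStream_t stream)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fillKernel<<<grid, block, 0, stream>>>(dst, width, height, 0xFF000000u);

    cudaError_t err = cudaGetLastError();  // bad launch configuration, etc.
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    err = cudaStreamSynchronize(stream);   // errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Synchronizing after every launch serializes the stream, so this is a debugging aid and it may well hide the timing problem it is meant to catch.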

I think this may be tied to the Direct3D NvDecoder example I use as the front end. It uses the driver API and pushes and pops the context, and I am using that same context in my CUDA code. Otherwise the runtime initializes the primary context, I end up with two contexts, and all streams are flattened into one.
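The push/pop pattern can be wrapped in a small scope guard, sketched below. ScopedCudaContext is a hypothetical name, not something from the NvDecoder sample, and the sketch assumes the runtime API uses whatever driver context is current on the calling thread rather than creating its own.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

// Makes a driver-API context current for the lifetime of the object.
// Hypothetical helper, not part of the NvDecoder sample.
class ScopedCudaContext {
public:
    explicit ScopedCudaContext(CUcontext ctx)
        : pushed_(cuCtxPushCurrent(ctx) == CUDA_SUCCESS) {}

    ~ScopedCudaContext()
    {
        if (pushed_) {
            CUcontext popped = nullptr;
            cuCtxPopCurrent(&popped);  // restore whatever was current before
        }
    }

    bool ok() const { return pushed_; }

    // Non-copyable: each push must be matched by exactly one pop.
    ScopedCudaContext(const ScopedCudaContext &) = delete;
    ScopedCudaContext &operator=(const ScopedCudaContext &) = delete;

private:
    bool pushed_;
};
```

With the guard in scope (ScopedCudaContext guard(ctx);), runtime calls and kernel launches on that thread should run in ctx, and the pop happens on every exit path, including early returns.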

I cannot find an example that both integrates with Direct3D and runs CUDA kernels. The sample that mixes the driver API and the runtime API writes to a file and does not try to display anything.

Is the NVDECODE API compatible with the runtime API? Is there an example of decoding video using the runtime API? The examples seem to be from different eras.

The same kernel has run for half an hour in Nsight debug, and for a long time under an Nsight Systems profile on a video loop, without issue at the resulting slower frame rate. But when it hits 30 fps: kaboom.

A few crazy thoughts:
The D3DFramePresenter class has a crazy amount of context pushing and popping, but the OpenGL presenter does not have any. Could this be a D3D integration issue? I need a display during development and for monitoring, but OpenGL will be needed on Linux anyway, so maybe I can sidestep the problem by avoiding D3D altogether.

The pipeline should be NVDECODE -> CUDA -> sync -> display.
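A rough sketch of the CUDA -> sync -> display part of that pipeline using the runtime graphics interop calls, under stated assumptions: the display texture is already registered as a cudaGraphicsResource_t with cudaGraphicsRegisterFlagsSurfaceLoadStore, and writeBgra, renderToArray and renderFrame are placeholder names, not code from the NvDecoder sample. The resource is mapped before the kernel writes and unmapped before the graphics API presents it.

```cuda
#include <cuda_runtime.h>

// Placeholder for the real conversion kernel: writes one BGRA pixel per thread.
__global__ void writeBgra(cudaSurfaceObject_t surf, int width, int height, uchar4 pixel)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // stay inside the surface
    surf2Dwrite(pixel, surf, x * (int)sizeof(uchar4), y);  // x offset is in bytes
}

// Wrap the array in a surface object, run the kernel, and wait for it to finish.
cudaError_t renderToArray(cudaArray_t array, int width, int height, cudaStream_t stream)
{
    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;

    cudaSurfaceObject_t surf = 0;
    cudaError_t err = cudaCreateSurfaceObject(&surf, &desc);
    if (err != cudaSuccess) return err;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    writeBgra<<<grid, block, 0, stream>>>(surf, width, height, make_uchar4(0, 0, 255, 255));

    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);  // the "sync" step
    cudaDestroySurfaceObject(surf);
    return err;
}

// One frame: map the registered texture, write into it, unmap before display.
cudaError_t renderFrame(cudaGraphicsResource_t resource, int width, int height,
                        cudaStream_t stream)
{
    cudaError_t err = cudaGraphicsMapResources(1, &resource, stream);
    if (err != cudaSuccess) return err;

    cudaArray_t array = nullptr;
    err = cudaGraphicsSubResourceGetMappedArray(&array, resource, 0, 0);
    if (err == cudaSuccess) err = renderToArray(array, width, height, stream);

    cudaError_t unmapErr = cudaGraphicsUnmapResources(1, &resource, stream);
    return err != cudaSuccess ? err : unmapErr;
}
```

The sketch fetches the mapped array again on every frame instead of holding on to it between frames, and it leaves out the decode step and the present call.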