Kernel redundancy


Suppose I have a mission-critical (or simply critical or resource-intensive) application, implemented on a device.
Processing itself takes a significant amount of time, such that kernel redundancy becomes relevant.
The algorithm itself is relatively complex, such that redundancy cannot be guaranteed by debugging alone (there are simply too many possible execution paths, and at any point an execution path not yet stepped through in the debugger can be triggered).

If one then defines kernel redundancy as monitoring the following:

  • outcomes of kernel functions
  • average time to execute kernel functions
  • sequence in which kernel functions are executed
  • position of threads relative to other threads of the same block

The first three points are relatively easy to monitor and implement.
But many times when I find a kernel to be unstable, it manifests as the last point instead.
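
To make the easier points concrete, here is a minimal host-side sketch that checks the outcome and the execution time of a single launch. The kernel `work`, the budget `kBudgetMs` and the helper `withinBudget` are hypothetical names made up for illustration; a real monitor would presumably compare against a measured running average rather than a fixed constant.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical time budget; an assumption, not a measured value.
constexpr float kBudgetMs = 50.0f;

// Hypothetical kernel standing in for the real workload.
__global__ void work(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Returns true when the measured time does not exceed the budget.
bool withinBudget(float ms, float budgetMs) { return ms <= budgetMs; }

int main() {
    const int n = 1 << 20;
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaError_t launchErr = cudaGetLastError();       // e.g. invalid launch configuration
    cudaEventRecord(stop);
    cudaError_t runErr = cudaEventSynchronize(stop);  // errors raised while the kernel ran

    float ms = 0.0f;
    if (launchErr == cudaSuccess && runErr == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    printf("launch: %s, run: %s, time: %.3f ms, within budget: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(runErr), ms,
           withinBudget(ms, kBudgetMs) ? "yes" : "no");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

The sequence of kernels is not checked here; within a single stream the launches execute in issue order, so logging the launches on the host is probably enough for that point.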

I suppose one thread per warp block can report its position and note that of other warp blocks at designated points, and on a marked divergence cause the entire warp block to terminate via assert().
But there is no guarantee that warp blocks would always reach such evaluation points when in trouble.
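
A rough sketch of that checkpoint idea follows, reading "warp block" as a CUDA thread block (that reading is my assumption). The names `checkpoint`, `lagExceeded`, `kMaxLag` and `kNotStarted` are hypothetical. The sketch also assumes that all blocks of the grid are resident at the same time; otherwise a block that starts late would look like a straggler and trigger a false alarm.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kNotStarted = -1;  // hypothetical marker: block has not reported yet
constexpr int kMaxLag = 8;       // hypothetical tolerance, in checkpoints

// True when 'other' has reported and trails 'mine' by more than maxLag.
__host__ __device__ inline bool lagExceeded(int mine, int other, int maxLag) {
    return other != kNotStarted && mine - other > maxLag;
}

// Thread 0 of each block publishes its step and inspects the other blocks.
__device__ void checkpoint(volatile int *progress, int step) {
    if (threadIdx.x == 0) {
        progress[blockIdx.x] = step;
        __threadfence();  // make the write visible to other blocks
        for (unsigned b = 0; b < gridDim.x; ++b)
            assert(!lagExceeded(step, progress[b], kMaxLag));
    }
    __syncthreads();
}

__global__ void worker(volatile int *progress, int steps) {
    for (int s = 0; s < steps; ++s) {
        // Placeholder for the real work of step s: spin for a while.
        long long t0 = clock64();
        while (clock64() - t0 < 200000) { }
        checkpoint(progress, s);
    }
}

int main() {
    const int blocks = 4, threads = 64, steps = 32;
    int *d_progress = nullptr;
    if (cudaMalloc(&d_progress, blocks * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_progress, 0xFF, blocks * sizeof(int));  // every entry becomes -1

    worker<<<blocks, threads>>>(d_progress, steps);
    cudaError_t err = cudaDeviceSynchronize();  // cudaErrorAssert if a device assert fired
    printf("kernel finished: %s\n", cudaGetErrorString(err));

    cudaFree(d_progress);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that a failing device-side assert() stops the whole kernel rather than a single block, and as far as I understand the context is not usable afterwards, so this is more of an emergency brake than a recovery mechanism.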
The better approach may be to simply have a thread in the warp block make the information available to the host, and have the host process it and react to it.
But this would require a way for the host to terminate the work on the device, if needed.
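
One cooperative way to approximate this is mapped pinned (zero-copy) host memory: the device writes a heartbeat the host can read while the kernel runs, and the host sets a flag the kernel polls. This is only a sketch under assumptions: `Control`, `longRunning` and the timing constants are hypothetical, and the device must support mapped host memory. It does not solve the problem above, since a kernel that never reaches a polling point never sees the flag.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical control block living in mapped pinned host memory.
struct Control {
    volatile int abort;      // written by the host, polled by the device
    volatile int heartbeat;  // written by the device, read by the host
};

__global__ void longRunning(Control *ctl, int maxSteps) {
    __shared__ int stop;
    for (int step = 0; step < maxSteps; ++step) {
        if (threadIdx.x == 0) {
            stop = ctl->abort;  // one read per block, so all threads agree
            if (blockIdx.x == 0) {
                ctl->heartbeat = step;
                __threadfence_system();  // make the heartbeat visible to the host
            }
        }
        __syncthreads();
        if (stop) return;  // the whole block leaves together
        // Placeholder for the real work of this step: spin for a while.
        long long t0 = clock64();
        while (clock64() - t0 < 100000) { }
        __syncthreads();  // keep thread 0 from overwriting 'stop' too early
    }
}

int main() {
    Control *h_ctl = nullptr, *d_ctl = nullptr;
    if (cudaHostAlloc((void **)&h_ctl, sizeof(Control), cudaHostAllocMapped) != cudaSuccess)
        return 1;
    h_ctl->abort = 0;
    h_ctl->heartbeat = -1;
    cudaHostGetDevicePointer((void **)&d_ctl, h_ctl, 0);

    longRunning<<<4, 64>>>(d_ctl, 100000);

    // Host-side monitor: wait for some progress, but never longer than 2 s.
    auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(2);
    while (h_ctl->heartbeat < 10 && std::chrono::steady_clock::now() < deadline)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));

    h_ctl->abort = 1;  // ask the kernel to stop at its next polling point
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel ended: %s, last heartbeat: %d\n",
           cudaGetErrorString(err), h_ctl->heartbeat);

    cudaFreeHost(h_ctl);
    return err == cudaSuccess ? 0 : 1;
}
```

The flag is read by one thread and broadcast through shared memory so that no thread of a block returns while the others are still heading for a __syncthreads().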

How does the debug perspective / gdb manage to terminate kernels, seemingly from the host?