CUDA single-stepping and threads

Does single-stepping one CUDA kernel stop all other kernels, or can other kernels keep making progress?

CUDA-GDB Single Stepping https://docs.nvidia.com/cuda/cuda-gdb/index.html#single-stepping
Single-stepping device code is supported. However, unlike host code single-stepping, device code single-stepping works at the warp level. This means that single-stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.
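If you want to watch the warp-level behaviour yourself, a small divergent kernel is enough. The following is only a sketch, not something from the NVIDIA docs: the names divergent and run_divergent are made up for illustration, and I am assuming a debug build along the lines of nvcc -g -G file.cu.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Even and odd lanes of the same warp take different branches, so only one
// side of the branch is active at any given step.
__global__ void divergent(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[tid] = 2 * tid;   // even lanes
    } else {
        out[tid] = -tid;      // odd lanes
    }
}

// Hypothetical host helper: launches one block of two warps (64 threads)
// and returns true if every element has the expected value.
bool run_divergent()
{
    const int n = 64;
    int h_out[n];
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    divergent<<<1, n>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int expected = (i % 2 == 0) ? 2 * i : -i;
        if (h_out[i] != expected) return false;
    }
    return true;
}

int main()
{
    bool ok = run_divergent();
    std::printf("divergent kernel %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

Setting a breakpoint on divergent and stepping with next inside the branch should, per the quoted text, advance only the active lanes of the warp in focus; the second warp of the block stays where it was until you switch focus or resume.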

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location and then the application must be fully resumed.

A special case is single-stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier and all threads are resumed until the temporary breakpoint is hit.
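To try the barrier case, something like the block-reverse kernel below works. Again, this is a sketch under my own assumptions (the names reverse_block and run_reverse are hypothetical, and the block size is fixed at 64 so that two warps have to meet at the barrier).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // two warps per block

// Each thread stages one element in shared memory, waits at the barrier,
// then reads the element written by its mirror thread.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();                  // stepping over this line is the special case
    data[t] = tile[BLOCK - 1 - t];
}

// Hypothetical host helper: fills 0..63, reverses on the device, and
// returns true if the result is 63..0.
bool run_reverse()
{
    int h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
    if (cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    reverse_block<<<1, BLOCK>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK; ++i) {
        if (h[i] != BLOCK - 1 - i) return false;
    }
    return true;
}

int main()
{
    bool ok = run_reverse();
    std::printf("reverse_block %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

The read after the barrier depends on a write from a thread in the other warp, which is why stepping a single warp past __syncthreads() cannot work in isolation and the debugger has to resume all threads up to the temporary breakpoint.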

On GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.
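For reference, this is roughly what a non-inlined device function looks like. The function and kernel names (square_plus_one, apply, run_apply) are hypothetical; only the __noinline__ qualifier is the part the documentation is talking about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __noinline__ asks the compiler to keep this as a real call, so the
// debugger has a function to step into, over, or out of.
__device__ __noinline__ int square_plus_one(int x)
{
    return x * x + 1;
}

__global__ void apply(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = square_plus_one(tid);   // candidate line for step / next / finish
}

// Hypothetical host helper: runs one warp and checks out[i] == i*i + 1.
bool run_apply()
{
    const int n = 32;
    int h_out[n];
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    apply<<<1, n>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != i * i + 1) return false;
    }
    return true;
}

int main()
{
    bool ok = run_apply();
    std::printf("apply %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```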

With Dynamic Parallelism on sm_35, several CUDA APIs can now be instantiated from the device. The following list defines single-step behavior when encountering these APIs:

  • When encountering device side kernel launches (denoted by the <<<>>> launch syntax), the step and next commands will have the same behavior, and both will step over the launch call.
  • When encountering cudaDeviceSynchronize, the launch synchronization routine, the step and next commands will have the same behavior, and both will step over the call. When stepping over the call, the entire device is resumed until the call has completed, at which point the device is suspended (without user intervention).
  • When stepping a device grid launch to completion, focus will automatically switch back to the CPU. The cuda kernel focus switching command must be used to switch to another grid of interest (if one is still resident).

Note: It is not possible to step into a device launch call (nor the routine launched by the call).
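A minimal device-side launch to experiment with might look like the sketch below. The kernel names (parent, child) and the helper run_parent are hypothetical. I am assuming a build with relocatable device code, e.g. nvcc -g -G -rdc=true file.cu -lcudadevrt, with an -arch setting for a GPU of sm_35 or newer. I have left out the device-side cudaDeviceSynchronize call because, as far as I know, newer toolkits deprecate it, so only the launch bullet can be tried with this code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child grid: each of its threads increments one element.
__global__ void child(int *data)
{
    data[threadIdx.x] += 1;
}

// Parent grid: a single thread launches the child from the device.
__global__ void parent(int *data)
{
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(data);   // step and next should both step over this line
    }
}

// Hypothetical host helper: zeroes 4 ints, runs parent, expects all ones.
bool run_parent()
{
    const int n = 4;
    int h[n];
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMemset(d, 0, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    parent<<<1, 1>>>(d);
    // The parent grid is not complete until its child grid has finished,
    // so this host-side synchronize covers both.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h[i] != 1) return false;
    }
    return true;
}

int main()
{
    bool ok = run_parent();
    std::printf("parent/child launch %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```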

This is a careful (and useful) answer.
Thanks a lot