One way to do the L-infinity norm test is to have a convergence flag for each block in the kernel launch.

Each thread checks local convergence and sets a flag in a shared memory array indicating whether it has converged. At the end of the kernel, do a parallel reduction in shared memory, which leaves you with one flag per block indicating whether the whole block has converged. Write that flag back to global memory at the end of the kernel. You can then copy that status array back to the host periodically and test it to decide whether to launch more kernels. So a bare bones implementation in the kernel might look something like this:

```
__global__ void kernel(......., int *converged)
{
    __shared__ volatile int local_convergence[NTHREADS];
    __shared__ volatile int block_convergence;

    // Skip all work if this block already converged in an earlier launch
    if (threadIdx.x == 0) { block_convergence = converged[blockIdx.x]; }
    __syncthreads();
    if (block_convergence) return;

    // per thread computation goes here (produces new_val and old_val)

    // convergence test
    local_convergence[threadIdx.x] = (fabs(new_val - old_val) < tol);
    __syncthreads();

    // Parallel reduction of local_convergence into block_convergence here

    if (threadIdx.x == 0) { converged[blockIdx.x] = block_convergence; }
}
```
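The reduction left as a comment in the kernel could be filled in along these lines. This is only a minimal sketch, assuming the block size equals `NTHREADS` and is a power of two. The names `block_all_converged`, `convergence_test` and `count_converged_blocks` are made up for illustration, the early-exit check is left out to keep the focus on the reduction, and error checking is omitted.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define NTHREADS 256  // assumed block size; must be a power of two here

// AND-reduce per-thread flags (1 = converged) down to flags[0].
// Every thread in the block must call this.
__device__ int block_all_converged(volatile int *flags)
{
    __syncthreads();  // make every thread's flag visible first
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            flags[threadIdx.x] = flags[threadIdx.x] && flags[threadIdx.x + stride];
        }
        __syncthreads();
    }
    return flags[0];
}

// Hypothetical kernel: tests |new - old| < tol per element and
// writes one convergence flag per block.
__global__ void convergence_test(const float *new_vals, const float *old_vals,
                                 float tol, int n, int *converged)
{
    __shared__ volatile int local_convergence[NTHREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads count as converged so they do not spoil the AND.
    local_convergence[threadIdx.x] =
        (i >= n) || (fabsf(new_vals[i] - old_vals[i]) < tol);
    int block_flag = block_all_converged(local_convergence);
    if (threadIdx.x == 0) { converged[blockIdx.x] = block_flag; }
}

// Host helper (illustrative): returns how many blocks report convergence.
int count_converged_blocks(const float *h_new, const float *h_old, int n, float tol)
{
    assert(n > 0);
    const int blocks = (n + NTHREADS - 1) / NTHREADS;
    float *d_new, *d_old;
    int *d_converged;
    cudaMalloc(&d_new, n * sizeof(float));
    cudaMalloc(&d_old, n * sizeof(float));
    cudaMalloc(&d_converged, blocks * sizeof(int));
    cudaMemcpy(d_new, h_new, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_old, h_old, n * sizeof(float), cudaMemcpyHostToDevice);

    convergence_test<<<blocks, NTHREADS>>>(d_new, d_old, tol, n, d_converged);

    std::vector<int> h_converged(blocks);
    cudaMemcpy(h_converged.data(), d_converged, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_new);
    cudaFree(d_old);
    cudaFree(d_converged);
    return (int)std::count(h_converged.begin(), h_converged.end(), 1);
}
```

A logical AND is used because a block has converged only when every one of its threads has. If your block size is not a power of two, the loop needs an extra bounds check or a padded flag array.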

The effect is that once a block has converged, it does no more computation and just exits. Once all the blocks have converged, the whole kernel launch does nothing. Your host code can copy the convergence array back at suitable intervals and keep launching kernels until everything has converged. The per-block convergence array is small and cheap to copy back to the host, and you can overlap the check with kernel computation, which makes it effectively free of additional computational cost.
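The host side of this scheme might be sketched as follows. Everything here is an assumption made for illustration: the toy update `x = 0.5 * x + 1` stands in for the real per-thread computation, and `relax_step`, `iterate_until_converged`, `check_interval` and `max_rounds` are hypothetical names. To keep it short, the block-level reduction uses the built-in `__syncthreads_and()` instead of a hand-written shared memory reduction, the check uses a plain blocking `cudaMemcpy` rather than overlapping with computation, and error checking is omitted.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Toy iteration with a per-block convergence flag and early exit.
__global__ void relax_step(float *x, int n, float tol, int *converged)
{
    // Same value for every thread in the block, so the exit is uniform.
    if (converged[blockIdx.x]) return;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int done = 1;  // out-of-range threads count as converged
    if (i < n) {
        float old_val = x[i];
        float new_val = 0.5f * old_val + 1.0f;  // stand-in for the real update
        x[i] = new_val;
        done = fabsf(new_val - old_val) < tol;
    }
    // Barrier plus AND over the block: non-zero only if all threads are done.
    int block_done = __syncthreads_and(done);
    if (threadIdx.x == 0) { converged[blockIdx.x] = block_done; }
}

// Launches relax_step in batches of check_interval and inspects the flags
// after each batch. Returns the number of launches issued, or -1 if the
// flags are not all set after max_rounds batches.
int iterate_until_converged(float *d_x, int n, float tol,
                            int check_interval, int max_rounds)
{
    assert(check_interval > 0);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *d_converged;
    cudaMalloc(&d_converged, blocks * sizeof(int));
    cudaMemset(d_converged, 0, blocks * sizeof(int));
    std::vector<int> h_converged(blocks);

    int launches = 0;
    bool all_done = false;
    for (int round = 0; round < max_rounds && !all_done; ++round) {
        for (int k = 0; k < check_interval; ++k, ++launches) {
            relax_step<<<blocks, threads>>>(d_x, n, tol, d_converged);
        }
        // Blocking copy: waits for the launches above, then fetches the flags.
        cudaMemcpy(h_converged.data(), d_converged, blocks * sizeof(int),
                   cudaMemcpyDeviceToHost);
        all_done = std::all_of(h_converged.begin(), h_converged.end(),
                               [](int flag) { return flag != 0; });
    }
    cudaFree(d_converged);
    return all_done ? launches : -1;
}
```

A larger `check_interval` means fewer copies back to the host, at the price of up to `check_interval - 1` launches that do nothing after the last block converges.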