Reproducible arbitrary preemption examples?

Are there any known, somewhat reproducible examples of arbitrary block preemption while the device is used exclusively by a single compute process? What about when it is shared by several processes?

By “arbitrary” I mean that the block scheduler wasn’t forced to preempt but did so anyway. I would count the higher-priority stream case as arbitrary, and dynamic parallelism without synchronization too (although that is of less interest to me), but not debugging or dynamic parallelism with an on-device cudaDeviceSynchronize. With several processes, I mean that the scheduler wasn’t forced to preempt by the process whose blocks were preempted.

By “somewhat reproducible” I mean that it is possible to recreate such an event on certain models/architectures, Pascal or later, though not necessarily on every attempt.

I tried to create such an example using stream priority and using child grid launches without synchronization, but I did not observe any preemption on two Turing models.
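For reference, here is a minimal sketch of the kind of stream-priority experiment I mean. It is not my exact code, and the detection heuristic is an assumption: a low-priority kernel fills every SM with one resident wave of spinning blocks, a high-priority stream then launches small kernels, and each spinning block records the largest gap it saw between two consecutive %globaltimer reads and whether %smid ever changed. The names spin_and_watch, touch and largest_gap, the 200 ms spin time, and the block size of 256 are all made up for illustration. A large gap or a changed SM id would only hint at preemption, not prove it.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message on any CUDA runtime error.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(e_));                      \
            std::exit(1);                                              \
        }                                                              \
    } while (0)

// Read the PTX %globaltimer special register (nanosecond units).
__device__ unsigned long long global_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Read the PTX %smid special register (SM the thread runs on right now).
__device__ unsigned int current_sm() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Low-priority kernel: every thread spins for run_ns; thread 0 reports
// the largest gap between timer reads and whether its SM id changed.
__global__ void spin_and_watch(unsigned long long run_ns,
                               unsigned long long *max_gap,
                               unsigned int *moved) {
    const unsigned int first_sm = current_sm();
    const unsigned long long start = global_ns();
    unsigned long long prev = start, gap = 0;
    unsigned int sm_changed = 0;
    while (prev - start < run_ns) {
        const unsigned long long now = global_ns();
        if (now - prev > gap) gap = now - prev;
        if (current_sm() != first_sm) sm_changed = 1;
        prev = now;
    }
    if (threadIdx.x == 0) {
        max_gap[blockIdx.x] = gap;
        moved[blockIdx.x] = sm_changed;
    }
}

// High-priority kernel: trivial work that needs an SM slot to run.
__global__ void touch(unsigned int *counter) {
    if (threadIdx.x == 0) atomicAdd(counter, 1u);
}

// Host helper: largest recorded gap, 0 for an empty input.
unsigned long long largest_gap(const std::vector<unsigned long long> &gaps) {
    return gaps.empty() ? 0ULL : *std::max_element(gaps.begin(), gaps.end());
}

int main() {
    int least = 0, greatest = 0;  // numerically lower value = higher priority
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));
    cudaStream_t low, high;
    CHECK(cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking, least));
    CHECK(cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    const int threads = 256;  // assumption: arbitrary block size
    int per_sm = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, spin_and_watch,
                                                        threads, 0));
    assert(per_sm > 0);
    const int blocks = per_sm * prop.multiProcessorCount;  // one resident wave

    unsigned long long *d_gap = nullptr;
    unsigned int *d_moved = nullptr, *d_counter = nullptr;
    CHECK(cudaMalloc(&d_gap, blocks * sizeof(*d_gap)));
    CHECK(cudaMalloc(&d_moved, blocks * sizeof(*d_moved)));
    CHECK(cudaMalloc(&d_counter, sizeof(*d_counter)));
    CHECK(cudaMemset(d_counter, 0, sizeof(*d_counter)));

    // Spin for roughly 200 ms at low priority, then queue high-priority work.
    spin_and_watch<<<blocks, threads, 0, low>>>(200000000ULL, d_gap, d_moved);
    for (int i = 0; i < 64; ++i)
        touch<<<prop.multiProcessorCount, threads, 0, high>>>(d_counter);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    std::vector<unsigned long long> gaps(blocks);
    std::vector<unsigned int> moved(blocks);
    CHECK(cudaMemcpy(gaps.data(), d_gap, blocks * sizeof(*d_gap), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(moved.data(), d_moved, blocks * sizeof(*d_moved), cudaMemcpyDeviceToHost));
    std::printf("largest timer gap: %llu ns, blocks that changed SM: %ld\n",
                largest_gap(gaps),
                static_cast<long>(std::count(moved.begin(), moved.end(), 1u)));

    CHECK(cudaFree(d_gap));
    CHECK(cudaFree(d_moved));
    CHECK(cudaFree(d_counter));
    CHECK(cudaStreamDestroy(low));
    CHECK(cudaStreamDestroy(high));
    return 0;
}
```

Built with something like nvcc -o preempt_probe preempt_probe.cu (the file name is hypothetical). The spin time is kept short so that a display watchdog does not kill the kernel. Note that %globaltimer only updates at some coarse interval, so small gaps are expected and only outliers well above the typical gap would be interesting.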