CUDA 12.3: depending on scheduling rather than completion - how?

The CUDA 12.3 release notes indicate a new capability:

Launch completion events:

  • Allows a dependency on scheduling, but not completion, of all blocks in a kernel, enabling tighter control of scheduling.

How does one differentiate between a dependency on the scheduling of a kernel and a dependency on its execution? Where is that manifested in the (driver) API?

It’s about block scheduling rather than kernel scheduling.
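With cudaLaunchKernelEx it is possible to have a CUDA event recorded automatically as soon as the last block of the kernel begins execution (see cudaLaunchAttributeLaunchCompletionEvent, for example). Work in another stream that waits on that event then depends on the kernel's blocks having been scheduled, not on the kernel having finished.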
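Here is a minimal sketch of how that could look with the runtime API, assuming CUDA 12.3 or newer. The kernel names (producer, consumer), the buffer sizes and the CHECK macro are made up for illustration. As far as I know the driver-API counterpart is cuLaunchKernelEx with CU_LAUNCH_ATTRIBUTE_LAUNCH_COMPLETION_EVENT, and the event has to be created with timing disabled. The two kernels deliberately touch separate buffers, because the consumer may start while the producer is still running.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical kernels; they write to independent buffers.
__global__ void producer(int *a) {
    a[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

__global__ void consumer(int *b) {
    b[blockIdx.x * blockDim.x + threadIdx.x] = 2;
}

int main() {
    const int blocks = 64, threads = 128;
    int *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, blocks * threads * sizeof(int)));
    CHECK(cudaMalloc(&b, blocks * threads * sizeof(int)));

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    // Launch completion events must not record timing.
    cudaEvent_t launched;
    CHECK(cudaEventCreateWithFlags(&launched, cudaEventDisableTiming));

    // Ask the launch to record `launched` once all blocks of the
    // kernel have begun execution (nominally; best effort).
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeLaunchCompletionEvent;
    attr.val.launchCompletionEvent.event = launched;
    attr.val.launchCompletionEvent.flags = 0;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(blocks);
    cfg.blockDim = dim3(threads);
    cfg.dynamicSmemBytes = 0;
    cfg.stream = s1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    CHECK(cudaLaunchKernelEx(&cfg, producer, a));

    // s2 now depends on producer's blocks having started,
    // not on producer having completed.
    CHECK(cudaStreamWaitEvent(s2, launched, 0));
    consumer<<<blocks, threads, 0, s2>>>(b);
    CHECK(cudaGetLastError());

    CHECK(cudaDeviceSynchronize());
    printf("done\n");

    CHECK(cudaEventDestroy(launched));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

For contrast, a plain cudaEventRecord(launched, s1) issued after the launch would only fire once the producer has completed, which is the usual completion dependency. The documentation describes the launch completion behaviour as best effort, so the consumer may in practice still end up waiting for the producer to finish.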

It should be similar to programmatic dependent launch, introduced with Hopper (see the CUDA C++ Programming Guide).