Normally this would be done via a kernel launch: write the data, then launch the kernel.
Presumably, then, you are talking about a method using persistent threads/persistent kernels:
In that case, you would need a semaphore that the GPU persistent master thread is polling. The semaphore would also reside in mapped memory, and the host would write the semaphore only after filling in the data packet(s), so that the data is in place before the signal becomes visible. The GPU persistent master thread would then "signal" other threads/threadblocks to begin processing the packet(s). In the case of CUDA Dynamic Parallelism, this could be conceptually simplified by having the GPU master thread, once the semaphore is detected, launch child kernels to process the packet(s). However, you may not want to follow that approach for performance (in particular, latency) reasons.
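A minimal sketch of the polling idea might look like the following. This is not a production-ready implementation: the `Packet` struct, names, and trivial "processing" are all invented for illustration, and a real design would need to consider host/device memory-ordering guarantees more carefully than a simple `volatile` spin loop does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical packet format, for illustration only.
struct Packet { int payload; };

// Persistent kernel: thread 0 polls a semaphore in mapped (zero-copy)
// host memory, then signals the rest of the block to process the packet.
__global__ void persistentKernel(volatile int *sem, volatile Packet *pkt,
                                 int *result)
{
    __shared__ int go;
    if (threadIdx.x == 0) {
        while (*sem == 0) { /* spin until the host writes the semaphore */ }
        go = 1;
    }
    __syncthreads();
    if (go) {
        // Trivial stand-in for "processing the packet".
        atomicAdd(result, pkt->payload);
    }
    // (With CUDA Dynamic Parallelism, thread 0 could instead launch a
    //  child kernel here once the semaphore is detected.)
}

int main()
{
    int *sem; Packet *pkt; int *result;
    // Mapped allocations visible to both host and device (assumes UVA,
    // standard on 64-bit platforms, so the host pointer is usable in-kernel).
    cudaHostAlloc(&sem, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&pkt, sizeof(Packet), cudaHostAllocMapped);
    cudaMallocManaged(&result, sizeof(int));
    *sem = 0; *result = 0;

    persistentKernel<<<1, 32>>>(sem, pkt, result);  // kernel begins polling

    pkt->payload = 42;      // 1. fill in the data packet
    __sync_synchronize();   // 2. CPU-side fence: packet write first...
    *sem = 1;               // 3. ...then write the semaphore

    cudaDeviceSynchronize();
    printf("result = %d\n", *result);
    return 0;
}
```

The ordering in `main` matters: if the semaphore write could become visible to the GPU before the packet write, the kernel might process stale data.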