Does cudaStreamWaitEvent(stream2, event1, 0) also block the stream that records event1?

First of all, even if you have satisfied all necessary conditions, CUDA provides no guarantees of any sort of concurrency.

As a practical matter, to witness kernel concurrency, you should first verify that the GPU has sufficient resources to run both kernels at the same time. If you want help with this aspect, it’s a good idea to let others know what GPU you are running on.
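As a rough way to judge the resource question, you can compare the grid size of one kernel against how many of its blocks the GPU can hold at once. This is only a sketch: kernelA, the block size of 256, and the grid size of 8 are placeholders I made up, so substitute your own kernel and launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the kernels under test.
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;  // assumed launch configuration
    const int gridSize  = 8;    // assumed; replace with your real grid size

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    // How many blocks of kernelA can be resident on a single SM at once.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, kernelA, blockSize, 0) != cudaSuccess) {
        fprintf(stderr, "occupancy query failed\n");
        return 1;
    }

    // Upper bound on simultaneously resident blocks of kernelA on the whole GPU.
    int capacity = blocksPerSM * prop.multiProcessorCount;

    printf("GPU: %s, SMs: %d\n", prop.name, prop.multiProcessorCount);
    printf("kernelA: %d resident blocks per SM, %d across the GPU\n",
           blocksPerSM, capacity);
    if (gridSize >= capacity)
        printf("A grid of %d blocks can fill the GPU by itself; "
               "little room is left for a second kernel.\n", gridSize);
    else
        printf("A grid of %d blocks leaves room; a second kernel "
               "could run concurrently.\n", gridSize);
    return 0;
}
```

Treat the result as a heuristic only. A kernel that fills the GPU usually leaves nothing for a second kernel until its blocks start to retire, but room on paper still does not guarantee that the kernels will overlap.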

If you are running on CUDA 12.2 or newer, my guess would be that you are running into CUDA lazy module loading. In your test case, you call each kernel only once, so the first (and only) launch of each kernel also has to load that kernel, and that load can force a device sync if lazy loading is in effect. This would prevent any sort of kernel concurrency, even if you have properly provided for it.
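To see which loading mode your process actually gets, you can ask the driver. The sketch below assumes a CUDA 11.7 or newer toolkit, where the driver API call cuModuleGetLoadingMode is available, and it has to be linked against the driver library (for example nvcc query.cu -lcuda, where query.cu is just a file name I picked).

```cuda
#include <cstdio>
#include <cuda.h>

int main() {
    // The driver API must be initialized before the query.
    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed\n");
        return 1;
    }

    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) != CUDA_SUCCESS) {
        fprintf(stderr, "cuModuleGetLoadingMode failed\n");
        return 1;
    }

    // Report the mode the driver selected for this process.
    printf("module loading mode: %s\n",
           mode == CU_MODULE_LAZY_LOADING ? "lazy" : "eager");
    return 0;
}
```

Running it once normally and once with CUDA_MODULE_LOADING=EAGER set in the environment should show whether the variable is being picked up.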

You could test this by running (i.e. profiling) your code with:

CUDA_MODULE_LOADING=EAGER nsys profile ...
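Another way to take lazy loading out of the picture is to launch each kernel once as a warm-up before the part you want to look at in the profiler. Here is a minimal sketch of that idea. kernelA, kernelB, and the spin count are placeholders I made up, not your code, and as noted above the overlap itself is never guaranteed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-ins for the two kernels: each spins for
// roughly `cycles` clock ticks so it is easy to see in a timeline.
__global__ void kernelA(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

__global__ void kernelB(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t stream1, stream2;
    cudaEvent_t event1;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));
    CHECK(cudaEventCreateWithFlags(&event1, cudaEventDisableTiming));

    // Warm-up: launch each kernel once so any lazy loading
    // happens here and not in the section being profiled.
    kernelA<<<1, 1>>>(0);
    kernelB<<<1, 1>>>(0);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    const long long spin = 100000000LL;  // assumed spin length, tune as needed

    // stream1: kernelA, record event1, then more work.
    kernelA<<<1, 32, 0, stream1>>>(spin);
    CHECK(cudaEventRecord(event1, stream1));
    kernelA<<<1, 32, 0, stream1>>>(spin);  // stream1 keeps going after the record

    // stream2: independent work first, then wait on event1.
    kernelB<<<1, 32, 0, stream2>>>(spin);  // may overlap with the first kernelA
    CHECK(cudaStreamWaitEvent(stream2, event1, 0));
    kernelB<<<1, 32, 0, stream2>>>(spin);  // starts only after the first kernelA is done

    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventDestroy(event1));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    printf("done\n");
    return 0;
}
```

The cudaStreamWaitEvent call only holds back work submitted to stream2 after it; the stream that records event1 is not blocked by it. In the profiler timeline, look at the launches after the warm-up when checking for overlap.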