Is there a mechanism to make a kernel launch on one stream (call it kernel_stream2) depend on a kernel launch on a different stream (kernel_stream1), so that kernel_stream2 always starts after kernel_stream1 has started, while the two kernels can still overlap in execution? In my use case, setting CUDA_DEVICE_MAX_CONNECTIONS=1 is not a viable option because it harms parallelism.
One possibility might be programmatic dependent launch. It doesn’t do exactly what you are asking, but it is a way to do fine-grained sequencing of one kernel launch after another.
I believe programmatic dependent launch matches your description, with the exception that the two kernels in question must be launched into the same stream. This is sort of weird, because in this particular case overlap is still possible (something we would not normally expect to see for two kernels launched into the same stream).
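To make the idea concrete, here is a minimal sketch of programmatic dependent launch with both kernels in a single stream. It assumes a GPU of compute capability 9.0 or higher and a CUDA toolkit that provides cudaLaunchKernelEx. The names primary_kernel, secondary_kernel, and run_pdl_demo are hypothetical and exist only for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Return 1 from the enclosing function if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err_));    \
            return 1;                                                  \
        }                                                              \
    } while (0)

// Hypothetical primary kernel: writes its results, then signals that the
// dependent (secondary) kernel may begin launching.
__global__ void primary_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;

    // Once every block has reached this call (or exited), the secondary
    // kernel is allowed to start, even though this grid is still running.
    cudaTriggerProgrammaticLaunchCompletion();

    // Any work placed here may overlap with the secondary kernel.
}

// Hypothetical secondary kernel: may start early, but waits before reading
// anything the primary kernel produced.
__global__ void secondary_kernel(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // independent setup work

    // Blocks until the primary grid has completed and flushed its writes.
    cudaGridDependencySynchronize();

    if (i < n) out[i] = 2 * data[i];
}

// Returns 0 on success, nonzero on a CUDA error or a wrong result.
int run_pdl_demo()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *data = nullptr, *out = nullptr;
    cudaStream_t stream;
    CHECK(cudaMalloc(&data, n * sizeof(int)));
    CHECK(cudaMalloc(&out, n * sizeof(int)));
    CHECK(cudaStreamCreate(&stream));

    // The primary kernel is launched normally.
    primary_kernel<<<blocks, threads, 0, stream>>>(data, n);
    CHECK(cudaGetLastError());

    // The secondary kernel goes into the SAME stream, with the attribute
    // that opts in to programmatic stream serialization.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(blocks);
    config.blockDim = dim3(threads);
    config.dynamicSmemBytes = 0;
    config.stream = stream;
    config.attrs = attr;
    config.numAttrs = 1;
    CHECK(cudaLaunchKernelEx(&config, secondary_kernel, data, out, n));

    std::vector<int> host(n);
    CHECK(cudaMemcpyAsync(host.data(), out, n * sizeof(int),
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    int status = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * i) { status = 2; break; }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(data));
    CHECK(cudaFree(out));
    return status;
}
```

This would be compiled for the target architecture (for example with nvcc -arch=sm_90) and run_pdl_demo() called from main. Note that the launch attribute is set only on the secondary kernel, and that both launches target the same stream, which is exactly the limitation described above.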
I don’t know of anything that works like programmatic dependent launch but connects kernels launched into two separate streams.
Perhaps you should read the documentation I linked and see if you agree with my assessment. But beyond that, I have no further suggestions.