Should the legacy default stream behave serially across multiple host processes/contexts?

According to the documentation, there is one particular NULL stream that serves as the legacy default stream on a GPU device. If I use the legacy default stream in multiple host processes, should that cause my kernel executions to be serialized, since they would all share the same default stream?
But in my experiments, kernels launched into these default streams can sometimes execute concurrently, as in the picture:

As far as I know, PyTorch uses the legacy default stream rather than the per-thread default stream, and I ran the experiment with PyTorch. This confuses me.
PS: I have turned on MPS.

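Multiple processes ordinarily serialize kernel execution between processes. This is independent of which streams each process is using: a stream, including the legacy default stream, belongs to its own process's context, so the default stream in one process is not the same stream as the default stream in another. Turning on MPS, however, allows at least the possibility that kernels from independent processes can overlap. Again, this possibility is independent of which streams each process is using.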
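If you want to check this on your own machine, here is a rough sketch of a timing experiment in PyTorch. It is not a definitive benchmark. The function names (`slowdown`, `worker`, `run`, `main`) and the matrix size and iteration count are my own assumptions, chosen only for illustration. Each process issues matmuls without naming a stream, so they go to that process's default stream, and the script compares the wall time of one process against two running at once.

```python
import multiprocessing as mp
import time


def slowdown(solo_s, concurrent_s):
    """Ratio of concurrent wall time to solo wall time (hypothetical helper)."""
    if solo_s <= 0:
        raise ValueError("solo_s must be positive")
    return concurrent_s / solo_s


def worker(barrier, queue, size, iters):
    import torch  # imported lazily so this file still loads without PyTorch

    x = torch.randn(size, size, device="cuda")
    torch.cuda.synchronize()
    barrier.wait()  # start timing only once every process is ready
    t0 = time.perf_counter()
    for _ in range(iters):
        y = x @ x  # no stream given, so this uses the process's default stream
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    queue.put(time.perf_counter() - t0)


def run(n_procs, size=512, iters=2000):
    """Run n_procs workers at once and return the slowest worker's time."""
    ctx = mp.get_context("spawn")  # spawn: each child creates its own CUDA context
    barrier = ctx.Barrier(n_procs)
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(barrier, queue, size, iters))
        for _ in range(n_procs)
    ]
    for p in procs:
        p.start()
    times = [queue.get(timeout=600) for _ in procs]
    for p in procs:
        p.join()
    return max(times)


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; skipping the experiment.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device available; skipping the experiment.")
        return
    solo = run(1)
    both = run(2)
    print(f"solo: {solo:.3f}s  two processes: {both:.3f}s  "
          f"slowdown: {slowdown(solo, both):.2f}x")


if __name__ == "__main__":
    main()
```

Run it once with the MPS daemon stopped and once after starting it (for example with `nvidia-cuda-mps-control -d`). A slowdown near 2x suggests the two processes' work is effectively being serialized; a value closer to 1x suggests some overlap. Treat the number as a hint only: if a single process already fills the GPU, there is little room for overlap even with MPS, and launch overhead can dominate for small kernels. A profiler timeline is the more reliable way to see whether kernels actually overlap.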
