According to the documentation, each GPU device has one particular NULL stream that serves as the legacy default stream. If I use the legacy default stream from multiple host processes, should my kernel executions be serialized, since the processes would all share the same default stream?
But in my experiments, kernels launched on these default streams sometimes execute concurrently, as in the picture:
As far as I know, PyTorch uses the legacy default stream rather than the per-thread default stream, and I ran the experiment with PyTorch, so this result confuses me.
PS: I have MPS (Multi-Process Service) turned on.
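
For reference, here is a minimal sketch of the kind of experiment described above. It is not the exact script: the matrix size, the iteration count, and the helper names `cuda_available`, `worker`, `run`, and `slowdown` are illustrative assumptions. Each process issues matmuls without selecting a stream, so they go to that process's current stream, which in PyTorch is the default stream unless changed.

```python
import multiprocessing as mp
import time


def cuda_available():
    """Return True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def worker(n_iters, size, barrier, out_queue):
    """Time n_iters matmuls issued on this process's default stream."""
    import torch  # imported here so each spawned process sets up its own CUDA state

    x = torch.randn(size, size, device="cuda")
    torch.cuda.synchronize()
    barrier.wait()  # start all processes at roughly the same time
    start = time.perf_counter()
    for _ in range(n_iters):
        y = x @ x  # no stream selected, so this is enqueued on the current (default) stream
    torch.cuda.synchronize()
    out_queue.put(time.perf_counter() - start)


def run(n_procs, n_iters=200, size=1024):
    """Run n_procs workers at once and return the slowest elapsed time in seconds."""
    ctx = mp.get_context("spawn")  # spawn avoids forking CUDA state
    barrier = ctx.Barrier(n_procs)
    out_queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(n_iters, size, barrier, out_queue))
        for _ in range(n_procs)
    ]
    for p in procs:
        p.start()
    times = [out_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return max(times)


def slowdown(solo_s, concurrent_s):
    """Ratio of concurrent to solo elapsed time."""
    return concurrent_s / solo_s


if __name__ == "__main__":
    if cuda_available():
        solo = run(1)
        both = run(2)
        print(f"solo {solo:.3f}s, two processes {both:.3f}s, "
              f"slowdown {slowdown(solo, both):.2f}x")
    else:
        print("No CUDA device visible to PyTorch; nothing to measure.")
```

The slowdown ratio is only a rough indicator: a value near the number of processes would be consistent with serialization, but also with a GPU that a single process already saturates, so a timeline from a profiler is the more reliable check.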