"--default-stream per-thread" in a multi-GPU environment not working as expected?

Hi, just a simple question:

How is “--default-stream per-thread” expected to work in a multi-GPU environment? I’m expecting that each CPU thread has its own default stream on each GPU. Is that right?

I ask because when I run the application on a multi-GPU system but use only a single GPU, all my CUDA streams (launched from different CPU threads) appear to be serialized.

When I prefix my application with “CUDA_VISIBLE_DEVICES=0”, all my CUDA streams behave normally.

Is this compilation option not supported on multi-GPU systems?

All streams, regardless of compile settings, have an inherent device association. Beyond that, I don’t know of any explanation for your observation.
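
To narrow this down, a small standalone test may help. The following is only a sketch under stated assumptions, not a diagnosis of the application above: the kernel name `spin`, the helper `worker`, the thread count, the spin length, and the file name `repro.cu` are all made up for illustration. It relies on two documented behaviors: the current device is per-host-thread state (a new host thread starts on device 0 until it calls `cudaSetDevice`), and with per-thread default streams enabled, stream `0` refers to the calling thread's own default stream on its current device.

```cuda
// Hypothetical repro, assumed to be built with:
//   nvcc --default-stream per-thread -std=c++11 repro.cu -o repro
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Busy-wait kernel so that overlap (or serialization) is visible on a timeline.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Each host thread selects its device explicitly, then launches into the
// default stream. With --default-stream per-thread this is the thread's
// own default stream; without the flag it is the legacy default stream.
void worker(int device, int id) {
    cudaError_t err = cudaSetDevice(device);
    if (err == cudaSuccess) {
        spin<<<1, 1>>>(100000000LL);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // Stream 0 follows the compile-time default-stream setting.
        err = cudaStreamSynchronize(0);
    }
    printf("thread %d on device %d: %s\n", id, device, cudaGetErrorString(err));
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("visible devices: %d\n", deviceCount);
    if (deviceCount == 0) return 1;

    const int numThreads = 4;  // arbitrary choice for the test
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back(worker, 0, i);  // all threads target device 0
    }
    for (auto& t : threads) t.join();

    cudaDeviceSynchronize();
    return 0;
}
```

Running this once as-is and once with CUDA_VISIBLE_DEVICES=0, and comparing the kernel timelines in a profiler (for example `nsys profile ./repro`), should show whether the four `spin` kernels overlap in both cases. If they overlap here but not in the real application, the difference is more likely in how the application selects devices or synchronizes than in the compile option itself.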