Hi, just a simple question:
How is “–default-stream per-thread” expected to work on multi-GPU environment, I’m expecting that each CPU thread has its own stream on each GPU, right?
Because when I run the application on multi-GPU environment, but use only single GPU, it looks like all my CUDA streams (from CPU threads) get serialized.
When I put “CUDA_VISIBLE_DEVICES=0” as prefix to my application, all my CUDA streams behave normally.
Is this compilation option not supported on multi-GPU systems?