[Stream Efficiency] Noticed exclusive stream exec in LU. How can we monitor the stream efficiency?

When I tried LU decomposition and forward/backward substitution, I noticed exclusive stream execution on GPU.

The code performs LU for 8 independent matrices (dimension 10k each), with each matrix assigned to its own stream. What I noticed is that for LU and forward substitution the streams execute concurrently, but for backward substitution the streams run exclusively on the GPU, one at a time.

Is there any tool to monitor stream efficiency (besides nvprof)?
Also, is there any tool to monitor SM efficiency?

Thanks

I’m not sure what you mean by stream efficiency…

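Remember that all non-default streams can execute out of order and concurrently with respect to one another. Whether they actually run concurrently depends directly on the resources available on the target GPU. If a kernel in one stream is using all available resources (registers, shared memory, resident thread-block slots), kernels in the other streams cannot start until it finishes, so the streams end up serialized.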
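To make that concrete, here is a minimal sketch (not your LU code) that launches a deliberately small kernel into 8 streams. `busyKernel`, the buffer size, and the loop count are made-up placeholders; the point is only that each launch uses a few thread blocks, so the hardware has room to overlap the streams. If you grow the grid until one launch fills the GPU, the same code should show the streams running back to back in a profiler timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one step of the per-matrix work.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        // Artificial work so the kernel runs long enough to see in a timeline.
        for (int k = 0; k < 10000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main() {
    const int kStreams = 8;
    const int n = 1 << 10;   // small on purpose: 4 blocks of 256 threads per launch

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("concurrentKernels=%d, SMs=%d\n",
           prop.concurrentKernels, prop.multiProcessorCount);

    cudaStream_t streams[kStreams];
    float *buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemsetAsync(buf[s], 0, n * sizeof(float), streams[s]);
    }

    // One launch per stream; these are allowed to overlap if resources permit.
    for (int s = 0; s < kStreams; ++s) {
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

Whether the launches actually overlap still depends on the device, so check the timeline rather than assuming it.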

A more important metric to look at is kernel occupancy.
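As a rough sketch of how to check that from code, the runtime occupancy API reports how many blocks of a given kernel can be resident on one SM. `backSubKernel` below is a hypothetical placeholder, and the block size of 256 with no dynamic shared memory is an assumption; substitute your real backward-substitution kernel and launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder; replace with the real backward-substitution kernel.
__global__ void backSubKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;        // assumed launch configuration
    const size_t dynamicSmem = 0;     // assumed: no dynamic shared memory

    // How many blocks of this kernel fit on one SM at this block size.
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, backSubKernel, blockSize, dynamicSmem);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy: resident threads / max resident threads per SM.
    float occupancy = (float)(blocksPerSM * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;

    printf("blocks per SM: %d\n", blocksPerSM);
    printf("theoretical occupancy: %.2f\n", occupancy);
    printf("blocks resident device-wide: %d\n",
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

If the grid of a single backward-substitution launch is at least as large as the device-wide resident block count printed here, that one kernel can occupy every SM by itself, which would be consistent with the streams appearing exclusive.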