While running LU decomposition followed by forward/backward substitution, I noticed that some of my streams execute exclusively (one at a time) on the GPU.
The code performs LU on 8 independent matrices (dimension 10k), with each matrix assigned to its own stream. For LU and forward substitution, the streams execute concurrently. For backward substitution, however, the streams run exclusively: only one stream is active on the GPU at a time.
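To make "concurrent" versus "exclusive" concrete, one rough measure is the ratio of summed kernel time to the wall-clock span those kernels cover, computed from per-kernel start/end timestamps taken from a profiler timeline. The sketch below is only an illustration under that assumption: the `Interval` struct and the `overlap_factor` function are hypothetical names, not part of any CUDA tool or of my actual code.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record of one kernel's execution window on the timeline,
// in milliseconds (e.g. copied by hand from a profiler's timeline view).
struct Interval {
    double start;
    double end;
};

// Ratio of summed kernel time to the wall-clock span covered by all kernels.
// A value near 1.0 means the kernels ran back to back (serialized); a value
// near the number of streams means they overlapped almost fully. Idle gaps
// between kernels push the value below 1.0.
double overlap_factor(const std::vector<Interval>& kernels) {
    if (kernels.empty()) return 0.0;

    double busy = 0.0;                       // sum of individual durations
    double first = kernels.front().start;    // earliest start seen so far
    double last = kernels.front().end;       // latest end seen so far
    for (const Interval& k : kernels) {
        busy += k.end - k.start;
        first = std::min(first, k.start);
        last = std::max(last, k.end);
    }

    double span = last - first;
    return span > 0.0 ? busy / span : 0.0;
}
```

With made-up numbers, four kernels that all run from 0 ms to 10 ms give a factor of 4.0, while four 10 ms kernels that run one after another from 0 ms to 40 ms give 1.0. The second case is what I seem to be seeing for backward substitution.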
Is there any tool for monitoring stream efficiency (besides nvprof)?
Also, is there any tool for monitoring SM efficiency?
Thanks