I'm curious whether, and how, splitting one large kernel's work across multiple CUDA streams, instead of launching it on a single stream, can improve performance. For example, cuSPARSELt library calls such as matrix multiplication (cusparseLtMatmul) take an array of streams and the number of streams as arguments. I don't know how these streams are used internally, but my guess is that the library decomposes the original kernel into pieces and runs them on the different streams. So, can streams be used to raise overall occupancy on the device beyond what a single kernel can achieve, or are there other cases where this is useful, say for a matrix multiplication kernel?
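To make the question concrete, here is a minimal sketch of the kind of decomposition I have in mind. The kernel `scale`, the problem size, and the two-way split are all hypothetical stand-ins chosen for illustration; this is not a claim about what cuSPARSELt does internally, which I don't know.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for "one large kernel".
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;              // total elements (assumed divisible by numStreams)
    const int numStreams = 2;           // assumption: a two-way split
    const int chunk = n / numStreams;   // elements handled per stream
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    // Host input: all ones, so every element should end up as 2.0f.
    std::vector<float> h_data(n, 1.0f);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    // Instead of one launch over all n elements, issue one launch per stream,
    // each over a disjoint slice. The launches have no dependency on each
    // other, so the device is allowed (not required) to run them concurrently.
    for (int s = 0; s < numStreams; ++s) {
        scale<<<blocks, threads, 0, streams[s]>>>(d_data + s * chunk, chunk, 2.0f);
    }

    // Check for launch errors, then wait for all streams to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_data.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify that both slices were processed.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_data[i] != 2.0f) { ok = false; break; }
    }
    std::printf("%s\n", ok ? "ok" : "mismatch");

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

The sketch only shows that the two launches are independent; whether they actually overlap on the device is decided by the hardware scheduler and depends on how many resources each launch occupies. That is exactly the part I'm asking about.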