OK, pinning everything would remove some of the unnecessary rows that separate the different streams, but we still have around 27 streams across 1, 2, or 3 GPUs (depending on the configuration).
The code base is very large, there are many developers, and I need to make sure (among many other things) that no thread is switching contexts, which is easier if I have a compact view of kernel executions and memory transfers for all GPUs.
The “CUDA Kernel running” and “Memory operation in progress” rows would be the ones I would like to see, if only I could differentiate individual kernels and transfers without having to look at the individual streams.
Let me include a screenshot of a tiny and ugly example (with a few bad practices, like cudaDeviceSynchronize):
https://ibb.co/LnFtzNC
Here you can see that I didn’t expand the streams section. For a first review, I don’t need to, mainly because I can see all kernel executions in the “Compute” row, and I can even see how their endings and beginnings overlap, because they were enqueued in different streams. So there is no waiting time between kernels.
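That kind of back-to-back overlap on the “Compute” row comes from launching each kernel into its own non-default stream. A minimal sketch of the pattern (the kernel, buffer names, sizes, and stream count are all invented for illustration, not taken from my real code):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, just to have something to enqueue.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 3;
    const int n = 1 << 20;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
    }

    // Each launch goes to its own stream, so on the profiler's
    // collapsed "Compute" row the tail of one kernel can overlap
    // the head of the next.
    for (int s = 0; s < kStreams; ++s)
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaDeviceSynchronize();  // fine at the very end; mid-pipeline it serializes

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```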
I can also see that the first three memory transfers in the “Memory” row are HtoD, because they are green, and that they overlap with a long kernel execution that uses data different from what is currently being transferred and was enqueued in a different stream.
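For that HtoD/compute overlap to appear at all, the host buffers must be pinned and the copies issued asynchronously. A hedged sketch of the pattern, assuming some independent kernel `work` and invented buffer names and sizes:

```cuda
// Sketch: overlap an HtoD copy with a kernel that uses *other* data,
// each in its own stream.
float *h_in, *d_in, *d_other;
cudaMallocHost(&h_in, n * sizeof(float));  // pinned host memory: required for the copy to overlap
cudaMalloc(&d_in, n * sizeof(float));
cudaMalloc(&d_other, n * sizeof(float));

cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

// The copy (green HtoD in the Memory row) and the kernel (Compute row)
// touch different buffers and run in different streams, so the GPU is
// free to execute them concurrently.
cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                cudaMemcpyHostToDevice, copyStream);
work<<<blocks, threads, 0, computeStream>>>(d_other, n);
```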
Finally, I can see that the execution ends with three DtoD transfers (in light blue), which overlap slightly among themselves and with the last kernel.
If there were DtoH transfers, they would appear in the Memory row in purple, and they can overlap with HtoD transfers, DtoD transfers, and kernels.
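The tail-end DtoD copies (and any DtoH ones) follow the same recipe: issue them with cudaMemcpyAsync on distinct streams and they can overlap each other and the last kernel. A fragmentary sketch, with all buffer and stream names invented:

```cuda
// Sketch: three DtoD copies on separate streams (light blue in the
// Memory row); they can overlap each other and a still-running kernel.
for (int s = 0; s < 3; ++s)
    cudaMemcpyAsync(dst[s], src[s], n * sizeof(float),
                    cudaMemcpyDeviceToDevice, streams[s]);

// A DtoH copy (purple) can overlap too, provided the host
// destination h_out is pinned (cudaMallocHost).
cudaMemcpyAsync(h_out, d_result, n * sizeof(float),
                cudaMemcpyDeviceToHost, copyStream);
```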
With this, in a very quick overview, I can spot bad behaviors and start looking for the offending code.