I have a use case where all kernels aggregate their results into the same memory address. The CPU keeps launching kernels without waiting for any of them to complete; only after all kernels have finished does it read the result from that address.
Since kernel launches are non-blocking, the CPU will inevitably issue work faster than the GPU can execute it. What are the expected consequences of this on the GPU and CPU sides?
What I currently observe is that the CPU seems to be somewhat busy and the overall run time increases.
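For reference, here is a minimal sketch of the launch pattern I am describing. The names (`accumulate`, `runLaunches`, `d_result`), the launch dimensions, and the use of `atomicAdd` for the aggregation are illustrative assumptions, not my exact code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: every thread adds 1 to the same global address.
__global__ void accumulate(unsigned long long *result)
{
    atomicAdd(result, 1ULL);
}

// Launch `numLaunches` kernels back to back, wait once, then read the result.
unsigned long long runLaunches(int numLaunches, int blocks, int threads)
{
    unsigned long long *d_result = nullptr;
    unsigned long long h_result = 0;

    check(cudaMalloc(&d_result, sizeof(*d_result)), "cudaMalloc");
    check(cudaMemset(d_result, 0, sizeof(*d_result)), "cudaMemset");

    for (int i = 0; i < numLaunches; ++i) {
        // The launch returns to the host without waiting for the kernel.
        accumulate<<<blocks, threads>>>(d_result);
    }
    check(cudaGetLastError(), "kernel launch");

    // Single wait for all queued kernels, then read the shared address.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    check(cudaMemcpy(&h_result, d_result, sizeof(h_result),
                     cudaMemcpyDeviceToHost), "cudaMemcpy");
    check(cudaFree(d_result), "cudaFree");
    return h_result;
}
```

In this sketch, calling `runLaunches(n, b, t)` should return `n * b * t`, since each thread of each launch contributes exactly one increment.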
Thanks for the help!