I have no answers, but some related points:
- In addition to the limits you mention, there is a limit on the number of resident grids per device. See Table 18.
- If you use dynamic parallelism, you can easily reach the grid limit using only one stream created by the host.
An approach I have been exploring recently:
- create only a small number of streams on the host, e.g., one or two streams for each desired stream priority
- use a combination of host launches and dynamic parallelism to achieve the concurrency you want
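A rough sketch of the two bullets above, not a tuned implementation. The kernel names `parent` and `child`, the helper `run_sketch`, and the sizes are all made up for illustration. I am assuming a device that supports dynamic parallelism and a build along the lines of `nvcc -rdc=true sketch.cu -lcudadevrt`. The host creates just two streams (highest and lowest priority); each block of the host-launched parent grid then launches its own child grid, so the extra concurrency comes from the device side rather than from more host streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical child kernel: adds 1 to each element of its chunk.
__global__ void child(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical parent kernel: one thread per block launches a child grid
// over that block's chunk. Each child grid counts as a resident grid.
__global__ void parent(int *data, int chunk)
{
    if (threadIdx.x == 0) {
        child<<<(chunk + 127) / 128, 128>>>(data + blockIdx.x * chunk, chunk);
    }
}

// Returns the number of elements that do not hold the expected value.
int run_sketch()
{
    const int kStreams = 2, kBlocks = 4, kChunk = 256;
    const int n = kBlocks * kChunk;

    // Lower numbers mean higher priority, so "greatest" <= "least".
    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));
    const int prio[kStreams] = {greatest, least};

    cudaStream_t streams[kStreams];
    int *dev[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreateWithPriority(&streams[s], cudaStreamNonBlocking, prio[s]));
        CHECK(cudaMalloc(&dev[s], n * sizeof(int)));
        CHECK(cudaMemsetAsync(dev[s], 0, n * sizeof(int), streams[s]));
        // Host launch: one parent grid per stream; children are launched on the device.
        parent<<<kBlocks, 32, 0, streams[s]>>>(dev[s], kChunk);
        CHECK(cudaGetLastError());
    }

    int bad = 0;
    std::vector<int> host(n);
    for (int s = 0; s < kStreams; ++s) {
        // A parent grid is not complete until its child grids are complete.
        CHECK(cudaStreamSynchronize(streams[s]));
        CHECK(cudaMemcpy(host.data(), dev[s], n * sizeof(int), cudaMemcpyDeviceToHost));
        for (int v : host) bad += (v != 1);
        CHECK(cudaFree(dev[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return bad;
}

int main()
{
    int bad = run_sketch();
    printf("mismatched elements: %d\n", bad);
    return bad != 0;
}
```

Note that every child launch here still counts toward the resident grid limit mentioned above, so the number of parent blocks that launch children is the knob to watch.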
In addition to the links you posted, beware of Increased time to synchronize….