Yes, that’s why I had hopes for MCDM and WDDM low-latency. (Yes, I am artificially increasing the number of links to that thread, just in case somebody at NVIDIA notices it and takes the time to answer.)
A kernel that takes a performance hit from WDDM also takes a performance hit from kernel launch overhead in general, and increasingly so as faster GPUs are deployed in the future. That was the motivation behind my questions: Are there steps that could be taken to reduce the general exposure of this code to the launch latency issue?
And you are right: there could have been some ways that I did not even know I could explore. But that did not happen this time; it seems I am still bound by the WDDM limitation :-(
I might still try to create “super kernels” that just gather work from sub-kernels, but at the cost of more memory, because the temporary buffers will have to be duplicated for sub-kernels that run concurrently (rather than pipelined). This is a solution I wanted to avoid because it involves a lot of template instantiation and (very) long compile times, which will be even more critical with those super kernels.
(And this is a LOT of work and glue code compared to the very few opportunities where it could be applied.)
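To make the “super kernel” idea concrete, here is a minimal sketch, not the actual code discussed in this thread. It assumes two hypothetical sub-kernels (`scaleTask` and `offsetTask`, names invented for illustration) that are independent of each other, rewrites them as `__device__` functions, and dispatches them from a single launch based on the block index. Note the two temporary buffers `tmpA` and `tmpB`: because both sub-tasks run in the same launch, each needs its own buffer, which is the extra memory cost mentioned above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sub-kernels, turned into __device__ functions so that a
// single launch can run both of them. Workloads are illustrative only.
__device__ void scaleTask(const float* in, float* out, int n, int localBlock)
{
    int i = localBlock * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

__device__ void offsetTask(const float* in, float* out, int n, int localBlock)
{
    int i = localBlock * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// "Super kernel": blocks [0, blocksA) run scaleTask, the remaining blocks
// run offsetTask. The branch is uniform within a block, so it does not
// cause divergence inside a warp.
__global__ void superKernel(const float* in, float* tmpA, float* tmpB,
                            int n, int blocksA)
{
    if (blockIdx.x < blocksA)
        scaleTask(in, tmpA, n, blockIdx.x);
    else
        offsetTask(in, tmpB, n, blockIdx.x - blocksA);
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocksA = (n + threads - 1) / threads;
    const int blocksB = blocksA;
    const size_t bytes = n * sizeof(float);

    std::vector<float> hostIn(n), hostA(n), hostB(n);
    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    // Two temporaries are alive at the same time, one per concurrent sub-task.
    float *in = nullptr, *tmpA = nullptr, *tmpB = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&tmpA, bytes);
    cudaMalloc(&tmpB, bytes);
    cudaMemcpy(in, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    // One launch (one launch overhead) instead of two.
    superKernel<<<blocksA + blocksB, threads>>>(in, tmpA, tmpB, n, blocksA);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hostA.data(), tmpA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hostB.data(), tmpB, bytes, cudaMemcpyDeviceToHost);

    // Compare against the same operations computed on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hostA[i] != 2.0f * hostIn[i] || hostB[i] != hostIn[i] + 1.0f)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(in);
    cudaFree(tmpA);
    cudaFree(tmpB);
    return mismatches == 0 ? 0 : 1;
}
```

In a real code base the sub-tasks would likely be templates, and every combination gathered into a super kernel would be one more instantiation to compile, which is where the compile-time concern comes from.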