I didn’t say that. I’m trying to help you explain the behavior you witnessed. The GPU block scheduler chooses work according to its own unpublished heuristic, and as far as I know there is no requirement or specification that it pick new blocks from any particular kernel when multiple kernel launches are outstanding. So the behavior you observed is plausible and, in my experience, typical. We could presume that the GPU designers generally want to create a machine that delivers high performance in many scenarios, but extrapolating from that very general statement to a specific claim about a specific but loosely defined workflow is not sensible, in my opinion.
Not sure what you mean. In any event, I can’t make definitive statements or prove anything about a workload that you haven’t defined beyond saying it consists of a GEMM kernel and another kernel, with no definition of “performance”. Anyway, I don’t think I would be able to respond to further questions about this.
You could investigate:
Stream priorities - this essentially gives you some control over block scheduler behavior
CUDA MPS (That would require you to launch the work in two separate processes, and to use the active thread percentage feature - probably not very convenient.)
I do not know if or how well either of the above mechanisms would work for your case - I have not tried them with your case.
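For the stream priorities suggestion, here is a minimal sketch of what the setup might look like. The two kernels (`background_kernel`, `urgent_kernel`) and the problem size are hypothetical stand-ins, not your workload. As I understand it, the priority is only a hint to the block scheduler: it influences which pending blocks get picked when resources free up, and it does not preempt blocks that are already running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two workloads; the names are illustrative only.
__global__ void background_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void urgent_kernel(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

// Bail out of main() with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    // Query the valid priority range. Numerically lower means higher priority,
    // so "greatest" is the smaller number. If both values are equal, the
    // device does not distinguish priorities.
    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));
    printf("priority range: least=%d greatest=%d\n", least, greatest);

    cudaStream_t low_stream, high_stream;
    CHECK(cudaStreamCreateWithPriority(&low_stream, cudaStreamNonBlocking, least));
    CHECK(cudaStreamCreateWithPriority(&high_stream, cudaStreamNonBlocking, greatest));

    const int n = 1 << 24;  // arbitrary size, large enough to fill the GPU
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *x = nullptr, *y = nullptr;
    CHECK(cudaMalloc(&x, n * sizeof(float)));
    CHECK(cudaMalloc(&y, n * sizeof(float)));
    CHECK(cudaMemset(x, 0, n * sizeof(float)));
    CHECK(cudaMemset(y, 0, n * sizeof(float)));

    // Launch the low-priority work first, then the high-priority work.
    // Pending blocks from high_stream should be preferred as SM resources
    // free up, but blocks already resident are not evicted.
    background_kernel<<<blocks, threads, 0, low_stream>>>(x, n);
    urgent_kernel<<<blocks, threads, 0, high_stream>>>(y, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(x));
    CHECK(cudaFree(y));
    CHECK(cudaStreamDestroy(low_stream));
    CHECK(cudaStreamDestroy(high_stream));
    return 0;
}
```

To apply this to a library GEMM, you would bind the cuBLAS handle to one of the streams with `cublasSetStream` before issuing the call. Whether that actually changes the interleaving you observed is something you would have to measure with a profiler.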
There isn’t anything I know of that pertains to the GEMM function call itself that allows you to arbitrarily limit the resources it will use. You have access to the BLAS reference manual, so I’m pretty sure you can discover this for yourself with a bit of research. You may find something there that serves your purpose that I hadn’t considered.
A BLAS GEMM operation can be decomposed into a set of smaller operations. You could do that manually, yourself, if you wished. In so doing, by reducing each operation to a very small level, you could eventually end up with small kernels that don’t saturate the GPU (albeit having to launch potentially many of them). In such a way you could “indirectly limit” the resources used. Such an approach strikes me as foolishness/madness, but I mention it because it is perhaps another possible way to limit the resource consumption of the BLAS operation.
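To make the decomposition idea concrete, here is a host-only sketch showing that a GEMM over the full output is equivalent to a set of independent GEMMs over output tiles. The function names (`gemm_block`, `gemm_full`, `gemm_tiled`) and the row-major layout are my own assumptions for illustration; this is not how any BLAS library is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Compute one output tile: C[r0:r1, c0:c1] += A[r0:r1, :] * B[:, c0:c1].
// All matrices are row-major; A is m x k, B is k x n, C is m x n.
void gemm_block(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t k, std::size_t n,
                std::size_t r0, std::size_t r1, std::size_t c0, std::size_t c1) {
    for (std::size_t i = r0; i < r1; ++i) {
        for (std::size_t j = c0; j < c1; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] += acc;
        }
    }
}

// The "one big operation" version: a single call covering all of C.
void gemm_full(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C, std::size_t m, std::size_t k, std::size_t n) {
    gemm_block(A, B, C, k, n, 0, m, 0, n);
}

// The decomposed version: one small, independent call per output tile.
// Each call here corresponds to what would be a separate small GEMM launch.
void gemm_tiled(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t m, std::size_t k, std::size_t n,
                std::size_t tile) {
    for (std::size_t r0 = 0; r0 < m; r0 += tile) {
        for (std::size_t c0 = 0; c0 < n; c0 += tile) {
            gemm_block(A, B, C, k, n,
                       r0, std::min(r0 + tile, m),
                       c0, std::min(c0 + tile, n));
        }
    }
}
```

On the GPU, each tile would become its own GEMM call on a submatrix, addressed with pointer offsets and the leading-dimension arguments. The smaller the tile, the fewer blocks each launch needs, and the more launches you pay for.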