I ran into this question in the past, when I could work around it, but it has come up again. It seems that with CUDA-accelerated numerical libraries I have no control over the number of thread blocks launched or the number of SMs used. Is this true? How can I restrict execution to part of the GPU's resources? I ask because my research involves corunning GPU programs. Running other applications in the background produces some odd performance effects with cuSPARSE SpMV kernels, and I want to at least try to control the resource allocation to isolate those effects.
I’m aware that the CUDA libraries are tuned to use all the SMs to deliver as much throughput as possible, and it seems I have no way to control how many SMs or blocks are used. Am I wrong? An explicit way to do it (like binding or setting threads in OpenMP) would greatly help my use case, but after a lot of digging that does not appear to exist. Is there at least an implicit way to do this?
If not, is my next best option to try CUDA streams? The issue with CUDA streams is that they require refactoring my code so that the corunning applications live in the same program that calls the cuSPARSE API. I am not sure this would actually help, so I would like some confirmation before committing to it.
In the past, I wanted to do distributed blocked GEMM on a DGX-1 multi-GPU setup by partitioning the matrices into subblocks and mapping one subblock per SM. I partially solved this using CUDA streams, but I think I still hit a bottleneck, because the work also included NVLink/NCCL communication, which I believe consisted of blocking calls at the time. I was using cuBLAS, and from what I remember of profiling with Nsight Systems, I cannot say whether execution was really blocked by the NCCL calls or whether the CUDA streams simply did not run concurrently on a GPU. I worked around this by using one CUDA stream per GPU rather than one CUDA stream per SM.
I looked into this, but unfortunately MPS is not supported on the platform I am evaluating: NVIDIA AGX Xavier. Are CUDA streams the best approach then? If so, does that mean there is no explicit way to partition resources like the one you mentioned with MPS?
Maybe ask your question on the AGX Xavier forum. I can only work with the information you provide in your question, and you had not mentioned AGX Xavier until now.
I don’t know what the best approach is, because I don’t know what problem you are trying to solve. If your objective is to partition GPU resources, streams don’t allow you to do that, AFAIK.
I don’t know of a way to do it in general without using MPS. Let me repeat myself:
If you had exactly one kernel to run in conjunction with a library kernel, then simply launching your kernel first (using streams, appropriately) before the library kernel should allow them to “corun”, assuming there are available resources after launching your kernel. But in the general case there is no straightforward extension of this idea.
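The "launch your kernel first" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `background_kernel`, its spin duration, and the choice of half the SMs are hypothetical, and the actual cuSPARSE call is left as a comment. Limiting the background kernel's grid size does not pin its blocks to particular SMs; the hardware block scheduler still decides placement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cusparse.h>

// Hypothetical background kernel: each block spins for roughly `cycles`
// clock ticks, then records that it finished.
__global__ void background_kernel(long long cycles, int *finished_blocks)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait to occupy the SM */ }
    if (threadIdx.x == 0) atomicAdd(finished_blocks, 1);
}

int main()
{
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    // Two non-blocking streams so neither synchronizes with the default stream.
    cudaStream_t bg_stream, lib_stream;
    cudaStreamCreateWithFlags(&bg_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&lib_stream, cudaStreamNonBlocking);

    // Route all subsequent cuSPARSE work on this handle to lib_stream.
    cusparseHandle_t handle;
    cusparseCreate(&handle);
    cusparseSetStream(handle, lib_stream);

    int *finished_blocks = nullptr;
    cudaMalloc(&finished_blocks, sizeof(int));
    cudaMemsetAsync(finished_blocks, 0, sizeof(int), bg_stream);

    // Assumption: occupy about half the SMs with one block each.
    int bg_blocks = num_sms / 2;
    if (bg_blocks < 1) bg_blocks = 1;

    // 1. Launch the background kernel first.
    background_kernel<<<bg_blocks, 128, 0, bg_stream>>>(100000000LL, finished_blocks);

    // 2. Issue the library call on the other stream; it can only use
    //    whatever resources the background kernel left free.
    //    (cuSPARSE SpMV call on `handle` goes here.)

    cudaStreamSynchronize(lib_stream);
    cudaStreamSynchronize(bg_stream);

    int h_finished = 0;
    cudaMemcpy(&h_finished, finished_blocks, sizeof(int), cudaMemcpyDeviceToHost);
    printf("SMs: %d, background blocks launched: %d, finished: %d\n",
           num_sms, bg_blocks, h_finished);

    cudaFree(finished_blocks);
    cusparseDestroy(handle);
    cudaStreamDestroy(bg_stream);
    cudaStreamDestroy(lib_stream);
    return 0;
}
```

Whether the two kernels actually overlap depends on the resources each one needs, so a profiler timeline is the only reliable confirmation.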
I will ask the AGX-related question in the appropriate forum. From what I understand, my options with the math libraries are limited because I am on the AGX platform. My main question was whether the math libraries can be restricted to a subset of the GPU's resources, and it seems that support for this is very limited at this time.
My use case was to run scaling experiments (per SM) while corunning other GPU kernels, to see the effects of contention.
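For kernels you write yourself (not library kernels such as cuSPARSE's, whose launch configuration is internal), a per-SM scaling experiment might look something like the sketch below. It assumes a grid-stride SAXPY kernel, here called `saxpy_limited`, as a stand-in workload, and reads the `%smid` PTX special register to report which SMs the blocks actually ran on. The block count is only an upper bound on the number of SMs used, since placement is up to the hardware scheduler.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the ID of the SM the calling thread is running on (PTX %smid).
__device__ unsigned int current_sm_id()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical workload: grid-stride SAXPY, correct for any grid size,
// so the number of blocks can be varied freely.
__global__ void saxpy_limited(int n, float a, const float *x, float *y,
                              unsigned int *block_sm)
{
    if (threadIdx.x == 0) block_sm[blockIdx.x] = current_sm_id();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 22;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    unsigned int *d_block_sm = nullptr;
    cudaMalloc(&d_block_sm, num_sms * sizeof(unsigned int));
    std::vector<unsigned int> h_block_sm(num_sms);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep the grid size from 1 block up to one block per SM.
    for (int blocks = 1; blocks <= num_sms; ++blocks) {
        cudaEventRecord(start);
        saxpy_limited<<<blocks, 256>>>(n, 2.0f, x, y, d_block_sm);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        cudaMemcpy(h_block_sm.data(), d_block_sm,
                   blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        std::set<unsigned int> distinct(h_block_sm.begin(),
                                        h_block_sm.begin() + blocks);
        printf("blocks=%d distinct_SMs=%zu time=%.3f ms\n",
               blocks, distinct.size(), ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_block_sm);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Running the same sweep with and without a corunning kernel would give a rough picture of contention, though only for code whose launch configuration you control.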