cublasSetSmCountTarget

Hi,

I am trying to reduce the number of SMs used for a DGEMM (cublasGemmEx with CUDA_R_64F) operation like this:

cublasStatus_t st = cublasSetSmCountTarget(cublas_handle(), 8);

I am also checking the returned status, and it is successful:

CUBLAS_STATUS_SUCCESS

But when I capture a trace, I see that the grid size and block size are not changing, and the execution time is the same, which suggests that the number of SMs used has not changed. What am I doing wrong?

CUDA version: 11.4, V100S GPU.

Running a simple test case on CUDA 12.1 with a GTX 1660 SUPER (22 SMs), I see reasonable behavior:

$ cat t40.cu
#include <cublas_v2.h>
#include <iostream>

using mt = double;
int main(){

  int s = 512;

  cublasHandle_t h;
  cublasCreate(&h);
  cublasStatus_t stat;
#ifdef USE_SET
  // hint cuBLAS to optimize its launch heuristics for 8 SMs (not a hard limit)
  stat = cublasSetSmCountTarget(h, 8);
  if (stat != CUBLAS_STATUS_SUCCESS) std::cout << "CUBLAS SET SM ERROR: " << (int)stat << std::endl;
#endif
  mt *A, *B, *C;
  // device buffers left uninitialized; only the launch configuration is of interest
  cudaMalloc(&A, s*s*sizeof(mt));
  cudaMalloc(&B, s*s*sizeof(mt));
  cudaMalloc(&C, s*s*sizeof(mt));
  double alpha = 1.0;
  double beta  = 0.0;
  stat = cublasGemmEx(h,
                           CUBLAS_OP_N,
                           CUBLAS_OP_N,
                           s,
                           s,
                           s,
                           &alpha,
                           A,
                           CUDA_R_64F,
                           s,
                           B,
                           CUDA_R_64F,
                           s,
                           &beta,
                           C,
                           CUDA_R_64F,
                           s,
                           CUBLAS_COMPUTE_64F,
                           CUBLAS_GEMM_DEFAULT);
  cudaError_t err = cudaDeviceSynchronize();
  std::cout << (int)stat << "," << (int)err << std::endl;
}
$ nvcc -o t40 t40.cu -lcublas
$ nsys nvprof --print-gpu-trace ./t40
WARNING: t40 and any of its children processes will be profiled.

0,0
Generating '/tmp/nsys-report-e4b9.qdstrm'
[1/3] [========================100%] report10.nsys-rep
[2/3] [========================100%] report10.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd               Device                Ctx  Strm          Name
 -----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  ---------------------------------  ---  ----  ---------------------
 792,242,645            672   2,122                                                                                0.000            190.476  Device              NVIDIA GeForce GTX 1660 SUPER (0)    1     7  [CUDA memset]
 792,286,773      2,083,237   2,124     4     8     2   128     1     1      234         0.025         0.000                                                     NVIDIA GeForce GTX 1660 SUPER (0)    1     7  volta_dgemm_128x64_nn

Generated:
    /home/bob/bobc/misc/report10.nsys-rep
    /home/bob/bobc/misc/report10.sqlite
$ nvcc -o t40 t40.cu -lcublas -DUSE_SET
$ nsys nvprof --print-gpu-trace ./t40
WARNING: t40 and any of its children processes will be profiled.

0,0
Generating '/tmp/nsys-report-90a4.qdstrm'
[1/3] [========================100%] report11.nsys-rep
[2/3] [========================100%] report11.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd               Device                Ctx  Strm          Name
 -----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  ---------------------------------  ---  ----  --------------------
 795,244,573      2,074,086   2,121     8     8     1    64     1     1      242         0.017         0.000                                                     NVIDIA GeForce GTX 1660 SUPER (0)    1     7  volta_dgemm_64x64_nn

Generated:
    /home/bob/bobc/misc/report11.nsys-rep
    /home/bob/bobc/misc/report11.sqlite
$

Without the SM count target, CUBLAS chooses volta_dgemm_128x64_nn, launched as a grid of 4x8x2 blocks with 128 threads per block. With the SM count target, CUBLAS chooses volta_dgemm_64x64_nn, launched as a grid of 8x8x1 blocks with 64 threads per block. So the API does appear to have an effect in this particular case.

However, you may have a misunderstanding of what it does. It does not “limit” CUBLAS to use only 8 SMs (or whatever target you specify) regardless of what else is happening on the GPU. Instead, it causes the library to choose a grid configuration that gives the best behavior when only that many SMs are available. I suggest reading the description carefully:

This option can be used to improve the library performance when cuBLAS routines are known to run concurrently with other work on different CUDA streams. E.g., an NVIDIA A100 GPU has 108 SMs and there is a concurrent kernel running with a grid size of 8; one can use cublasSetSmCountTarget with value 100 to override the library heuristics to optimize for running on 100 multiprocessors.

Nowhere does it say it will limit its SM usage. It simply states that this will attempt to optimize performance in a situation where the SM count may be limited because another kernel is expected to be running, preventing the CUBLAS call from having access to all SMs.

And if CUBLAS decides that its default grid configuration is already optimal for the case where it has access to fewer SMs, then there may be no difference in usage with or without the target setting.
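
To illustrate the intended use case, here is a minimal sketch of how the hint might be applied when the GEMM is expected to share the GPU with another kernel on a different stream. The busy_kernel, its grid size of 8, and the helper function are my own illustrative assumptions, not something the thread or the cuBLAS documentation prescribes:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// hypothetical long-running kernel that keeps roughly 8 SMs busy on its own stream
__global__ void busy_kernel(float *buf, int iters){
  float v = buf[threadIdx.x];
  for (int i = 0; i < iters; ++i) v = v * 1.0000001f + 1.0f;
  buf[threadIdx.x] = v;
}

void gemm_with_sm_target(cublasHandle_t h, const double *A, const double *B,
                         double *C, int n, float *scratch){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  cudaStream_t s_busy, s_gemm;
  cudaStreamCreate(&s_busy);
  cudaStreamCreate(&s_gemm);

  // launch the concurrent work: 8 blocks, roughly one per SM
  busy_kernel<<<8, 256, 0, s_busy>>>(scratch, 1 << 24);

  // hint cuBLAS to optimize for the SMs that remain available
  cublasSetSmCountTarget(h, prop.multiProcessorCount - 8);
  cublasSetStream(h, s_gemm);

  const double alpha = 1.0, beta = 0.0;
  cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);

  cudaStreamSynchronize(s_gemm);
  cudaStreamSynchronize(s_busy);
  cudaStreamDestroy(s_busy);
  cudaStreamDestroy(s_gemm);
}

This mirrors the A100 example from the documentation: with 108 SMs and a concurrent kernel occupying 8 of them, the target would be 100.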

Thank you for the detailed answer.

It seems that this function was introduced in CUDA 11.3.1, but in 11.4 it appears to have disappeared.

Agreed, sorry about the confusion; I have edited my answer to remove the incorrect statement.

Is it possible to update just the cuBLAS part to get this support, and leave the rest of CUDA at the old version?

I wouldn’t expect that to work. The CUBLAS library depends on the CUDA runtime library, i.e. “the rest of CUDA”. Some people try this kind of trickery by renaming libraries or creating symlinks, but in my view that is just asking for trouble elsewhere. I don’t have a recipe or any suggestions for you.
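
For reference, one way to see which cuBLAS and CUDA runtime versions a binary actually picks up at run time is to query them directly; a small sketch (not from the original discussion):

#include <cublas_v2.h>
#include <cuda_runtime_api.h>
#include <iostream>

int main(){
  cublasHandle_t h;
  cublasCreate(&h);
  int cublas_ver = 0, rt_ver = 0;
  cublasGetVersion(h, &cublas_ver);  // cuBLAS library version as an integer
  cudaRuntimeGetVersion(&rt_ver);    // runtime version, e.g. 11040 for CUDA 11.4
  std::cout << "cuBLAS " << cublas_ver << ", CUDA runtime " << rt_ver << std::endl;
  cublasDestroy(h);
}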