Running a simple test case on CUDA 12.1, GTX 1660 Super (22 SMs), I see reasonable behavior:
$ cat t40.cu
#include <cublas_v2.h>
#include <iostream>
using mt = double;
int main(){
  int s = 512;
  cublasHandle_t h;
  cublasCreate(&h);
  cublasStatus_t stat;
#ifdef USE_SET
  // ask cuBLAS to optimize its heuristics for a target of 8 SMs
  stat = cublasSetSmCountTarget(h, 8);
  if (stat != CUBLAS_STATUS_SUCCESS) std::cout << "CUBLAS SET SM ERROR: " << (int)stat << std::endl;
#endif
  mt *A, *B, *C;
  cudaMalloc(&A, s*s*sizeof(mt));
  cudaMalloc(&B, s*s*sizeof(mt));
  cudaMalloc(&C, s*s*sizeof(mt));
  double alpha = 1.0;
  double beta = 0.0;
  // C = alpha*A*B + beta*C, 512x512 FP64 GEMM
  stat = cublasGemmEx(h,
                      CUBLAS_OP_N,
                      CUBLAS_OP_N,
                      s,
                      s,
                      s,
                      &alpha,
                      A,
                      CUDA_R_64F,
                      s,
                      B,
                      CUDA_R_64F,
                      s,
                      &beta,
                      C,
                      CUDA_R_64F,
                      s,
                      CUBLAS_COMPUTE_64F,
                      CUBLAS_GEMM_DEFAULT);
  cudaError_t err = cudaDeviceSynchronize();
  std::cout << (int)stat << "," << (int)err << std::endl;
}
$ nvcc -o t40 t40.cu -lcublas
$ nsys nvprof --print-gpu-trace ./t40
WARNING: t40 and any of its children processes will be profiled.
0,0
Generating '/tmp/nsys-report-e4b9.qdstrm'
[1/3] [========================100%] report10.nsys-rep
[2/3] [========================100%] report10.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
----------- ------------- ------ ---- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- --------------------------------- --- ---- ---------------------
792,242,645 672 2,122 0.000 190.476 Device NVIDIA GeForce GTX 1660 SUPER (0) 1 7 [CUDA memset]
792,286,773 2,083,237 2,124 4 8 2 128 1 1 234 0.025 0.000 NVIDIA GeForce GTX 1660 SUPER (0) 1 7 volta_dgemm_128x64_nn
Generated:
/home/bob/bobc/misc/report10.nsys-rep
/home/bob/bobc/misc/report10.sqlite
$ nvcc -o t40 t40.cu -lcublas -DUSE_SET
$ nsys nvprof --print-gpu-trace ./t40
WARNING: t40 and any of its children processes will be profiled.
0,0
Generating '/tmp/nsys-report-90a4.qdstrm'
[1/3] [========================100%] report11.nsys-rep
[2/3] [========================100%] report11.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
----------- ------------- ------ ---- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- --------------------------------- --- ---- --------------------
795,244,573 2,074,086 2,121 8 8 1 64 1 1 242 0.017 0.000 NVIDIA GeForce GTX 1660 SUPER (0) 1 7 volta_dgemm_64x64_nn
Generated:
/home/bob/bobc/misc/report11.nsys-rep
/home/bob/bobc/misc/report11.sqlite
$
Without the SM count target, CUBLAS launches volta_dgemm_128x64_nn with a 4x8x2 grid of blocks, each block having 128 threads. With the SM count target, CUBLAS instead launches volta_dgemm_64x64_nn with an 8x8x1 grid, each block having 64 threads. Both grids contain 64 blocks, but the block size (and therefore the total thread count) differs, so the API does appear to have an effect in this particular case.
However, you may have a misunderstanding of what it does. It does not “limit” CUBLAS to only use 8 SMs (or whatever target you specify), regardless of what else is happening on the GPU. Instead, it adjusts the grid configuration so that the call behaves well if only that many SMs are actually available. I suggest reading the description carefully:
This option can be used to improve the library performance when cuBLAS routines are known to run concurrently with other work on different CUDA streams. E.g. an NVIDIA A100 GPU has 108 SMs and there is a concurrent kernel running with a grid size of 8; one can use cublasSetSmCountTarget with value 100 to override the library heuristics to optimize for running on 100 multiprocessors.
Nowhere does it say it will limit its SM usage. It simply states that this will attempt to optimize performance in a situation where the SM count may be limited because another kernel is expected to be running, preventing the CUBLAS call from having access to all SMs.
And if CUBLAS decides that its default grid configuration is already optimal for the reduced SM count, then there may be no difference in the launch configuration with or without the target setting.
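For context, here is a minimal sketch (my own, not from the cuBLAS documentation) of the situation the quoted description is talking about: a long-running kernel on one stream keeps some SMs busy while a cuBLAS GEMM runs on another stream, and the SM count target is set to the SMs expected to remain free. The busy_kernel, the stream setup, and the choice of 8 occupied SMs are illustrative assumptions, not part of the test case above.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// spins for roughly 'cycles' clock cycles to keep its SMs busy (illustrative only)
__global__ void busy_kernel(long long cycles){
  long long start = clock64();
  while (clock64() - start < cycles) { }
}

int main(){
  int num_sms = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // occupy roughly 8 SMs with a long-running concurrent kernel (one small block each)
  const int occupied = 8;
  busy_kernel<<<occupied, 32, 0, s1>>>(1000000000LL);

  cublasHandle_t h;
  cublasCreate(&h);
  cublasSetStream(h, s2);
  // optimize cuBLAS heuristics for the SMs expected to remain available
  cublasSetSmCountTarget(h, num_sms - occupied);

  // ... issue the cublasGemmEx call from the test case above, on stream s2 ...

  cudaDeviceSynchronize();
  cublasDestroy(h);
  return 0;
}

Whether a small block like the one in busy_kernel actually ties up a whole SM depends on the block scheduler, so the target value is only a hint to the library heuristics, consistent with the point above: it is an optimization target, not a limit.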