I’m a newbie, and I’m using a GTX 1650, which has 4 GB of GPU memory.
I tried changing the thread count with this code, choosing N so that the data fits in GPU memory:
float *x, *y, *d_x, *d_y;
long long int N = 122064896;
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
...
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
Running /opt/cuda/extras/compute-sanitizer/compute-sanitizer --tool memcheck cudaSaxpy.o reports no errors.
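For context, here is a minimal complete version of that program. This is only a sketch: the kernel body and the host-side setup are my assumptions, since the original post shows just fragments.

```cuda
#include <cuda_runtime.h>

// Each thread computes one element; the bounds check guards the
// last block, which may be only partially filled.
__global__ void saxpy(long long int n, float a, const float *x, float *y)
{
    long long int i = blockIdx.x * (long long int)blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    long long int N = 122064896;
    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    // ... initialize d_x and d_y (e.g. cudaMemcpy from host arrays) ...

    // (N+255)/256 blocks of 256 threads: enough threads to cover all N elements.
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```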
However, this N seems to exceed the maximum resident thread count, 1024 × 14 = 14336:
$ /opt/cuda/extras/demo_suite/deviceQuery
(14) Multiprocessors, ( 64) CUDA Cores/MP: 896 CUDA Cores
Maximum number of threads per multiprocessor: 1024
Q:
Does the GPU help the programmer by initially running only 14336 threads and scheduling spare threads as earlier ones finish their work, rather than running all 122064896 threads at once?
I also noticed that the Compute Capability table shows the same “Maximum x-dimension of a grid of thread blocks” across all “Technical Specifications”. Does that mean it is only a theoretical maximum, and that this maximum may not actually be reachable on every specification version?
To put it more specifically: my GPU can’t execute $2^{31}-1$ threads concurrently, because that exceeds the number of physical threads. Is that the case?
In CUDA programming it is entirely normal to use grids that comprise more threads than can physically execute concurrently on a particular GPU. The CUDA programming model makes no guarantee about the order in which the threads of a grid execute on the hardware, only that they will all get to run eventually.
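One way to see this in code is the common grid-stride loop pattern, where you deliberately launch far fewer threads than elements and each thread processes several. This is a sketch, assuming the same SAXPY signature as above; the hardware schedules blocks onto SMs as resources free up either way.

```cuda
// Grid-stride variant: the grid can be sized to the machine
// rather than to the data, and correctness is unaffected.
__global__ void saxpy_gridstride(long long int n, float a,
                                 const float *x, float *y)
{
    long long int stride = (long long int)gridDim.x * blockDim.x;
    for (long long int i = blockIdx.x * (long long int)blockDim.x + threadIdx.x;
         i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Hypothetical launch: a few blocks per SM instead of one thread per element.
// saxpy_gridstride<<<14 * 8, 256>>>(N, 2.0f, d_x, d_y);
```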
I have read your second question twice and so far have not been able to work out what it is asking.
Thanks for the quick answer. Maybe I didn’t put it clearly (I have edited my question after reading your answer). The two questions are similar, and you have answered them both.
Yes, the theoretical maximum here is the number of SMs times the maximum number of threads per SM.
You can look up the max number of threads per SM for your compute capability in the programming guide, or just run deviceQuery. deviceQuery will also tell you how many SMs are in your GPU.
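Both numbers are also available programmatically from the runtime API; a small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    // Theoretical maximum number of concurrently resident threads:
    printf("Max resident threads: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

On a GTX 1650 this should report 14 SMs and 1024 threads per SM, i.e. the 14336 figure from deviceQuery.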
This question relates to the general topic of CUDA occupancy.
Kernel design (resource utilization) can lower the maximum number of threads that an SM can “hold”.
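For example, the runtime’s occupancy API reports how many blocks of a particular kernel can actually be resident per SM, given that kernel’s register and shared-memory usage. A sketch, reusing the SAXPY kernel from above (the kernel itself is an assumption here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(long long int n, float a, const float *x, float *y)
{
    long long int i = blockIdx.x * (long long int)blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int numBlocks = 0;
    // How many 256-thread blocks of this kernel fit on one SM,
    // given its resource usage (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpy, 256, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Resident blocks/SM: %d -> %d threads/SM (hardware limit: %d)\n",
           numBlocks, numBlocks * 256, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

If the achieved threads/SM is below the hardware limit, the kernel’s resource usage (registers, shared memory) is what is lowering occupancy.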