How to count CUDA cores with Numba?

I tried it with (MAX_THREADS_PER_MULTI_PROCESSOR / WARP_SIZE) * (MAX_THREADS_PER_MULTI_PROCESSOR / MAX_THREADS_PER_BLOCK) * MULTIPROCESSOR_COUNT

On an MX150 this outputs 384 CUDA cores.

It works, but I don't know whether it is right or not. What do you think?
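For reference, this is roughly what I am doing in Numba (a minimal sketch, assuming a CUDA-capable GPU and a Numba build with CUDA support; the exact attribute spellings might differ between Numba versions):

```python
# Minimal sketch, assuming a CUDA-capable GPU and Numba with CUDA support.
# The attribute names mirror the CUDA device attributes in the formula above;
# exact spellings may differ between Numba versions.
from numba import cuda

dev = cuda.get_current_device()

max_threads_per_sm = dev.MAX_THREADS_PER_MULTI_PROCESSOR
warp_size = dev.WARP_SIZE
max_threads_per_block = dev.MAX_THREADS_PER_BLOCK
sm_count = dev.MULTIPROCESSOR_COUNT

# The formula above, computed from the queried attributes.
estimate = (max_threads_per_sm // warp_size) * \
           (max_threads_per_sm // max_threads_per_block) * sm_count
print(estimate)  # 384 on an MX150
```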

None of the things you are listing have anything to do with CUDA cores per SM.

CUDA cores per SM * MULTIPROCESSOR_COUNT would be the correct formula

Here is one possible method.
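Not necessarily that exact method, but a rough sketch of what the usual dict-based approach looks like: a small table of CUDA cores per SM keyed by compute capability, multiplied by MULTIPROCESSOR_COUNT. The per-SM values below are the commonly published per-architecture numbers; verify them against NVIDIA's documentation and extend the table for newer GPUs.

```python
# Rough sketch of a dict-based CUDA core count (values should be checked
# against NVIDIA documentation and extended for newer architectures).
from numba import cuda

CORES_PER_SM = {
    (3, 0): 192, (3, 5): 192, (3, 7): 192,   # Kepler
    (5, 0): 128, (5, 2): 128, (5, 3): 128,   # Maxwell
    (6, 0): 64,  (6, 1): 128, (6, 2): 128,   # Pascal
    (7, 0): 64,  (7, 2): 64,  (7, 5): 64,    # Volta / Turing
    (8, 0): 64,  (8, 6): 128, (8, 7): 128,   # Ampere
}

def cuda_core_count():
    dev = cuda.get_current_device()
    cc = dev.compute_capability          # e.g. (6, 1) for an MX150
    sm_count = dev.MULTIPROCESSOR_COUNT
    cores_per_sm = CORES_PER_SM.get(cc)
    if cores_per_sm is None:
        raise ValueError(f"Unknown compute capability {cc}; extend CORES_PER_SM")
    return cores_per_sm * sm_count

print(cuda_core_count())   # 128 * 3 = 384 on an MX150 (cc 6.1, 3 SMs)
```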

Thanks a lot. But to find the CUDA cores per SM, most examples (e.g. on Stack Overflow) use a data dict keyed by SM version, which doesn't help for an unknown version. Numba doesn't expose the CUDA cores per SM, only the thread-related attributes. I'm very enthusiastic about learning CUDA from scratch, so I apologize for taking up your time.

As far as I know something like a dict is necessary, because this information (CUDA cores per SM) is an architectural item that is not retrievable programmatically.

Your formula won’t work in the general case.

(MAX_THREADS_PER_MULTI_PROCESSOR / WARP_SIZE) is just the max warps per multiprocessor. This has no connection from a design perspective to the number of CUDA cores per multiprocessor, and for most architectures (except Turing and certain Ampere family members) will be 64.

(MAX_THREADS_PER_MULTI_PROCESSOR / MAX_THREADS_PER_BLOCK) also has no connection to the number of CUDA cores per multiprocessor. The max threads per block for all architectures is 1024.

Let’s apply your formula to another GPU and see if it works. Tesla V100 has 80 multiprocessors, and 5120 CUDA cores. For Tesla V100 your formula looks like this:

(MAX_THREADS_PER_MULTI_PROCESSOR / WARP_SIZE) * (MAX_THREADS_PER_MULTI_PROCESSOR / MAX_THREADS_PER_BLOCK) * MULTIPROCESSOR_COUNT

(2048 / 32) * (2048 / 1024) * 80 = 64 * 2 * 80 = 10240
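(For comparison, a cores-per-SM lookup like the one sketched earlier gives 64 for compute capability 7.0, and 64 * 80 = 5120, which matches the actual count.)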

So your formula does not give the correct answer for Tesla V100. I don’t think you will find a formula that works only with programmatically retrievable data, and is correct for all GPUs from compute capability 3.5 - 8.6, without using something like a dict. Yes, because of this, the dict has to be updated when new architectures become available.

Note that an intense focus on total number of CUDA cores (or CUDA cores per multiprocessor) is in my opinion not very sensible for a CUDA programmer. It’s not a top-level important concept for the CUDA programmer to take into account when designing their codes. You may disagree of course, that’s just my opinion. In any event I believe that is the reason that the information is not programmatically retrievable. It’s not considered to be that important for the programmer from a program design perspective.


It was really stupid of me to ask this question. I'm sorry, and I will follow your advice.