What is an effective way to find out this information?
I understand that cores should not be treated equally on CPUs vs GPUs.
If a warp can have 32 threads running concurrently, how many warps are running concurrently on my GeForce GTX 1080 Ti?
Understanding this would help me better design my application.
You have a number of active threads that the physical “GPU cores” are context switching between.
The number of active threads will depend on their resource requirements (registers, shared memory) or on the upper limits specified by your particular GPU's compute capability (e.g. max 1024 threads per SM, and then you have N SMs on your GPU).
The number of threads executing each clock cycle should be equal to the total number of FPUs/SPs/"CUDA cores" on your device (3584 on your card), so #warps = NbCores / 32.
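As a back-of-the-envelope sketch of the arithmetic above (the core count and warp size are the values for a GTX 1080 Ti as discussed in this thread):

```python
# Rough warp arithmetic for a GTX 1080 Ti.
CUDA_CORES = 3584  # "CUDA cores" (SPs) on a GTX 1080 Ti
WARP_SIZE = 32     # threads per warp on current NVIDIA GPUs

# One warp's worth of lanes can issue per 32 cores per clock.
warps_per_clock = CUDA_CORES // WARP_SIZE
print(warps_per_clock)  # 112
```

Note this counts warps *issuing* in a given clock, not warps *resident* on the device, which is a separate (and larger) number governed by occupancy limits.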
1024 threads is the limit per thread block, not per SM.
Each GPU core may run up to 16 threads simultaneously. The 1080 Ti has 3584 cores, hence it may run up to 16 * 3584 threads.
I wouldn’t describe it that way. The maximum number of threads in flight is 2048 * # of SMs, for all GPUs of compute capability 3.0 and higher (but less than 7.5: Turing GPUs are limited to a maximum of 1024 threads/SM).
This is an upper bound, not necessarily achievable with every code. Some codes may have resource utilization that dictates a lower maximum instantaneous thread carrying capacity (“occupancy”).
The 1080 Ti has 28 SMs, so the maximum instantaneous threads-in-flight number is 28 * 2048 (which does happen to be the same as 16 * 3584; however, the 16 * core count methodology will not give a correct upper bound for other GPUs that do not have 128 cores/SM, including all Kepler GPUs, and also cc 6.0 and 7.0 GPUs).
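To see why the per-SM limit gives the right upper bound while 16 * core count only coincidentally matches on the 1080 Ti, it may help to compare it against a GPU without 128 cores/SM. The Tesla P100 figures below (cc 6.0, 56 SMs at 64 cores/SM) are illustrative values not stated in this thread:

```python
# Maximum resident ("in flight") threads = max_threads_per_SM * num_SMs.
# The "16 * core count" rule only matches on GPUs with 128 cores/SM.
gpus = {
    # name:         (SMs, cores_per_SM, max_threads_per_SM)
    "GTX 1080 Ti": (28, 128, 2048),  # cc 6.1: 16 * cores happens to agree
    "Tesla P100":  (56,  64, 2048),  # cc 6.0: 16 * cores undercounts by 2x
}

for name, (sms, cores_per_sm, thr_per_sm) in gpus.items():
    correct = sms * thr_per_sm
    naive = 16 * sms * cores_per_sm
    print(f"{name}: per-SM bound = {correct}, 16 * cores = {naive}")
```

Both GPUs have 3584 cores, yet only the 1080 Ti's true bound (57344) equals 16 * 3584; the P100 can hold twice as many resident threads as the naive rule predicts.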
Not on all devices: Fermi only allows 1536 threads for the whole SM, with a maximum of 1024 per block, and cc 1.x devices allow even fewer.