Achieving maximum occupancy

This might sound like a duplicate, but I have searched a lot and haven’t found the right answer.
My device (940MX) has 3 Streaming Multiprocessors and is of compute capability 5.0, which allows up to 2048 threads/SM.
If the program’s resource usage is taken care of (registers/SM, registers/block, threads/block…), the device should be able to run 3*2048 threads simultaneously, right?

Your thought process is correct.

The device should be able to support a complement of 3*2048 threads that are actively being scheduled on SMs.
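As a rough sketch of where the 3*2048 figure comes from, the two factors can be read from the device properties at runtime. The helper name `maxResidentThreads` and the choice of device index 0 are assumptions made for illustration; `multiProcessorCount` and `maxThreadsPerMultiProcessor` are fields of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on resident threads for the whole device,
// i.e. number of SMs times the per-SM thread limit.
static int maxResidentThreads(int smCount, int maxThreadsPerSM) {
    return smCount * maxThreadsPerSM;
}

int main() {
    cudaDeviceProp prop;
    // Device index 0 is an assumption; pick the index of the GPU you care about.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    int total = maxResidentThreads(prop.multiProcessorCount,
                                   prop.maxThreadsPerMultiProcessor);
    std::printf("%s: %d SMs x %d threads/SM = %d resident threads\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor, total);
    return 0;
}
```

With the numbers from the question (3 SMs, 2048 threads/SM) the helper gives 6144. This is only the hardware ceiling; whether a given kernel reaches it depends on its register, shared memory, and block size usage.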

“Run simultaneously” is wording that can lead to arguments, so I’ll avoid simply answering with yes or no.

This doesn’t mean that a program that launches more than 3*2048 threads is poorly written or is a bad idea. But a program that has at least 3*2048 threads in a kernel launch has the opportunity to saturate that particular GPU, i.e., it has the opportunity to achieve maximum occupancy on that particular GPU (for that particular kernel launch).
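Whether a particular kernel launch can actually reach that limit can be checked with the runtime occupancy API. The following is a minimal sketch, not a definitive recipe: `scaleKernel`, the `occupancy` helper, the block size of 256, and device index 0 are all assumptions chosen for illustration, while `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the CUDA runtime call that reports how many blocks of a given kernel can be resident on one SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy query has something to inspect.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fraction of the per-SM thread limit covered by resident blocks.
static double occupancy(int blocksPerSM, int blockSize, int maxThreadsPerSM) {
    return static_cast<double>(blocksPerSM * blockSize) / maxThreadsPerSM;
}

int main() {
    const int blockSize = 256;  // assumed threads per block

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;  // device 0 assumed

    // Resident blocks per SM for this kernel at this block size,
    // with no dynamic shared memory.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scaleKernel, blockSize, 0) != cudaSuccess) return 1;

    // Smallest grid that fills every SM to the limit this kernel allows.
    int saturatingBlocks = blocksPerSM * prop.multiProcessorCount;

    std::printf("blocks/SM: %d, occupancy: %.2f, blocks to fill the GPU: %d\n",
                blocksPerSM,
                occupancy(blocksPerSM, blockSize, prop.maxThreadsPerMultiProcessor),
                saturatingBlocks);
    return 0;
}
```

A launch with fewer blocks than the reported fill count leaves some thread slots unused; a launch with more simply has the extra blocks wait until resident blocks retire, which is normal and not a sign of a poorly written program.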

Thank you for your precise answer.