Shared Memory and number of Blocks invoked

I have a huge kernel that uses 15 KB of shared memory per block (each block has 64 threads).

Is it true that this limits the maximum number of blocks I can launch at once?

Simple answer: yes, because all blocks resident on a multiprocessor must be able to run concurrently, and they all draw from that multiprocessor’s shared memory. So with 15 KB per block the scheduler can only run one block per MP at a time.
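To put numbers on that, here is a rough sketch of the arithmetic in plain host-side C++. It assumes shared memory is the only limiting resource and ignores registers and thread limits. The function name is made up for illustration, and the 16 KB per multiprocessor in the usage note is my assumption for this generation of hardware, so check your own device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): how many blocks can be resident on
// one multiprocessor if shared memory is the only limiting resource.
// Registers and per-MP thread limits are deliberately ignored here.
std::size_t blocks_per_mp_by_smem(std::size_t smem_per_mp_bytes,
                                  std::size_t smem_per_block_bytes) {
    assert(smem_per_block_bytes > 0);
    // Integer division rounds down: a block either fits completely or not at all.
    return smem_per_mp_bytes / smem_per_block_bytes;
}
```

With an assumed 16 KB per MP, `blocks_per_mp_by_smem(16 * 1024, 15 * 1024)` gives 1, while a kernel using 4 KB per block would get 4.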


So I should query the number of multiprocessors available on the device, and that’s my limit of parallelization :(

NOPE. To get good parallelization you need to run many more blocks than multiprocessors. The 15 KB of shared memory your kernel uses only limits how many blocks run at once; the rest wait and get scheduled as running blocks finish. There is still a setup cost and driver overhead for each kernel launch, so by launching more blocks in one kernel you do more work for the same overhead.
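A small sketch of what that means in practice, again as plain host-side C++ with a made-up function name. It only counts how many "waves" of resident blocks a grid needs, assuming every block takes about the same time. The 16 multiprocessors in the usage note are just an example value, not something from the original question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): number of waves needed to run a grid
// when only num_mps * blocks_per_mp blocks can be resident at the same time.
std::size_t waves_needed(std::size_t grid_blocks,
                         std::size_t num_mps,
                         std::size_t blocks_per_mp) {
    const std::size_t resident = num_mps * blocks_per_mp;
    assert(resident > 0);
    // Ceiling division: a partially filled last wave still counts as a wave.
    return (grid_blocks + resident - 1) / resident;
}
```

For example, `waves_needed(1600, 16, 1)` is 100: a single launch of 1600 blocks runs as 100 waves of 16 blocks and pays the launch overhead once, instead of 100 separate launches of 16 blocks each.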

Running more than one block on a multiprocessor really only benefits the interleaving of global memory access with computation. If your kernel does only a few global memory reads and then a lot of arithmetic on that shared memory, you won’t notice any performance hit from running only one block per MP.

This simply depends on the number of threads per block you are running. It takes 192 threads to hide register-hazard latencies, and 64 threads won’t be sufficient to hide global memory latency either.

Your kernel is bound to perform poorly unless you run multiple blocks per multiprocessor OR increase your number of threads per block. Kernels usually perform great with just 64 threads per block, but in such cases you typically run 3 to 4 blocks on a multiprocessor.
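The thread-count argument can be checked with a few lines of arithmetic. This is only a sketch in plain host-side C++. The helper names are invented for illustration, and the 192-thread threshold is the rule of thumb quoted in this thread, not a value read from the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): threads resident on one multiprocessor.
std::size_t resident_threads_per_mp(std::size_t blocks_per_mp,
                                    std::size_t threads_per_block) {
    return blocks_per_mp * threads_per_block;
}

// Rule of thumb from the discussion: about 192 resident threads are needed
// to hide register-hazard latency. The threshold is a parameter so it can be
// changed for other hardware.
bool meets_latency_threshold(std::size_t resident_threads,
                             std::size_t threshold = 192) {
    assert(threshold > 0);
    return resident_threads >= threshold;
}
```

One block of 64 threads gives 64 resident threads, which is below the threshold. Three blocks of 64 threads give 192, which just meets it, and that matches the "3 to 4 blocks" figure for 64-thread kernels.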

So my hunch is that this kernel’s performance is going to be poor.