finding the best number of threads per block

Hi everyone,

I was wondering what the best number of threads per block would be when adding one
matrix to another. How do you determine this number?

Also, I heard somewhere that you can run 768 threads in a block. Are all of these threads
executed at the same time?


Old-fashioned profiling and benchmarking. There is no panacea or universally correct number; your matrix-add code and mine might be different and require different execution parameters for optimal performance. NVIDIA supplies an occupancy calculator spreadsheet you can play around with to get a feel for how the GPU will schedule your kernel under different execution parameters.
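For what it's worth, "benchmark it" can be as simple as timing the same kernel at several candidate block sizes. A minimal sketch (the kernel name, candidate sizes, and matrix size are mine, not from any official source; error checking omitted for brevity):

```cuda
#include <cstdio>

// Illustrative element-wise matrix add, matrices stored flat.
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;     // e.g. a 1024x1024 matrix
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    int candidates[] = {64, 128, 192, 256, 384, 512};
    for (int k = 0; k < 6; ++k) {
        int block = candidates[k];
        int grid  = (n + block - 1) / block;  // enough blocks to cover n

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        matAdd<<<grid, block>>>(a, b, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", block, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The fastest size on one GPU/kernel combination won't necessarily win on another, which is exactly why measuring beats guessing.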

You heard wrong. The per-block limit is 512 on all current hardware. The thread scheduling hardware attached to each multiprocessor in CUDA-capable GPUs can track a maximum of either 768 or 1024 active threads, depending on which hardware you have. Those threads can come from multiple blocks, if other resources like registers and shared memory permit. Only 32 threads are executed simultaneously on any given multiprocessor (groups of 32 are called “warps” in CUDA terminology, and the warp is the basic scheduling unit). The other active threads are either queued for execution or stalled on pending memory transactions or instruction fetches. All of this is discussed in some detail in chapters 4 and 5 of the programming guide.

Hmm, I’m going to quote the paper where I read this. So is the paper wrong? Or does each block run on a single streaming processor?

"For example, NVIDIA’s 8800 GTX GPU–the GPU used in this work–has 16 multiprocessors, each of which supports 768 concurrent execution threads. Combined, these multiprocessors allow the GPU to manage over 12,000 concurrent execution [threads]."

Later on it says:

"This one-to-one mapping of threads-to-records lets [the GPU process] large amounts of data with 12,000 concurrent parallel operations at any one time."

That quotation doesn’t contradict a word I wrote. On compute capability 1.1 hardware, the limit is 768 active/concurrent threads per multiprocessor (not per block, as you wrote), which are executed in SIMD fashion in warps of 32 threads. Up to 512 threads per block, a maximum of 8 active/concurrent blocks per multiprocessor, with a total of 8,192 registers and 16 KB of shared memory per multiprocessor. Blocks are scheduled at the multiprocessor level, threads at the streaming processor level, but in groups (“warps”).

This all comes straight from appendix A of the programming guide. I invite you to read it for yourself if you are still skeptical.