finding the best number of threads per block

afflictedd2 · January 28, 2010, 10:39pm

Hi everyone,

I was wondering what would be the best number of threads per block to use when adding
a matrix with another matrix. How do you determine this number?

Also somewhere I heard that you can run 768 threads in a block. Are all of these threads
executed at the same time?

Ted.

avidday · January 28, 2010, 11:23pm

Old-fashioned profiling and benchmarking. There is no panacea and no universally correct number. Your matrix add code and mine might be different and require different execution parameters for optimal performance. NVIDIA supplies an occupancy calculator spreadsheet you can play around with to get a feel for how the GPU will schedule your kernel for different execution parameters.

You heard wrong. The per-block limit is 512 threads on all current hardware. The thread scheduling hardware connected to each multiprocessor in CUDA-capable GPUs has a maximum of either 768 or 1024 threads, depending on which hardware you have. Those 768/1024 threads can come from multiple blocks, if other resources like registers and shared memory permit. Only 32 threads are executed simultaneously on any given multiprocessor (groups of 32 are called “warps” in CUDA terminology, and the warp is the basic scheduling unit). Other active threads are either queued for execution or stalled waiting for pending memory transactions or instruction fetches. All of this is discussed in some detail in chapters 4 and 5 of the programming guide.
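As a rough illustration of the warp arithmetic described above, here is a minimal host-side sketch in plain C++ (it needs no GPU). The helper names `warps_per_block`, `idle_lanes`, and `blocks_needed` are made up for this example and are not part of the CUDA API; the one-thread-per-element mapping for a matrix add is an assumption, not something stated in the thread.

```cpp
// Warp size on the hardware discussed in this thread.
constexpr int kWarpSize = 32;

// Warps the scheduler has to issue for one block (rounded up to whole warps).
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes left idle in the last, partially filled warp of a block.
constexpr int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// Blocks needed to cover n elements, assuming one thread per element (rounded up).
constexpr long long blocks_needed(long long n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For a hypothetical 1000 x 1000 matrix (1,000,000 elements), `blocks_needed(1000000, 256)` gives 3907 blocks of 8 warps each. A block size that is not a multiple of 32 wastes lanes: `warps_per_block(100)` is 4 and `idle_lanes(100)` is 28. This only narrows the list of candidate block sizes; picking among them still comes down to benchmarking.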

afflictedd2 · January 29, 2010, 1:11am

Hmm, I’m going to quote the paper where I read this. So is the paper wrong? Or does each block run on a single stream processor?

"For example, NVIDIA’s 8800 GTX GPU–the GPU used in this work–has 16 multiprocessors, each of which supports 768 concurrent execution threads. Combined, these multiprocessors allow the GPU to manage over 12,000 concurrent execution threads."

Later on it says:

"This one-to-one mapping of threads-to-records lets large amounts of data with 12,000 concurrent parallel operations at any one time."

avidday · January 29, 2010, 6:56am

That quotation doesn’t contradict a word I wrote. On compute capability 1.1 hardware, the limit is 768 active/concurrent threads per multiprocessor (not per block, as you wrote), which are executed in SIMD fashion in warps of 32 threads. The limits are up to 512 threads per block and a maximum of 8 active/concurrent blocks per multiprocessor, with a total of 8192 registers and 16 KB of shared memory per multiprocessor. Blocks are scheduled at the multiprocessor level, threads at the streaming processor level, but in groups (“warps”).

This all comes straight from appendix A of the programming guide. I invite you to read it for yourself if you are still skeptical.
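To see how the limits quoted above interact, here is a simplified sketch of the kind of estimate the occupancy spreadsheet makes, written as plain host-side C++. It is not NVIDIA's calculator: the names `SmLimits`, `resident_blocks`, and `resident_threads` are invented for this example, the per-thread register counts in the usage note are hypothetical, and the sketch ignores the allocation granularity that real hardware applies to registers and shared memory.

```cpp
// Per-multiprocessor limits quoted in the post above (compute capability 1.1).
struct SmLimits {
    int max_threads;     // resident threads per multiprocessor
    int max_blocks;      // resident blocks per multiprocessor
    int max_block_size;  // threads per block
    int registers;       // 32-bit registers per multiprocessor
    int shared_bytes;    // shared memory per multiprocessor, in bytes
};

constexpr SmLimits kCc11 = {768, 8, 512, 8192, 16 * 1024};

constexpr int min_int(int a, int b) { return a < b ? a : b; }

// Blocks that fit within one resource budget; a usage of 0 means "not limiting".
constexpr int cap(int budget, int per_block, int fallback) {
    return per_block > 0 ? budget / per_block : fallback;
}

// Blocks that can be resident on one multiprocessor at once (0 if the block is invalid).
constexpr int resident_blocks(SmLimits sm, int threads, int regs_per_thread,
                              int smem_per_block) {
    return (threads <= 0 || threads > sm.max_block_size)
        ? 0
        : min_int(min_int(sm.max_blocks, sm.max_threads / threads),
                  min_int(cap(sm.registers, regs_per_thread * threads, sm.max_blocks),
                          cap(sm.shared_bytes, smem_per_block, sm.max_blocks)));
}

// Threads resident on one multiprocessor for that configuration.
constexpr int resident_threads(SmLimits sm, int threads, int regs_per_thread,
                               int smem_per_block) {
    return resident_blocks(sm, threads, regs_per_thread, smem_per_block) * threads;
}
```

With a hypothetical kernel using 10 registers per thread and no shared memory, 256 threads per block gives 3 resident blocks and all 768 threads. 512 threads per block gives only 1 block (512 threads), and 64 threads per block hits the 8-block cap, which is also 512 threads. At 16 registers per thread, 256-thread blocks drop to 2 resident blocks because the register file becomes the limit.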
