Max # of blocks?

What is the maximum number of concurrently running blocks?
In the occupancy.xls, it says 8 blocks/processor and 16 processors/GPU. So is the maximum number of concurrent blocks 16 x 8 = 128?

It depends on your card and your program. The occupancy calculator tells you the number of blocks per multiprocessor for a particular kernel (8 is the maximum). Just multiply that number by the number of multiprocessors in your card to get what you want. The Tesla C870 has 16 multiprocessors; check the programming guide for the other cards’ specs.
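For example, you can read the multiprocessor count at runtime with cudaGetDeviceProperties. A minimal sketch (error checking omitted; assumes a CUDA release whose cudaDeviceProp exposes multiProcessorCount):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
        int blocksPerMP = 8;                 /* per-kernel figure from the occupancy calculator; 8 is the cap */
        printf("multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("max concurrent blocks: %d\n", blocksPerMP * prop.multiProcessorCount);
        return 0;
    }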

Paulius

Thank you. How can I verify how many blocks/threads are actually running?

The reason why I ask this is:

I have a kernel that uses 20 registers and a little shared memory. I think one multiprocessor can accommodate 2 blocks with this setting. Besides, the programming guide recommends two or more blocks per multiprocessor.

So I ask for 16 x 2 = 32 blocks to run, hoping it will run at full speed on all 16 multiprocessors. But my test results are:

  1. 32 blocks x 192 threads: 1785 ms

  2. 16 blocks x 192 threads: 937 ms

Why is the time in case 2 almost half that of case 1? If all 16 multiprocessors are running, they should be similar, right?

The CUDA profiler writes the actual occupancy to its log file. Check the CUDA profiler readme in the /doc directory.

Your second case has half as many threads as the first one. Each thread gets executed by some streaming processor, so the first case is twice as much work, which is why it takes about twice as long.

Ultimately what matters is the number of threads. I would expect the times to be quite close if you compared 32 blocks x 192 threads and 16 blocks x 384 threads. Both configurations would give you 50% occupancy on a GPU with 16 multiprocessors, assuming register and smem usage allowed that.
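For illustration, the comparison could be set up like this; incr is a hypothetical kernel and the array size is just 32 x 192 elements (error checking omitted):

    #include <cuda_runtime.h>

    __global__ void incr(float *g)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        g[i] += 1.0f;                /* read, increment, write back to gmem */
    }

    int main(void)
    {
        float *d_data;
        cudaMalloc((void **)&d_data, 6144 * sizeof(float));
        incr<<<32, 192>>>(d_data);   /* 32 blocks x 192 threads = 6144 threads */
        incr<<<16, 384>>>(d_data);   /* 16 blocks x 384 threads = 6144 threads */
        cudaThreadSynchronize();     /* make sure both launches finish before timing ends */
        cudaFree(d_data);
        return 0;
    }

Both launches do the same total work, so their times should be close if occupancy is the same.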

Paulius

My understanding was:

With 20 registers per thread, if I only run up to 192 threads per block, then 20 registers x 192 threads x 2 blocks = 7680 registers, which is less than the 8192 registers in a multiprocessor. So I thought a multiprocessor would be able to run 2 blocks concurrently. With 16 multiprocessors, a total of 32 blocks should be running concurrently. I thought this would be close to 100% occupancy?

Do you mean a processor can run one thread at a time? Does this imply that at most 128 threads can be running concurrently, since we have only 128 processors (8 processors per multiprocessor x 16 multiprocessors) in a card?

I saw the CUDA profiler, but where do I set those macros such as CUDA_PROFILE? In the program?

From the description of the CUDA profiler, which options/signals describe the number of threads/blocks running concurrently? I did see there is an option for occupancy. Thank you all,

You have to add CUDA_PROFILE as an environment variable. If you are using Windows, it is basically:

My Computer->Properties->Advanced->Environment Variables

Add a CUDA_PROFILE variable with the value 1 in both tables (user and system variables).

Or, maybe easier, launch your program from a batch file that sets:

set CUDA_PROFILE=1

You don’t want to have profiling enabled when you do timings on the complete application, as it adds some extra launch overhead.
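Putting it together, a minimal batch file might look like this (myapp.exe stands in for your executable; CUDA_PROFILE_LOG is optional and, per the profiler readme, controls where the log is written):

    @echo off
    rem enable profiling for this run only
    set CUDA_PROFILE=1
    set CUDA_PROFILE_LOG=cuda_profile.log
    myapp.exe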

A multiprocessor can have 768 threads active. So, 100% occupancy means that you in fact have 768 threads active (number of active blocks x threads per block). Therefore, 2 blocks with 192 threads give you 50% occupancy. By the way, 50% occupancy generally leads to pretty good performance.
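Working through your numbers with the G80 limits from the programming guide (8192 registers and 768 threads per multiprocessor):

    registers per block: 20 x 192 = 3840
    blocks that fit:     min(8192 / 3840, 8) = 2   (register-limited)
    active threads:      2 x 192 = 384
    occupancy:           384 / 768 = 50%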

Threads get time-sliced (for free, since there are no context switches). Ultimately, there are 8 streaming processors per multiprocessor. But, instructions are pipelined and thread-switching is used to hide latency. For example, going from 32 to 768 active threads per multiprocessor results in a 7x speedup for a kernel that reads, increments, and writes back to gmem.

Paulius

Thanks to you all. By setting the environment variable, I was able to get a log on Windows XP.

Could you please give me an example on how to write a configuration file using “Profiler Configuration” options?
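If I remember the readme correctly, the configuration file is a plain text file with one signal per line, and you point the CUDA_PROFILE_CONFIG environment variable at it. A minimal sketch (the counter names below are taken from the G80-era readme; check your readme for the exact signals your card supports):

    # config.txt - one profiler signal per line
    gld_incoherent
    gld_coherent
    gst_incoherent
    gst_coherent

Then add set CUDA_PROFILE_CONFIG=config.txt to the same batch file that sets CUDA_PROFILE=1.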