Max # of blocks?

What is the maximum number of concurrently running blocks?
In the occupancy.xls, it says 8 blocks/processor and 16 processors/GPU. So is the max number of concurrent blocks 16 x 8 = 128?

It depends on your card and your program. The occupancy tells you the number of blocks per multiprocessor for a particular kernel (8 is the maximum). Just multiply that number with the number of multiprocessors in your card to get what you want. Tesla C870 has 16 multiprocessors, check the programming guide for the other cards’ specs.


Thank you. How can I verify how many blocks/threads are actually running?

The reason why I ask this is:

I have a kernel that uses 20 registers and a little bit of shared memory. I think one multiprocessor can accommodate 2 blocks with this configuration. Besides, the programming guide recommends two or more blocks per multiprocessor.

So I launch 16 x 2 = 32 blocks, hoping the kernel will run at full speed across all 16 multiprocessors. But my test result is:

  1. 32 blocks x 192 threads: 1785 ms

  2. 16 blocks x 192 threads: 937 ms

Why is the time in case 2 almost half that of case 1? If all 16 multiprocessors are running, the times should be similar, right?

The CUDA profiler writes the actual occupancy to its log file. Check the CUDA profiler readme in the /doc directory.

Your second case has half as many threads as the first one. Each thread gets executed by some streaming processor, so the first case is twice as much work.

Ultimately what matters is the number of threads. I would expect the times to be quite close if you compared 32 blocks x 192 threads and 16 blocks x 384 threads. Both configurations would give you 50% occupancy on a GPU with 16 multiprocessors, assuming register and smem usage allowed that.


My understanding was:

With 20 registers per thread, if I run up to 192 threads per block, then 20 registers x 192 threads x 2 blocks = 7680 registers, which is less than the 8192 registers in a multiprocessor. So I thought a multiprocessor would be able to run 2 blocks concurrently. With 16 multiprocessors, 32 blocks in total should be running concurrently. I thought this would be close to 100% occupancy?

Do you mean a processor runs one thread at a time? Does this imply that at most 128 threads can be running concurrently, since we have only 128 processors (8 processors per multiprocessor x 16 multiprocessors) in a card?

I saw the CUDA profiler, but where do I set those variables such as CUDA_PROFILE? In the program?

From the description of the CUDA profiler, which options/signals describe the number of threads/blocks running concurrently? I did see there is an option for occupancy. Thank you all.

You have to add CUDA_PROFILE as an environment variable. If you are using Windows, it is basically:

My Computer->Properties->Advanced->Environment Variables

Add a “CUDA_PROFILE” variable with value 1 in both tables (user and system variables).

Or, maybe easier, launch your program from a batch file that sets the variable just for that run.
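For example, a minimal batch file might look like this (my_app.exe is a placeholder for your executable):

```bat
rem Enable the profiler just for this run
set CUDA_PROFILE=1
my_app.exe
rem Turn it off again
set CUDA_PROFILE=0
```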


You don’t want profiling enabled when you time the complete application, since having profiling enabled adds some extra launch overhead.

A multiprocessor can have 768 active threads. So 100% occupancy would mean that you in fact have 768 threads active (number of active blocks x threads per block). Therefore, 2 blocks of 192 threads give you 50% occupancy. By the way, 50% occupancy generally leads to pretty good performance.

Threads get time-sliced (for free, since there are no context switches). Ultimately, there are 8 streaming processors per multiprocessor. But, instructions are pipelined and thread-switching is used to hide latency. For example, going from 32 to 768 active threads per multiprocessor results in a 7x speedup for a kernel that reads, increments, and writes back to gmem.
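The read/increment/write kernel mentioned above might look something like this minimal sketch (the name and signature are hypothetical):

```cuda
// Hypothetical read-increment-write global-memory kernel. With many
// active threads per multiprocessor, the latency of the gmem read is
// hidden by switching to other warps while the load is in flight.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] + 1;  // read gmem, increment, write back
}
```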


Thanks to you all. By setting the environment variable, I was able to get a log on Windows XP.

Could you please give me an example of how to write a configuration file using the “Profiler Configuration” options?