A question about CUDA's thread parallelization

I am new to CUDA. I have a question about how much thread parallelism CUDA can actually provide.

Chapter 5.1 of the programming guide says:

The maximum number of threads that can run concurrently on a multiprocessor is 768.

But it also says: The warp size is 32 threads for the 8800 series. And the definition of a warp is (see chapter 3.2): Each block is split into SIMD groups of threads called warps. And SIMD means: at any given clock cycle, each processor of the multiprocessor executes the same instruction, but operates on different data.

So I got a little confused. Could anybody answer this question to straighten it out:

If I have 768 threads that do the same job on different data, and I somehow manage to get them executed on one multiprocessor, will they be executed concurrently, or divided into 24 warps and executed in sequence?


All threads in a threadblock will be grouped into warps. The order of warp issue is not defined, but you can synchronize the threads within a block with the __syncthreads() function.

One more thing. Even though a multiprocessor may have up to 768 threads active, the maximum number of threads per threadblock is 512 (leading to 66.67% utilization). To get to the maximum utilization (768 threads), you’d have to assign more than one threadblock per multiprocessor (for example, 3 blocks with 256 threads each). All that is affected by how much shared memory and registers your kernel requires (play with the occupancy calculator for that).
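
To make the utilization numbers above concrete, here is a small host-side C++ sketch (no GPU needed). It only models the two thread limits quoted in this post and deliberately ignores registers and shared memory; the function and constant names are made up for illustration.

```cpp
#include <cassert>

// Limits quoted in this thread for the 8800 series.
constexpr int kThreadSlotsPerMultiprocessor = 768;
constexpr int kMaxThreadsPerBlock = 512;

// Whole blocks that fit under the per-multiprocessor thread limit.
// Simplification: registers and shared memory are not considered.
int blocks_by_thread_limit(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kMaxThreadsPerBlock);
    return kThreadSlotsPerMultiprocessor / threads_per_block;
}

// Percentage of the 768 thread slots that end up occupied.
double utilization_percent(int threads_per_block) {
    int resident_threads =
        blocks_by_thread_limit(threads_per_block) * threads_per_block;
    return 100.0 * resident_threads / kThreadSlotsPerMultiprocessor;
}
```

With these definitions, a block size of 512 gives one resident block and about 66.67%, while a block size of 256 gives three resident blocks and 100%.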



Thanks for your answer!

My understanding is that more than one warp can be executed concurrently on one multiprocessor if these warps do the same job. Therefore, by mapping more than one threadblock onto one multiprocessor, theoretically 768 threads could be executed concurrently on one multiprocessor.

Well, in that case, CUDA would be really cool.


That’s true in the same sense that 100 programs can execute concurrently on a single-core CPU - they are all active but they get scheduled for actual execution by the OS (using whichever method for preempting etc.).



Your last reply confused me again.

An OS schedules multiple tasks in a time-slicing fashion, and I know that a CUDA multiprocessor works the same way if it cannot host all the warps in one shot. But if your task contains no more than 768 threads, which one multiprocessor can host in one shot, all of these threads should be executed CONCURRENTLY.

The CUDA documentation uses the term “CONCURRENTLY”, which I assume means parallelism in hardware rather than time-slicing. Do you think so?

Let me put it plainly:

Suppose you have a CUDA device with 16 multiprocessors (warp size 32), and each multiprocessor can host 768 threads concurrently. Does that mean the CUDA device could theoretically offer a speed-up factor of 12288 (16 x 768), or of 512 (16 x 32)?

Which one do you think is correct?


Speedup depends both on hardware and the application. Also, it depends on what you are measuring the speedup against (Pentium4, Core2, etc.). For example, observed speedups over Core2 implementations for different algorithms vary from 10x to 200x. It generally is not correct to assume that the speedup will be equal to the number of threads.

G80 hardware has 128 processing elements (16 multiprocessors, 8 processing elements each).

Think of it in this way: a multiprocessor is executing an entire warp of threads concurrently (at the same time), with other warps waiting for their turn at being executed. So, 32 threads are actively executing at the moment, while up to (768-32=736) threads are active but waiting.
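
The counts in this reply can be restated as plain arithmetic. The sketch below is host-side C++ with made-up names. The last function assumes that a warp's 32 threads are pushed through the 8 processing elements in equal passes; that is my reading of the hardware, not something stated above.

```cpp
#include <cassert>

// Figures quoted in this thread for G80.
constexpr int kWarpSize = 32;
constexpr int kMultiprocessors = 16;
constexpr int kProcessingElementsPerMultiprocessor = 8;

// Number of warps needed to hold a given number of threads (rounded up).
int warps_for_threads(int threads) {
    assert(threads >= 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Processing elements on the whole device.
int total_processing_elements() {
    return kMultiprocessors * kProcessingElementsPerMultiprocessor;
}

// ASSUMPTION: one instruction for a full warp takes this many passes
// through the multiprocessor's processing elements.
int passes_per_warp_instruction() {
    return kWarpSize / kProcessingElementsPerMultiprocessor;
}
```

Here 768 threads come out as 24 warps, and the device has 128 processing elements in total. Under this model, neither 12288 nor 512 is the number of operations actually in flight at once.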

From a pure work point of view, there would be no benefit from using more than a warp of threads. However, having more threads improves performance, because the scheduler can switch between active warps to hide latency. This helps when accessing global memory and dealing with read-after-write register dependencies, among other cases (the Performance Guidelines chapter has the details).
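
A rough way to picture the latency hiding is a toy model in which every warp issues one instruction and then has to wait for its result. The host-side C++ sketch below uses made-up names, and the model itself is an assumption for illustration, not a description of the real scheduler.

```cpp
#include <cassert>

constexpr int kThreadsPerWarp = 32;

// Toy model: a warp issues one instruction, which occupies the
// multiprocessor for `issue_cycles`, and its result is ready
// `latency_cycles` after the issue started. Returns how many warps
// (including the waiting one) must take turns so that the
// multiprocessor never sits idle.
int warps_to_hide_latency(int latency_cycles, int issue_cycles) {
    assert(latency_cycles >= 0 && issue_cycles > 0);
    return (latency_cycles + issue_cycles - 1) / issue_cycles;
}

// The same figure expressed in threads.
int threads_to_hide_latency(int latency_cycles, int issue_cycles) {
    return warps_to_hide_latency(latency_cycles, issue_cycles) * kThreadsPerWarp;
}
```

For example, with a hypothetical latency of 24 cycles and 4 cycles per warp instruction, the model asks for 6 warps, i.e. 192 threads. Longer latencies, such as global memory accesses, ask for correspondingly more warps.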



Thanks for replying. Now I think I understand it a lot better.

You mentioned that each multiprocessor has 8 processing elements. Can you explain what “processing element” refers to? Is it something like an ALU?

By the way, I work in computer vision, and we’ve been looking for a high-performance parallel computing platform for CV processing for a long time. We have tried FPGA, DSP, and vector CPU solutions. Each has its advantages as well as its limitations. Now I have found that CUDA might be another alternative that is easier to use and relatively fast. We have ordered two devices and are exploring their performance. Is there any workshop or webcast class from which we can get more inside knowledge about this technology?


Yes, basically they are ALUs that do all the work.

CUDA and hardware for which CUDA is supported should be a very good match for CV applications. Speedups are really good for the straightforward things, such as edge detection, etc. I believe you will find a few vision-related samples (image convolutions, histogram, denoising, etc) in the CUDA SDK.

One good way to learn some details of CUDA use would be to look through the slides of the ECE 498 course at UIUC. Also, look at the SDK samples and accompanying whitepapers.


The manual uses the term “concurrency” in several places. Sometimes it refers to the concurrency observed by the programmer, sometimes to the actual parallel ALUs.

Hardware concurrency is 8 ALUs per multiprocessor. Everything else is scheduled. As a programmer, however, you only have the block layout specification and the __syncthreads() command to influence execution. A block can have at most 512 threads, so as a programmer you have to account for a concurrency of at most 512, as you cannot suspend or resume individual threads. Each multiprocessor can schedule up to 8 blocks. Together, all blocks running on one multiprocessor may have at most 768 threads.
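
The three limits in this post (512 threads per block, 8 blocks and 768 threads per multiprocessor) interact, and small blocks run into the block limit before the thread limit. Here is a host-side C++ sketch of just those limits; the names are made up, and registers and shared memory are left out on purpose.

```cpp
#include <algorithm>
#include <cassert>

// Limits quoted in this post.
constexpr int kThreadLimitPerMultiprocessor = 768;
constexpr int kBlockLimitPerMultiprocessor = 8;
constexpr int kThreadLimitPerBlock = 512;

// Blocks that can be resident on one multiprocessor at the same time,
// taking only the block-count and thread-count limits into account.
int resident_blocks(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kThreadLimitPerBlock);
    return std::min(kBlockLimitPerMultiprocessor,
                    kThreadLimitPerMultiprocessor / threads_per_block);
}

// Threads that are resident as a result.
int resident_threads(int threads_per_block) {
    return resident_blocks(threads_per_block) * threads_per_block;
}
```

With 64-thread blocks, the 8-block limit caps the multiprocessor at 512 resident threads. With 96-thread blocks, both limits are reached at once (8 blocks, 768 threads).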


I’ve always been told that parallel is not the same as concurrent. Concurrent implies resource sharing; parallel implies simultaneous resource sharing. I suppose that whoever wrote the manual uses this terminology too.

The number of threads per block can be 512. But when I executed my code with the following configuration

func1<<<139, 512>>> (populate an array of size 512 with 1)
the code didn’t execute at all, leaving garbage values in the array.
But when I ran
func1<<<139, 256>>> (populate an array of size 256 with 1)
I got proper output.

My intention is to populate an array of size 139 x 8000 (rows x cols), and I want to spawn a total of 139 x 8000 = 1,112,000 threads. Is that possible?
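
One possible way to cover a 139 x 8000 array is a two-dimensional grid: one grid row per array row, and enough 256-thread blocks along x to cover the 8000 columns. The sketch below is host-side C++ that only reproduces the index arithmetic a kernel would do with blockIdx, blockDim and threadIdx. The struct and function names are hypothetical, and it is not the poster's func1.

```cpp
#include <cassert>

// Hypothetical description of a launch: grid of blocks_x by blocks_y
// blocks, each holding threads_per_block threads along x.
struct LaunchShape {
    int blocks_x;
    int blocks_y;
    int threads_per_block;
};

// One grid row per array row; columns are covered by rounding up.
LaunchShape shape_for(int rows, int cols, int threads_per_block) {
    assert(rows > 0 && cols > 0 && threads_per_block > 0);
    int blocks_x = (cols + threads_per_block - 1) / threads_per_block;
    return LaunchShape{blocks_x, rows, threads_per_block};
}

// Total threads launched, including the padding threads in the last
// block of every row.
long total_threads(const LaunchShape& s) {
    return static_cast<long>(s.blocks_x) * s.blocks_y * s.threads_per_block;
}

// Stand-in for the kernel's index math:
//   col = blockIdx.x * blockDim.x + threadIdx.x,  row = blockIdx.y.
// Returns the flat element index, or -1 for a padding thread that
// falls past the last column and must not write anything.
long element_index(int block_y, int block_x, int thread_x,
                   int threads_per_block, int cols) {
    int col = block_x * threads_per_block + thread_x;
    if (col >= cols) return -1;
    return static_cast<long>(block_y) * cols + col;
}
```

For 139 x 8000 with 256 threads per block, this gives a 32 x 139 grid, i.e. slightly more threads than elements. That is why the bounds check on the column is needed.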

How many registers does your kernel require per thread (run nvcc with --ptxas-options=-v flag to see)? It is possible that there are not enough registers available to run a block with 512 threads.

If you fill in the values you see with the flag above into the occupancy calculator, it will probably give you 0% occupancy.
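
As a back-of-the-envelope version of that check, here is a host-side C++ sketch. It assumes 8192 registers per multiprocessor, which is the figure I remember for this hardware generation, and it ignores any allocation rounding, so treat the occupancy calculator as the authority. The names are made up for illustration.

```cpp
#include <cassert>

// ASSUMPTION: size of the register file on one multiprocessor.
constexpr int kRegistersPerMultiprocessor = 8192;

// Blocks that fit on one multiprocessor when only registers are
// considered. A result of 0 means that not even one block fits,
// so the kernel cannot run with that block size.
int blocks_by_register_limit(int registers_per_thread, int threads_per_block) {
    assert(registers_per_thread > 0 && threads_per_block > 0);
    return kRegistersPerMultiprocessor /
           (registers_per_thread * threads_per_block);
}

// Largest per-thread register count that still lets one block fit.
int max_registers_per_thread(int threads_per_block) {
    assert(threads_per_block > 0);
    return kRegistersPerMultiprocessor / threads_per_block;
}
```

Under this assumption, a kernel that needs 17 registers per thread fits with 256 threads per block but not with 512, which would match the symptom described above.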