Independent threads

Hi Guys,

How can I execute threads with different data?


First thread: computing Pi (3.14…).
Second thread: multiplying some matrices.

In the SDK samples, I found only examples that work with ONE set of data across different threads.

What is your goal with this? The G80 is not like having 128 scalar (or even 16 SIMD) cores that execute different threads at the same time, with every thread executing a different function, i.e. different code.

G80/CUDA is for exploiting “data parallelism” using kernels. It’s more like a kind of super SIMD. If you have to process a lot of (mostly independent) data with the same or nearly the same calculations, you start a kernel with a lot of threads to use a lot of ALUs for those calculations. Even though there can be some divergence at block or even warp level, and predication masks at thread level, the threads all execute the same code on different data, not totally different tasks.

If “task parallelism” is what you need, you should buy a multi-core CPU and use Windows threads or pthreads.



The G80 is a vector processor; it’s good at running the same little program on large grids of data (like the above poster pointed out). There are 128 processing elements, but they’re sharing a control unit, and there is no automated caching or branch prediction (these two features make up a large percentage of any “normal” processor). Thus, it can rip through arrays of numbers at mind-blowing speeds, but will lag FAR behind the CPU on most other tasks.

For multiple tasks that are well-suited to the G80, you can use time-sharing and swap in different ‘kernels’ as they are needed.

If what you really want are a bunch of cores that are all running completely different processes, then you might look into some kind of server.

Guys, the G80 is a scalar architecture. Please do read the manual.

That is easy: just index into an array using the thread’s index, which is available in device code via blockIdx and threadIdx.
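As a minimal sketch of what that looks like (the kernel name and launch configuration are made up for illustration, not from the post):

```cuda
#include <cuda_runtime.h>

// Each thread handles exactly one element of the array.
__global__ void scale_kernel(float *data, int n)
{
    // Compute this thread's global index from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against a partial last block
        data[i] *= 2.0f;
}

// Host-side launch: one thread per element, 256 threads per block.
// scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
```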

Ah, you actually want to run different programs per thread. This is a totally different story.

From a software development point of view, the G80 can be understood as having pseudo vector units that scale automatically from 1 to 16 (see the manual for the gory details of divergence). That means you can run several code paths within the same kernel, doing Pi and the matrix at the same time, if you dedicate multiples of 16 threads to each.

However, programming such kernels is very complicated for several technical reasons (thread sync, #registers, shared mem limits, etc.). It is much easier (and more efficient) to produce a specialized kernel for each task and run each on several thousand instances of the problem. If you don’t need thousands of threads computing Pi, the G80 is not for you. If you can, go ahead and reformulate your Pi algorithm so that it can use that many threads. There is an example of how to do that for matrix multiplication in the CUDA manual.
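For instance, Pi can be reformulated in data-parallel form as numerical integration of 4/(1+x^2) over [0,1]. This is a hedged sketch (names and launch details are mine, not from the manual); it assumes blockDim.x is a power of two and that the launch passes blockDim.x * sizeof(float) as the shared memory size:

```cuda
#include <cuda_runtime.h>

// Midpoint-rule integration of 4/(1+x^2) over [0,1], which equals Pi.
// Each thread sums a strided subset of the n intervals, then the block
// reduces the partial sums in shared memory.
__global__ void pi_kernel(float *block_sums, int n)
{
    extern __shared__ float partial[];   // one float per thread
    int tid    = threadIdx.x;
    int gid    = blockIdx.x * blockDim.x + tid;
    int stride = gridDim.x * blockDim.x;

    float h   = 1.0f / n;
    float sum = 0.0f;
    for (int i = gid; i < n; i += stride) {
        float x = (i + 0.5f) * h;        // interval midpoint
        sum += 4.0f / (1.0f + x * x);
    }
    partial[tid] = sum * h;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = partial[0];
}
// The host then sums block_sums (or launches a second reduction pass).
```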


There seems to be a lot of confusion (in general) about the definition of a “vector processor”. It does not refer to the type of 4-dimensional vector operations you see in most graphics cards. Instead, it denotes a data-parallel architecture where the same program is run on many data elements simultaneously (and independently of each other).

The Wikipedia article gives a pretty good introduction to the concept of a vector processor:

You can use a switch statement in your kernel, but you have to take care that the different code paths are executed in different warps. A warp is 32 threads. Each thread gets its position via threadIdx.

Something like this should work:

 int id = (threadIdx.y * blockDim.x + threadIdx.x) >> 5;  /* warp number within the block */

 switch (id)
 {
     case 0: /* threads 0 ... 31 */
         break;
     case 1: /* threads 32 ... 63 */
         break;
 }

Please correct me if I understood something wrong :-)



Another option is to switch on the block index, which also ensures warps will not diverge. This is the usual approach people take to writing a “fat kernel,” which is the term I’ve seen people use for a kernel which does many different tasks simultaneously. The only disadvantage to switching at the block level is that different blocks cannot communicate on the G80 (8800), and can only communicate via atomic integer operations to global memory on the G84 (8500, 8600).
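A hedged sketch of the block-level variant (the kernel name and placeholder bodies are made up for illustration):

```cuda
#include <cuda_runtime.h>

// "Fat kernel": blocks 0..pi_blocks-1 run the Pi task, the remaining
// blocks run the matrix task. Because the branch condition depends only
// on blockIdx, every thread in a warp takes the same path: no divergence.
__global__ void fat_kernel(float *pi_out, float *mat_out, int pi_blocks)
{
    if (blockIdx.x < pi_blocks) {
        // ... Pi portion of the work; placeholder only:
        if (threadIdx.x == 0) pi_out[blockIdx.x] = 0.0f;
    } else {
        // Re-number blocks so the second task sees indices starting at 0.
        int b = blockIdx.x - pi_blocks;
        // ... matrix portion of the work; placeholder only:
        if (threadIdx.x == 0) mat_out[b] = 0.0f;
    }
}
```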

Nevertheless, I agree with the other comments. The GPU is easiest to program when you can dedicate all the silicon to one (very wide) data-parallel task at a time. Push the switching up to the CPU level if you can get away with it.

Yes, this is another option. It suffers, however, from the fact that all blocks must have the same number of threads, and very small blocks are inefficient.