Independent threads

Hi Guys,

How can I execute threads with different data?

Example:

First thread: Computing Pi (3.14…).
Second thread: Multiplying some matrices.

P.S.
In the SDK samples, I found only examples that work with ONE set of data across different threads.

What is your goal with this? The G80 is not like having 128 scalar (or even 16 SIMD) cores that execute different threads at the same time, with every thread executing a different function, i.e. different code.

G80/CUDA is for exploiting “data parallelism” using kernels. It’s more like some kind of super SIMD. If you have to process a lot of (mostly independent) data with the same or nearly the same calculations, you start a kernel with a lot of threads to use a lot of ALUs for these calculations. But even though there can be some divergence at block or even warp level, and predication masks at thread level, the threads all execute the same code on different data, not totally different tasks.

If “task parallelism” is what you need, you should buy a multi-core CPU and use Windows threads or pthreads.

Greetings

    Knax

The G80 is a vector processor; it’s good at running the same little program on large grids of data (like the above poster pointed out). There are 128 processing elements, but they’re sharing a control unit, and there is no automated caching or branch prediction (these two features make up a large percentage of any “normal” processor). Thus, it can rip through arrays of numbers at mind-blowing speeds, but will lag FAR behind the CPU on most other tasks.

For multiple tasks that are well-suited to the G80, you can use time-sharing and swap in different ‘kernels’ as they are needed.
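For illustration, a minimal host-side sketch of that time-sharing (the kernel names and launch sizes are made up; successive launches from one context simply queue up and run one after the other):

  __global__ void computePi(float *out) { /* ... Pi work ... */ }
  __global__ void matMul(float *out)    { /* ... matrix work ... */ }

  int main()
  {
      float *d_pi, *d_mat;
      cudaMalloc((void**)&d_pi,  64 * sizeof(float));
      cudaMalloc((void**)&d_mat, 256 * sizeof(float));

      // The two tasks time-share the GPU: launches from the same
      // context execute in order, one kernel at a time.
      computePi<<<64, 256>>>(d_pi);
      matMul<<<16, 256>>>(d_mat);
      cudaThreadSynchronize();   // wait for both (CUDA 1.x-era call)

      cudaFree(d_pi);
      cudaFree(d_mat);
      return 0;
  }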

If what you really want are a bunch of cores that are all running completely different processes, then you might look into some kind of server.

Guys, the G80 is a scalar architecture. Please do read the manual.

That is easy: just index into an array using the thread’s coordinates, available in device code via blockIdx and threadIdx.
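For example, a minimal sketch (the kernel name and data are made up) where every thread picks its own element:

  __global__ void scale(float *data, float factor, int n)
  {
      // Each thread derives a unique global index from its block and
      // thread coordinates and works only on "its own" element.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          data[i] *= factor;
  }

  // Launched with enough blocks to cover all n elements, e.g.:
  // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);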

Ah, you actually want to run different programs per thread. This is a totally different story.

From a software development point of view, the G80 can be understood as having pseudo-vector units that scale automatically from 1 to 16 (see the manual for the gory details of divergence). That means you can run several code paths within the same kernel, doing Pi and the matrix at the same time, if you dedicate multiples of 16 threads to each.

However, programming such kernels is very complicated for several technical reasons (thread sync, #registers, shared mem limits, etc.). It is much easier (and more efficient) to produce a specialized kernel for each task and run it on several thousand instances of the problem. If you don’t need thousands of threads computing Pi, the G80 is not for you. If you can, go ahead and reformulate your Pi algorithm such that it can use that many threads. There is an example of how to do that for matrix multiplication in the CUDA manual.
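For illustration only (this is not the manual’s example), one common way to reformulate Pi for thousands of threads is numerical integration of 4/(1+x^2) over [0,1]: each thread sums a strided slice of the intervals, a shared-memory reduction combines each block’s partial sums, and the host adds up the per-block results. A minimal sketch, assuming 256 threads per block:

  __global__ void piPartialSums(float *blockSums, int nIntervals)
  {
      __shared__ float partial[256];   // one slot per thread (256 assumed)
      int tid    = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;
      float h    = 1.0f / nIntervals;

      // Midpoint rule over a strided slice of the intervals.
      float sum = 0.0f;
      for (int i = tid; i < nIntervals; i += stride) {
          float x = (i + 0.5f) * h;
          sum += 4.0f / (1.0f + x * x);
      }
      partial[threadIdx.x] = sum * h;
      __syncthreads();

      // Tree reduction in shared memory (block size must be a power of two).
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s)
              partial[threadIdx.x] += partial[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0)
          blockSums[blockIdx.x] = partial[0];   // host sums these
  }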

Peter

There seems to be a lot of confusion (in general) about the definition of a “vector processor”. It does not refer to the type of 4-dimensional vector operations you see in most graphics cards. Instead, it denotes a data-parallel architecture where the same program is run on many data elements simultaneously (and independently of each other).

The Wikipedia article gives a pretty good introduction to the concept of a vector processor:

http://en.wikipedia.org/wiki/Vector_processor

You can use a switch statement in your kernel, but you have to take care that different code paths are executed in different warps. A warp is 32 threads. Each thread learns its position via threadIdx.

Something like this should work:

  int id = (threadIdx.y * blockDim.x + threadIdx.x) >> 5;  // warp index within the block

  switch (id)
  {
      case 0: /* threads 0 ... 31 */ break;
      case 1: /* threads 32 ... 63 */ break;
      ...
  }

Please correct me if I understood something wrong :-)

Regards

Pototschnig

Another option is to switch on the block index, which also ensures warps will not diverge. This is the usual approach people take to writing a “fat kernel,” which is the term I’ve seen people use for a kernel which does many different tasks simultaneously. The only disadvantage to switching at the block level is that different blocks cannot communicate on the G80 (8800), and can only communicate via atomic integer operations to global memory on the G84 (8500, 8600).
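For illustration, such a fat kernel might look like the sketch below (task_pi and task_matmul are hypothetical device functions standing in for the real work):

  __device__ void task_pi(float *out)     { /* ... Pi work ... */ }
  __device__ void task_matmul(float *out) { /* ... matrix work ... */ }

  __global__ void fatKernel(float *outPi, float *outMat, int piBlocks)
  {
      // Every thread in a block takes the same branch, so no warp
      // inside any block ever diverges.
      if (blockIdx.x < piBlocks)
          task_pi(outPi);        // blocks [0, piBlocks) compute Pi
      else
          task_matmul(outMat);   // the remaining blocks do the matrix work
  }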

Nevertheless, I agree with the other comments. The GPU is easiest to program when you can dedicate all the silicon to the one (very wide) data-parallel task at a time. Push the switching up to the CPU level if you can get away with it.

Yes, this is another option. It suffers, however, from the fact that all blocks have the same number of threads, and very small blocks are inefficient.

Peter

Thanks