Confused about some basic stuff

I’m new here, and I’m confused about some basic stuff.

1. Is a device a GPU, and are multiple devices simply multiple GPUs?
2. The GeForce 8800 Ultra has 16 multiprocessors and 128 stream processors, so each multiprocessor has 8 stream processors? Is a stream processor the same concept as a core in a CPU?
3. A kernel executes a grid of threads, and a grid consists of many blocks. A block can be executed on only one multiprocessor, but can several blocks be executed on the same multiprocessor?

Thanks

  1. Yes. Device == GPU.

  2. Yes, each multiprocessor is made up of 8 processors on current architectures. The concept of a processor here is quite different from a CPU core, since each of the processors of a given multiprocessor on the GPU is usually running the same instruction but on different data (SIMD; see the toy kernel sketch after this list). CPUs can run different code on each core.

  3. Yes, a multiprocessor can execute multiple blocks concurrently, and this is advantageous for performance - when one block stalls waiting for a global memory read or due to thread synchronization, the multiprocessor can switch and execute a different block instead.
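To make the SIMD point concrete, here is a toy kernel (my own sketch, not from the SDK). Every thread executes exactly the same instructions, but each one works on a different array element:

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Same code in every thread; only the index (and hence the data) differs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard threads that fall past the end of the array
        c[i] = a[i] + b[i];
}

On a CPU you would instead loop over i on one core, or give different cores entirely different code to run.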

Is this all not adequately explained in the programming guide?

I have some other questions:

  • Each kernel defines its own grid structure, and that structure is fixed when the kernel is executed. Is that right?

  • A kernel is defined through a __global__ function, so if I have 2 __global__ functions (e.g. processing the data in 2 steps), that means I have 2 kernels and can give each one a different grid/block structure. Is that right?

  • Can I reuse the result of the first kernel in the second one, so that I don’t have to copy the result back to main (host) memory and reinitialize the memory for the second kernel (__global__ function)?

  • I am confused by the syntax for calling a __global__ function:

Func<<<Dg, Db, Ns>>>(parameters)

In the specification Dg and Db are of type dim3, but in the sample code I always see something like this

(alignedTypes sample code)

testKernel<<<64, 256>>>(
    (TData *)d_odata,
    (TData *)d_idata,
    numElements
);

So what are Dg, Db, and Ns here?

Is Dg = dim3(64, 64, 64) or dim3(64, 1, 1)?

It seems to me, from the discussion at

http://forums.nvidia.com/index.php?showtopic=41519

that there is one standard way to compute the thread_id. So why don’t we have an inline, optimized function to compute it? That would be more efficient and would reduce the confusion and errors that can arise when everyone writes their own version.
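I mean something along these lines (just a rough sketch for the common 1-D case; the name is made up):

// Hypothetical helper, not part of the CUDA headers.
__device__ inline unsigned int globalThreadId1D()
{
    return blockIdx.x * blockDim.x + threadIdx.x;
}

Every kernel could then call this instead of re-deriving the index by hand.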

  • yes

  • yes

  • Yes, simply use the same pointers to global memory (a fuller host-side sketch is at the end of this post):

function1 <<< Dg1, Db1, Ns1 >>> ((data*)buffer1, (data*)buffer2);

function2 <<< Dg2, Db2, Ns2 >>> ((data*)buffer2, (data*)buffer1);

  • dim3 can be initialized with 1, 2, or 3 arguments; any components you omit default to 1. Ns is optional and defaults to 0, so:

<<<64, 256>>> means Dg = dim3(64, 1, 1); Db = dim3(256, 1, 1); Ns = 0;
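In other words, the alignedTypes launch above is equivalent to the fully explicit form (same names as in the sample):

// Implicit form: scalars are converted to dim3 with y = z = 1, and Ns defaults to 0.
testKernel<<<64, 256>>>((TData *)d_odata, (TData *)d_idata, numElements);

// Explicit form: exactly the same launch configuration.
dim3 Dg(64, 1, 1);    // 64 blocks in the grid
dim3 Db(256, 1, 1);   // 256 threads per block
testKernel<<<Dg, Db, 0>>>((TData *)d_odata, (TData *)d_idata, numElements);

And for reusing results between kernels (third bullet above), the host code might look roughly like this. It is only a sketch: data, numElements, h_input, h_output and the launch configurations stand in for whatever your application uses.

data *buffer1, *buffer2;
size_t bytes = numElements * sizeof(data);
cudaMalloc((void **)&buffer1, bytes);
cudaMalloc((void **)&buffer2, bytes);

// Copy the input to the device once.
cudaMemcpy(buffer1, h_input, bytes, cudaMemcpyHostToDevice);

function1<<<Dg1, Db1, Ns1>>>(buffer1, buffer2);   // step 1: reads buffer1, writes buffer2
function2<<<Dg2, Db2, Ns2>>>(buffer2, buffer1);   // step 2: consumes buffer2 directly on the GPU

// Copy only the final result back to the host.
cudaMemcpy(h_output, buffer1, bytes, cudaMemcpyDeviceToHost);

cudaFree(buffer1);
cudaFree(buffer2);

In a real application you would of course also check the return codes of the CUDA calls.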