Hi :ph34r:
In reference to the document:
http://developer.download.nvidia.com/compu…g_Guide_2.0.pdf
Page 5
I am having trouble understanding how threads are run. For instance, in the code below, since the function is a kernel function, the code will be executed by all the threads launched by the system. My understanding is that the compiler should associate each thread with its input data and possibly define where the results should be stored. As such, does the instruction int i = threadIdx.x; associate the input data i with the thread of index x, or should I rather say it assigns the value of threadIdx.x to the variable i? In this context, what does the instruction C[i] = A[i] + B[i] really mean as far as the data-thread dependence is concerned?
[codebox]// Kernel definition: each thread handles one array element
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;    // this thread's index within the block
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation with one block of N threads
    vecAdd<<<1, N>>>(A, B, C);
}[/codebox]
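To make my question concrete, here is a minimal, self-contained sketch of how I imagine the host side fills in around that kernel (the vector length N, the host arrays, and the initialization values are all my own assumptions, not from the guide):
[codebox]#include <stdio.h>
#include <cuda_runtime.h>

#define N 8   /* assumed vector length, not from the guide */

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;    /* each thread picks its own element */
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    float *dA, *dB, *dC;
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2 * i; }

    cudaMalloc((void**)&dA, N * sizeof(float));
    cudaMalloc((void**)&dB, N * sizeof(float));
    cudaMalloc((void**)&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    /* one block of N threads: thread i handles element i */
    vecAdd<<<1, N>>>(dA, dB, dC);

    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("C[%d] = %f\n", i, hC[i]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}[/codebox]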
Pages 6-7
I am having problems understanding how the expressions for i and j are deduced in the code below (see codebox), i.e.
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
Are the threads within any block given IDs in the range [0…X, 0…Y], where blockDim = [X+1, Y+1], or instead in
[0…(blockIdx.x * blockDim.x + threadIdx.x), 0…(blockIdx.y * blockDim.y + threadIdx.y)]?
I mean, if we have a grid of two blocks, each of 2x2 threads, will the indices of the threads be:
------------+
(0,0) (1,0) | (2,0) (3,0)
(0,1) (1,1) | (2,1) (3,1)
------------+
or instead
------------+
(0,0) (1,0) | (0,0) (1,0)
(0,1) (1,1) | (0,1) (1,1)
------------+
and hence, what is the role of blockIdx.x * blockDim.x and blockIdx.y * blockDim.y?
[codebox]// Kernel definition: one thread computes one matrix element
__global__ void matAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}[/codebox]
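For what it is worth, here is how I currently work through the arithmetic for the two-block example above, written as a plain host-side loop (the grid and block dimensions are just my example, and this only emulates the formula, it does not run on the GPU):
[codebox]#include <stdio.h>

/* Host-side emulation of the index formula for a grid of 2 blocks
   in x, each block 2x2 threads (gridDim = (2,1), blockDim = (2,2)).
   This is just my reading of i = blockIdx.x * blockDim.x + threadIdx.x. */
int main()
{
    int gridDimX = 2, blockDimX = 2, blockDimY = 2;
    for (int bx = 0; bx < gridDimX; ++bx)
        for (int ty = 0; ty < blockDimY; ++ty)
            for (int tx = 0; tx < blockDimX; ++tx) {
                int i = bx * blockDimX + tx;  /* global x index */
                int j = 0 * blockDimY + ty;   /* global y index (one block in y) */
                printf("block %d, thread (%d,%d) -> (i,j) = (%d,%d)\n",
                       bx, tx, ty, i, j);
            }
    return 0;
}[/codebox]
Running this prints i = 2 and 3 for the threads of the second block, which would match my first diagram, but I would like to confirm that this is the intended reading.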
Page 11
Can the code on the CPU and the GPU be interleaved? If not, why? And what effect does a time-sharing operating system have on the execution of our GPU application, especially if our application is interrupted, say, by a higher-priority application?
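Here is the kind of interleaving I have in mind, as a minimal sketch, assuming that a kernel launch returns control to the CPU immediately (the busy kernel and its sizes are made up by me just to keep the GPU occupied for a moment):
[codebox]#include <stdio.h>
#include <cuda_runtime.h>

/* Trivial kernel used only to keep the GPU busy for a while */
__global__ void busy(float* x)
{
    for (int k = 0; k < 1000000; ++k)
        x[threadIdx.x] += 1.0f;
}

int main()
{
    float* d;
    cudaMalloc((void**)&d, 32 * sizeof(float));
    cudaMemset(d, 0, 32 * sizeof(float));

    busy<<<1, 32>>>(d);        /* launch returns immediately (I think) */
    printf("CPU is free to do other work here\n");

    cudaThreadSynchronize();   /* block until the GPU has finished */
    printf("GPU finished\n");

    cudaFree(d);
    return 0;
}[/codebox]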
Page 13, section 3.1
I quote: "The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors."
I am somewhat confused after reading this. Can the execution of blocks be interleaved within an SM, or must a block finish its execution before a new one is launched? The paragraph above seems to confirm the second, which confuses me, since I have read that each SM can accept and schedule up to 768 threads at once. Spelling out the arithmetic behind my confusion with a tiny sketch below (the 256-thread block size is just an example I picked, not a number from the guide):
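[codebox]#include <stdio.h>

/* If an SM can hold 768 threads, several blocks should fit at once.
   256 threads/block is just an example size I picked. */
int main()
{
    int maxThreadsPerSM = 768;
    int threadsPerBlock = 256;
    printf("resident blocks per SM: %d\n",
           maxThreadsPerSM / threadsPerBlock);   /* prints 3 */
    return 0;
}[/codebox]
If three blocks can be resident on one SM at the same time, how does that square with blocks having to terminate before new ones are launched?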
Thanks for your help!