Complete novice question on the basic implementation of a kernel

Hi,

Apologies if this has been asked before, and for its simplicity.

I have started to read the CUDA Programming Guide and have some very basic questions:

  1. Section 2.1 describes kernels, which seem to be basic C functions. When the code in Section 2.1 is invoked, rather than N iterations, will there be one iteration multiplied by N, achieving the result in approximately 1/N of the time?

  2. Section 2.1 states that a calculation is divided into N threads, which I assume means, as in question 1 above, N calculations being processed at the same time. In the code example, threadIdx is used, and in Section 2.2 threads are arranged into blocks. Is it correct that a block is just a collection of threads, and that you do not have to worry about how the system implements these blocks when programming?

  3. Section 2.2 then states that 1-, 2- and 3-dimensional blocks have thread IDs calculated using the rules given in the text. Is it important to determine the thread ID, and will it be necessary to use it when programming? Basically, is it rarely used, or something that will be used often?

Apologies for the basic and perhaps unusual questions. I just want to make sure that I remember the important areas before reading the document further. Thanks.

Regards,

Richard.

[quote]

  1. Section 2.1 describes kernels, which seem to be basic C functions. When the code in Section 2.1 is invoked, rather than N iterations, will there be one iteration multiplied by N, achieving the result in approximately 1/N of the time?

[/quote]

__global__ void VecAdd(float* A, float* B, float* C)
{
	// Each of the N threads computes one element of C
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

int main()
{
	// Kernel invocation: 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C);
}
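
(Side note: the excerpt above leaves out the host-side setup, so A, B, C and N are undefined as written. A rough sketch of what a complete main() might look like, assuming N = 256 and the VecAdd kernel above; this is my own illustration, not the guide's code:)

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
	const int N = 256;                 // assumed vector length / thread count
	size_t size = N * sizeof(float);

	// Host arrays with some sample data
	float* h_A = (float*)malloc(size);
	float* h_B = (float*)malloc(size);
	float* h_C = (float*)malloc(size);
	for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

	// Device arrays
	float *A, *B, *C;
	cudaMalloc(&A, size);
	cudaMalloc(&B, size);
	cudaMalloc(&C, size);
	cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
	cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);

	// Kernel invocation: 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C);

	// Copy the result back to the host
	cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);

	cudaFree(A); cudaFree(B); cudaFree(C);
	free(h_A); free(h_B); free(h_C);
	return 0;
}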

N is the number of threads in the thread block, not the number of iterations.

Suppose size(A) = size(B) = size(C) = M < N; then you need to impose a boundary condition, as in the following code:

__global__ void VecAdd(float* A, float* B, float* C, int M)
{
	int i = threadIdx.x;
	// Boundary condition: only the first M of the N threads do useful work
	if (i < M)
		C[i] = A[i] + B[i];
}

int main()
{
	// Kernel invocation: still 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C, M);
}
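
(Another side note of mine, beyond the reply above: if M can exceed the number of threads a single block allows, the usual pattern is to launch several blocks and compute a global index inside the kernel. The block size of 256 is an assumed value; A, B, C and M are set up as in the examples above:)

__global__ void VecAddLarge(float* A, float* B, float* C, int M)
{
	// Global index across all blocks, not just within one block
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < M)
		C[i] = A[i] + B[i];
}

int main()
{
	int threadsPerBlock = 256;  // assumed block size
	// Ceiling division so blocksPerGrid * threadsPerBlock covers all M elements
	int blocksPerGrid = (M + threadsPerBlock - 1) / threadsPerBlock;
	VecAddLarge<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, M);
}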

Hi,

Thanks. I understand the syntax of what is being stated, that N is the number of threads in a block, but I just wanted to make sure that all N threads are calculated at the same time.

Further, the question I have is whether threadIdx is ever used to manipulate data, or whether, as in the example, it is just used to apply the mathematical operation in the kernel?

Thanks and Regards,

Richard.

Hardware resources are limited, so all threads cannot execute simultaneously. But from the programmer's point of view, you may assume that you have unlimited resources, such that all threads run simultaneously.

If you want fine-grained parallelism, say one thread deals with one data element, then you need threadIdx to choose the target data element.
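
For instance, here is a minimal sketch (my own illustration, with a made-up Negate kernel) of that fine-grained pattern, one thread per data element, with threadIdx selecting the target:

__global__ void Negate(float* data, int n)
{
	int i = threadIdx.x;  // threadIdx picks this thread's target element
	if (i < n)            // boundary condition, as in the earlier example
		data[i] = -data[i];
}

// Launch with one block of n threads: Negate<<<1, n>>>(d_data, n);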

Hi,

Thanks for the reply.

If N is a small number, say 20, I assume that there are enough cores to run all calculations in parallel; but if N = 1000, then this may not be the case, depending on the graphics card.

So I can use threadIdx to target a specific calculation/element. Is this the correct way to view threadIdx, as identifying an element or a calculation instance?

Thanks and Regards,

Richard.

Generally speaking, you need to invoke many threads (for example, at least 192 threads per block) to hide pipeline latency (please see the thread http://forums.nvidia.com/index.php?showtopic=109876).

Please see the "VecAdd" example in Section 3.2.1 of the programming guide and the "Matrix multiplication" example in Section 3.2.2; these two examples show the role threadIdx plays.
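
For question 3: the rule in Section 2.2 is that in a two-dimensional block of size (Dx, Dy), the thread at index (x, y) has thread ID x + y*Dx (and similarly with a z term for three dimensions). In practice you rarely compute that flat ID yourself; you use the components of threadIdx directly. Here is a sketch along the lines of the guide's MatAdd example, assuming N is small enough that N*N threads fit in one block:

#define N 16  // assumed: N*N must not exceed the per-block thread limit

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
	// The two components of the 2-D thread index select one matrix element
	int i = threadIdx.x;
	int j = threadIdx.y;
	C[i][j] = A[i][j] + B[i][j];
}

// Launch: one block of N x N threads, e.g.
//   dim3 threadsPerBlock(N, N);
//   MatAdd<<<1, threadsPerBlock>>>(A, B, C);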

Hi,

Thanks for the replies. I will need to do some more reading and work with the downloaded software. Your replies have been very helpful, thanks.

Regards,

Richard.