Complete novice question on the basic implementation of a kernel

Hi,

Apologies if this has been asked before, and for its simplicity.

I have started to read the CUDA Programming Guide and have some very basic questions:

  1. Section 2.1 describes kernels, which seem to be basic C functions. When the code in Section 2.1 is invoked, rather than N iterations, will there be one iteration multiplied by N, achieving the result in approximately 1/N of the time?

  2. Section 2.1 states that a calculation is divided into N threads, which I assume means, as in question 1 above, N calculations being processed at the same time. In the code example, threadIdx is used, and in Section 2.2 threads are arranged into blocks. Is it correct that a block is just a collection of threads, and that you do not have to worry about how the system implements these blocks when programming?

  3. Section 2.2 then states that 1-, 2- and 3-dimensional blocks have thread IDs calculated using the rules given in the text. Is it important to determine the thread ID, and will it be necessary to use it when programming? Basically, is it rarely used, or something that will be used often?

Apologies for the basic and perhaps unusual questions. I just want to make sure that I remember the important areas before reading the document further. Thanks.

Regards,

Richard.

[quote]

  1. Section 2.1 describes kernels, which seem to be basic C functions. When the code in Section 2.1 is invoked, rather than N iterations, will there be one iteration multiplied by N, achieving the result in approximately 1/N of the time?

[/quote]

__global__ void VecAdd(float* A, float* B, float* C)
{
	// Each of the N threads computes one element of C
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

int main()
{
	// Kernel invocation: 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C);
}
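
(Side note: the excerpt above leaves out the host-side setup, so A, B, C and N are undefined as written. A rough sketch of what a complete main() might look like, assuming N = 256 and the VecAdd kernel above; this is my own illustration, not the guide's code:)

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
	const int N = 256;                 // assumed vector length / thread count
	size_t size = N * sizeof(float);

	// Host arrays with some sample data
	float* h_A = (float*)malloc(size);
	float* h_B = (float*)malloc(size);
	float* h_C = (float*)malloc(size);
	for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

	// Device arrays
	float *A, *B, *C;
	cudaMalloc(&A, size);
	cudaMalloc(&B, size);
	cudaMalloc(&C, size);
	cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
	cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);

	// Kernel invocation: 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C);

	// Copy the result back to the host
	cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);

	cudaFree(A); cudaFree(B); cudaFree(C);
	free(h_A); free(h_B); free(h_C);
	return 0;
}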

N is the number of threads in the thread block, not the number of iterations.

Suppose size(A) = size(B) = size(C) = M < N; then you need to impose a boundary condition, as in the following code:

__global__ void VecAdd(float* A, float* B, float* C, int M)
{
	int i = threadIdx.x;
	// Boundary condition: only the first M of the N threads do useful work
	if (i < M)
		C[i] = A[i] + B[i];
}

int main()
{
	// Kernel invocation: still 1 block of N threads
	VecAdd<<<1, N>>>(A, B, C, M);
}
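
(Another side note of mine, beyond the reply above: if M can exceed the number of threads a single block allows, the usual pattern is to launch several blocks and compute a global index inside the kernel. The block size of 256 is an assumed value; A, B, C and M are set up as in the examples above:)

__global__ void VecAddLarge(float* A, float* B, float* C, int M)
{
	// Global index across all blocks, not just within one block
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < M)
		C[i] = A[i] + B[i];
}

int main()
{
	int threadsPerBlock = 256;  // assumed block size
	// Ceiling division so blocksPerGrid * threadsPerBlock covers all M elements
	int blocksPerGrid = (M + threadsPerBlock - 1) / threadsPerBlock;
	VecAddLarge<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, M);
}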

Hi,

Thanks. I understand the syntax of what is being stated, that N is the number of threads in a block, but I just wanted to make sure that all N threads are calculated at the same time.

Further, the question I have is whether threadIdx is ever used to manipulate data, or whether, as in the example, it is just used to apply the mathematical operation in the kernel?

Thanks and Regards,

Richard.

Hardware resources are limited, so all threads cannot execute simultaneously. But from the programmer's point of view, you may assume that you have unlimited resources, such that all threads run simultaneously.

If you want fine-grained parallelism, say one thread deals with one data element, then you need threadIdx to choose the target data element.
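
For instance, here is a minimal sketch (my own illustration, with a made-up Negate kernel) of that fine-grained pattern, one thread per data element, with threadIdx selecting the target:

__global__ void Negate(float* data, int n)
{
	int i = threadIdx.x;  // threadIdx picks this thread's target element
	if (i < n)            // boundary condition, as in the earlier example
		data[i] = -data[i];
}

// Launch with one block of n threads: Negate<<<1, n>>>(d_data, n);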

Hi,

Thanks for the reply.

If N is a small number, say 20, I assume that there are enough cores to run all calculations in parallel; but if N = 1000, then this may not be the case, depending on the graphics card.

So I can use threadIdx to target a specific calculation/element. Is this the correct way to view threadIdx, as identifying an element or a calculation instance?

Thanks and Regards,

Richard.

Generally speaking, you need to invoke many threads (for example, at least 192 threads per block) to hide pipeline latency (please see the thread http://forums.nvidia.com/index.php?showtopic=109876).

Please see the "VecAdd" example in Section 3.2.1 of the programming guide and the "Matrix multiplication" example in Section 3.2.2; these two examples show the role threadIdx plays.
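
For question 3: the rule in Section 2.2 is that in a two-dimensional block of size (Dx, Dy), the thread at index (x, y) has thread ID x + y*Dx (and similarly with a z term for three dimensions). In practice you rarely compute that flat ID yourself; you use the components of threadIdx directly. Here is a sketch along the lines of the guide's MatAdd example, assuming N is small enough that N*N threads fit in one block:

#define N 16  // assumed: N*N must not exceed the per-block thread limit

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
	// The two components of the 2-D thread index select one matrix element
	int i = threadIdx.x;
	int j = threadIdx.y;
	C[i][j] = A[i][j] + B[i][j];
}

// Launch: one block of N x N threads, e.g.
//   dim3 threadsPerBlock(N, N);
//   MatAdd<<<1, threadsPerBlock>>>(A, B, C);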

Hi,

Thanks for the replies. I will need to do some more reading and work with the downloaded software. Your replies have been very helpful, thanks.

Regards,

Richard.