Questions about the NVIDIA programming model and GPU architecture (newbie here...)

Hi

In reference to the document:

http://developer.download.nvidia.com/compu…g_Guide_2.0.pdf

Page 5

I am having trouble understanding how threads are run. For instance, in the code below, since the function is a kernel function, the code will be executed by all the threads launched by the system. My understanding is that the compiler should associate each thread with its input data and possibly define where results should be stored. As such, does the instruction int i = threadIdx.x; associate the input data i with the thread of index x, or should I rather say that it assigns the value of threadIdx.x to the variable i? In this context, what does the instruction C[i] = A[i] + B[i] really mean as far as the data-thread dependence is concerned?

[codebox]__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}[/codebox]
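For reference, the snippet as quoted is not runnable on its own: A, B, C and N are never declared. A minimal filled-in version (my own sketch, assuming N = 256 floats and a single block) would look like this:

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

#define N 256

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;   // each thread handles exactly one element
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    float *dA, *dB, *dC;

    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2 * i; }

    // Allocate device memory and copy the inputs over
    cudaMalloc((void**)&dA, N * sizeof(float));
    cudaMalloc((void**)&dB, N * sizeof(float));
    cudaMalloc((void**)&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    // One block of N threads; thread i computes C[i]
    vecAdd<<<1, N>>>(dA, dB, dC);

    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[10] = %f (expected 30)\n", hC[10]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}[/codebox]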

Page 6-7

I am having problems understanding how the expressions for i and j are deduced in the code below (see codebox), i.e.

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

are the threads within any block given IDs in the range ([0…X, 0…Y]), where blockDim = [X+1, Y+1], or instead

[0…(blockIdx.x * blockDim.x + threadIdx.x),0…(blockIdx.y * blockDim.y + threadIdx.y)]?

I mean, if we have a grid of two blocks, each of 2x2 threads, will the indices of the threads be:

------------+

(0,0) (1,0) | (2,0) (3,0)

(0,1) (1,1) | (2,1) (3,1)

------------+

or instead

------------+

(0,0) (1,0) | (0,0) (1,0)

(0,1) (1,1) | (0,1) (1,1)

------------+

And hence, what is the role of blockIdx.x * blockDim.x and blockIdx.y * blockDim.y?

[codebox]__global__ void matAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}[/codebox]
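For what it's worth, the guide pairs this kernel with a launch that fixes the grid and block shapes, and that launch is what gives blockIdx and blockDim their values inside the kernel. Roughly (a sketch following the guide's choice of 16x16 blocks):

[codebox]// Host-side launch: the grid/block shape chosen here is what
// blockIdx and blockDim refer to inside the kernel.
dim3 dimBlock(16, 16);                            // each block is 16x16 threads
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,   // enough blocks to cover
             (N + dimBlock.y - 1) / dimBlock.y);  // all N x N elements
matAdd<<<dimGrid, dimBlock>>>(A, B, C);[/codebox]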

Page 11

Can the code on the CPU and GPU be interleaved? If not, why? And what effect does a time-sharing operating system have on the execution of our GPU application, especially if our application is interrupted, say, by a higher-priority application?

Page 13, Section 3.1

I quote: "The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors."

I am somewhat confused after reading this. Can the execution of blocks be interleaved within the SM, or must a block terminate its execution before a new one is launched? The above paragraph tends to confirm the second, which confuses me, as I read that each SM can accept and schedule up to 768 threads at once.

Thanks for your help.

In your first example you have an array of N elements. You have N threads working simultaneously, and each thread has a distinct ID between 0 and N-1. In this case, the block dimension is N and blockIdx.x is 0.


Your second example is a little different. Say you have a 51x51 matrix and your blockDim is 5x5, so you need 11 blocks in each dimension. But when you write int i = blockIdx.x * blockDim.x + threadIdx.x, the variable i can reach blockIdx.x (10) * blockDim.x (5) + threadIdx.x (4) = 54, which is bigger than your matrix allows! So if you add the conditional "if (i < N && j < N)", you won't have any problem.
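A sketch of the arithmetic:

[codebox]int N = 51;                       // matrix is N x N
dim3 dimBlock(5, 5);              // 5x5 threads per block
int blocksPerDim = (N + 4) / 5;   // ceil(51/5) = 11 blocks per dimension
// The last block starts at column 10 * 5 = 50, so its threads compute
// i = 50..54, but valid indices are only 0..50. The guard
// "if (i < N && j < N)" makes the out-of-range threads do nothing.[/codebox]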


If you don't understand, or if you have more questions, ask me.

Many thanks, tatou1234, for your message. However, your answer did not respond exactly to my questions; see above.

There is no "association" of input data. CUDA is not like DirectX or a "streaming" architecture; it's simpler. What you see is simply assignment. (In DirectX, each thread got its own special bundle of data. Not so in CUDA. Each thread gets identical parameters and has to fetch all other input from memory. To do this, it takes its position in the grid and figures out where the data is based on that. In the example you gave, the data is laid out very simply.)
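A tiny sketch of that point (the kernel name is mine): every thread is called with the identical argument, and the built-in index is the only thing that tells threads apart:

[codebox]// Every thread of this kernel receives the identical argument 'out';
// threadIdx.x is the only thing that distinguishes one thread from another.
__global__ void whoAmI(int* out)
{
    out[threadIdx.x] = threadIdx.x;   // thread k writes k into slot k
}
// After whoAmI<<<1, 8>>>(d_out), d_out holds {0,1,2,3,4,5,6,7}.[/codebox]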

What you've got is the index of the thread (which starts from 0 for each new block) and the index of the block. There is no variable that holds the overall index of the thread in terms of the whole grid. You have to combine the thread index and the block index to get the overall, unique index; i and j are those overall indices. (Whether you need an "overall index" depends on your algorithm, of course.)
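To tie this to your diagrams: your second picture is what threadIdx gives you, and adding blockIdx.x * blockDim.x produces your first picture. A sketch (names are mine) for your exact case, a grid of two blocks of 2x2 threads:

[codebox]// Grid of 2 blocks along x, each 2x2 threads: gridDim = (2,1).
// Each thread records both its per-block index and its combined index.
__global__ void indexDemo(int* tx, int* gx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // combined (global) x
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // combined (global) y
    int slot = j * (gridDim.x * blockDim.x) + i;    // row-major position
    tx[slot] = threadIdx.x;   // restarts at 0 in every block: 0 1 0 1
    gx[slot] = i;             // unique across the grid:       0 1 2 3
}
// Launch: dim3 grid(2, 1); dim3 block(2, 2);
//         indexDemo<<<grid, block>>>(d_tx, d_gx);[/codebox]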

Yes, code can be interleaved. The CPU and GPU operate independently, and rejoin only at specific events (such as calling cudaThreadSynchronize or doing a cudaMemcpy). If the OS displaces your CPU thread with another one, then the worst that will happen is that, when the CPU thread returns to activity, the GPU may have finished its work a while ago.
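A sketch of that overlap, reusing the names from the vecAdd sketch earlier in the thread (doSomeCpuWork is a hypothetical placeholder):

[codebox]// Kernel launches return immediately, so the CPU can do other work
// while the GPU computes; they meet again at the synchronization point.
vecAdd<<<1, N>>>(dA, dB, dC);   // asynchronous: returns at once

doSomeCpuWork();                // hypothetical CPU-side function,
                                // runs while the GPU is busy

cudaThreadSynchronize();        // CPU blocks here until the GPU is done
cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);[/codebox]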

Yes, often blocks can be interleaved on one SM. This depends on the resources (registers, shared memory) that one block needs. If two blocks can fit, two will run. When one of those finishes, another takes its place.