Questions about the NVIDIA programming model and GPU architecture (newbie in here...)

Hi :ph34r:

In reference to the document:…g_Guide_2.0.pdf

Page 5

I am having trouble understanding how threads are run. For instance, in the code below, since the function is a kernel function, the code will be executed by all the threads launched by the system. My understanding is that the compiler should associate each thread with its input data and possibly define where the results should be stored. As such, does the instruction int i = threadIdx.x; associate the input data i with the thread of index x, or should I rather say it assigns the value of threadIdx.x to the variable i? In this context, what does the instruction C[i] = A[i] + B[i] really mean as far as the data-thread dependence is concerned?

[codebox]__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}[/codebox]


Page 6-7

I am having problems understanding how the expressions for i and j are deduced in the code below (see codebox), i.e.

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

Are the threads within any block given IDs in the range [0…X, 0…Y], where blockDim = [X+1, Y+1], or instead in the range

[0…(blockIdx.x * blockDim.x + threadIdx.x), 0…(blockIdx.y * blockDim.y + threadIdx.y)]?

I mean, if we have a grid of two blocks, each of 2x2 threads, will the indices of the threads be:


(0,0) (1,0) | (2,0) (3,0)

(0,1) (1,1) | (2,1) (3,1)


or instead


(0,0) (1,0) | (0,0) (1,0)

(0,1) (1,1) | (0,1) (1,1)


and hence what is the role of blockIdx.x * blockDim.x or blockIdx.y * blockDim.y?

[codebox]__global__ void matAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}[/codebox]


Page 11

Can the code on the CPU and GPU be interleaved? If not, why? And what effect does a time-sharing operating system have on the execution of our GPU application, especially if our application is interrupted, say, by a higher-priority application?

Page 13, section 3.1

I quote: "The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors."

I am somewhat confused after reading this. Can the execution of blocks be interleaved within the SM, or must a block terminate its execution before a new one is launched? The above paragraph tends to confirm the second, which confuses me, as I read that each SM can accept and schedule up to 768 threads at once.

Thanks for your help :thumbup:

In your first example you have an array of range 0…N.
You have N threads working simultaneously, and each thread has a distinct ID between 0 and N-1. In this case, the block dimension is N and blockIdx.x is 0.

Your second example is a little different.
For example, you have a matrix (51x51) and your blockDim is (5x5), so you need 11 blocks in each dimension! But when you write int i = blockIdx.x * blockDim.x + threadIdx.x, the variable 'i' can be blockIdx.x (10) * blockDim.x (5) + threadIdx.x (4) = 54! That is bigger than your matrix size! So if you put in the conditional "if (i < N && j < N)", you haven't got any problem.

If you don't understand or you have more questions, ask me.

Many thanks 'tatou1234' for your message. However, your answer did not respond exactly to my questions; see above.

There is no "association" of input data. CUDA is not like DirectX or a "streaming" architecture. It's simpler. What you see is simply assignment. (In DirectX, each thread got its own special bundle of data. Not so in CUDA. Each thread gets identical parameters, and has to fetch all other input from memory. To do this it takes its position in the grid and figures out where the data is based on that. In the example you gave, the data is laid out very simply.)

What you’ve got is the index of the thread (this starts from 0 for each new block) and the index of the block. There is no variable that has the overall index of the thread in terms of the whole grid. You have to combine thread index and block index to get the overall, unique index. i and j are the overall indexes. (whether you need an “overall index” depends on your algorithm, of course.)

Yes, code can be interleaved. CPU and GPU operate independently, and rejoin only at specific events (such as calling cudaThreadSynchronize or doing cudaMemcpy). If your CPU thread is displaced by another one by the OS, then the worst that will happen is that when the CPU thread returns to activity, the GPU may have finished its work a while ago.
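A minimal host-side sketch of that interleaving (the kernel name "work" is hypothetical, and this needs nvcc and a GPU, so treat it as illustration only):

```cuda
__global__ void work(float* x) { x[threadIdx.x] *= 2.0f; }

int main()
{
    float* d;
    cudaMalloc((void**)&d, 32 * sizeof(float));

    work<<<1, 32>>>(d);       // launch returns immediately: asynchronous
    // ... the CPU is free to run its own code here, in parallel ...
    cudaThreadSynchronize();  // the CPU blocks here until the GPU is done

    cudaFree(d);
    return 0;
}
```

If the OS preempts the CPU thread between the launch and the synchronize, the GPU keeps running regardless; at worst the kernel finished long before the CPU thread gets scheduled again.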

Yes, often blocks can be interleaved on one SM. This depends on the resources (registers, shared memory) that one block needs. If two blocks can fit, two will run. When one of those finishes, another takes its place.