Threads beginner question

OK, I have read NVIDIA’s CUDA guide and I know how threads are grouped and so on. What confuses me is the thread and block indices (threadIdx.x etc.) and the number of threads that I have to use. Please give me an example of how to add 2 arrays or matrices. I have GPGPU experience with OpenGL + Cg.
Thank you!

The guide has a good example.

The SDK samples are all worth reading.

Take your time to read them all :)

Please, please, please write me a kernel (pseudocode will be fine) for adding 2 arrays. In OpenGL/Cg I had texture indices, and I don’t know how to do the same thing with CUDA. I don’t understand things like: C[threadIdx.x * blockIdx.x + blockDim.x]. Why like this? I couldn’t find an explanation of what values the block and thread IDs take.

Thank you

Take a look at this thread:

Ok…This is what bugs me:

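__global__ void add_arrays_gpu( float *in1, float *in2, float *out, int Ntot)
{
   /* global element index handled by this thread */
   int idx=blockIdx.x*blockDim.x+threadIdx.x;

   /* skip threads that fall past the end of the arrays */
   if ( idx <Ntot )
      out[idx]=in1[idx]+in2[idx];
}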

int idx=blockIdx.x*blockDim.x+threadIdx.x;

idx covers all elements of the arrays. Let’s assume that we have arrays of 16 elements, so the values of idx will be 1 2 3 4 5…16? Am I right? Can you explain why you are computing blockIdx.x*blockDim.x+threadIdx.x?

I am so stuck on this… :(

You need to map from a local index to a global index. You know how many blocks you have and how big each block is.

Let’s assume you have an array of 8 elements, and you are using 2 blocks with 5 threads each.

Block 0:
threadIdx.x= 0,1,2,3,4
idx will span: 0,1,2,3,4

Block 1:
threadIdx.x= 0,1,2,3,4
idx will span: 5,6,7,8,9

So idx covers the initial range (0 to 7) plus two extra values, 8 and 9. The check in the kernel (idx < Ntot) stops those extra threads from touching memory outside the arrays.
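The mapping above can be sketched as a small kernel plus a host helper. This is only a sketch under stated assumptions, not the code from the linked thread: the kernel body and the helper add_arrays are written for illustration, error checking is left out, and the block size of 5 just mirrors the numbers above (real code would normally use a multiple of 32, such as 256).

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per element; idx is the global index into the arrays.
__global__ void add_arrays_gpu(const float *in1, const float *in2, float *out, int Ntot)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < Ntot)  // threads past the end of the arrays do nothing
        out[idx] = in1[idx] + in2[idx];
}

// Hypothetical host helper (an assumption for this sketch): copy in, launch, copy out.
std::vector<float> add_arrays(const std::vector<float> &a,
                              const std::vector<float> &b,
                              int threadsPerBlock)
{
    assert(a.size() == b.size() && threadsPerBlock > 0);
    const int n = static_cast<int>(a.size());
    std::vector<float> result(n);
    if (n == 0) return result;

    const size_t bytes = n * sizeof(float);
    float *d_in1 = nullptr, *d_in2 = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in1, bytes);
    cudaMalloc((void **)&d_in2, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in1, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, b.data(), bytes, cudaMemcpyHostToDevice);

    // Round up so blocks * threadsPerBlock >= n (8 elements, 5 threads -> 2 blocks).
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_arrays_gpu<<<blocks, threadsPerBlock>>>(d_in1, d_in2, d_out, n);

    // This blocking copy runs after the kernel has finished.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in1);
    cudaFree(d_in2);
    cudaFree(d_out);
    return result;
}
```

Calling add_arrays(a, b, 5) on two 8-element vectors launches 2 blocks of 5 threads, so 10 threads run and the last two fail the idx < Ntot test.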

Oh, thanks a lot, man! It is so much clearer now! One more question: what are the values of threadIdx.y and blockIdx.y?

Edit: What are the local index and the global index? The thread index and the block/grid index?

This example is using a 1D decomposition, so threadIdx.y and blockIdx.y are both zero.

I am using global to refer to the index of the original problem, and local to refer to the index in the decomposed problem.
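For comparison, here is a rough sketch of a 2D decomposition, where threadIdx.y and blockIdx.y are no longer zero. It adds two matrices stored in row-major order. The names add_matrices_gpu and add_matrices, the row-major layout, and the 16x16 block size are all assumptions made for this illustration, and error checking is left out.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// 2D decomposition: the x index walks along columns, the y index along rows.
__global__ void add_matrices_gpu(const float *a, const float *b, float *c,
                                 int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {      // same range check, now in both directions
        int idx = row * cols + col;      // row-major linear index
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host helper (an assumption for this sketch).
std::vector<float> add_matrices(const std::vector<float> &a,
                                const std::vector<float> &b,
                                int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(a.size() == static_cast<size_t>(rows) * cols && b.size() == a.size());
    std::vector<float> result(a.size());

    const size_t bytes = a.size() * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,         // blocks needed across columns
              (rows + block.y - 1) / block.y);        // blocks needed across rows
    add_matrices_gpu<<<grid, block>>>(d_a, d_b, d_c, rows, cols);

    cudaMemcpy(result.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return result;
}
```

The pattern is the 1D formula applied once per direction: blockIdx.y*blockDim.y+threadIdx.y gives the global row in the same way that blockIdx.x*blockDim.x+threadIdx.x gives the global column.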

Tell me this:

If I have 64 blocks and 256 threads in total, I have 4 threads in each block, and I have allocated the arrays like this:

CUDA_SAFE_CALL(cudaMalloc((void**)&dinput, sizeof(float) * NUM_THREADS * 2));

CUDA_SAFE_CALL(cudaMalloc((void**)&doutput, sizeof(float) * NUM_BLOCKS));

CUDA_SAFE_CALL(cudaMalloc((void**)&dtimer, sizeof(clock_t) * NUM_BLOCKS * 2));

blockIdx.x = 0
threadIdx.x = 0 1 2 3

blockIdx.x = 1
threadIdx.x = 0 1 2 3

blockIdx.x = 2
threadIdx.x = 0 1 2 3

blockIdx.x = 3
threadIdx.x = 0 1 2 3

…

blockIdx.x = 63
threadIdx.x = 0 1 2 3

Is this correct???
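The listing above matches what a launch of 64 blocks with 4 threads each produces. One way to check a layout like this is to have every thread write its own indices into two arrays and inspect them on the host. The sketch below does that. The names record_indices, IndexDump and dump_indices are made up for this illustration, and error checking is left out.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Every thread stores the block and thread index it was given.
// No range check is needed because exactly numBlocks * threadsPerBlock
// slots are allocated, one per launched thread.
__global__ void record_indices(int *blockIds, int *threadIds)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    blockIds[idx] = blockIdx.x;
    threadIds[idx] = threadIdx.x;
}

// Hypothetical container for the results (an assumption for this sketch).
struct IndexDump {
    std::vector<int> blockIds;
    std::vector<int> threadIds;
};

// Hypothetical host helper: launches the kernel and copies both arrays back.
IndexDump dump_indices(int numBlocks, int threadsPerBlock)
{
    const int total = numBlocks * threadsPerBlock;
    IndexDump dump;
    dump.blockIds.resize(total);
    dump.threadIds.resize(total);
    if (total <= 0) return dump;

    const size_t bytes = total * sizeof(int);
    int *d_blockIds = nullptr, *d_threadIds = nullptr;
    cudaMalloc((void **)&d_blockIds, bytes);
    cudaMalloc((void **)&d_threadIds, bytes);

    record_indices<<<numBlocks, threadsPerBlock>>>(d_blockIds, d_threadIds);

    cudaMemcpy(dump.blockIds.data(), d_blockIds, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(dump.threadIds.data(), d_threadIds, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_blockIds);
    cudaFree(d_threadIds);
    return dump;
}
```

With dump_indices(64, 4), entry i of blockIds holds i / 4 and entry i of threadIds holds i % 4, which is the same pattern as the listing.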

What is a good example of using the threadIdx.y and blockIdx.y values in a kernel? (An SDK sample or something else.)

Edit: …a sample with a 2D decomposition