Example: each thread accessing one or more elements

Hello, I am a little confused right now and wanted to ask if someone can give me a code example where each thread processes one element, and one where each thread processes several elements.

For example,

int index = threadIdx.x + blockDim.x * blockIdx.x;
result[index] = a[index] + b[index];

Here each thread performs one addition and computes one element, right?
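To make the one-element-per-thread case concrete, here is a minimal host-side sketch in plain C++ that simulates the launch on the CPU, so it runs without a GPU. The names addOneElement and simulateOnePerThread are hypothetical, and the bounds check is an addition that the snippet above does not have; it is assumed to be needed whenever the grid holds more threads than the array has elements.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for the kernel body: one call = one GPU thread.
void addOneElement(int threadIdxX, int blockIdxX, int blockDimX,
                   const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& result) {
    int index = threadIdxX + blockDimX * blockIdxX;  // same formula as in the kernel
    // Guard (an assumption, not in the original snippet): skip threads past the end.
    if (index < static_cast<int>(result.size())) {
        result[index] = a[index] + b[index];         // exactly one element per thread
    }
}

// Simulates launching gridDimX blocks of blockDimX threads, one thread after another.
std::vector<float> simulateOnePerThread(int gridDimX, int blockDimX,
                                        const std::vector<float>& a,
                                        const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> result(a.size(), 0.0f);
    for (int block = 0; block < gridDimX; ++block)
        for (int thread = 0; thread < blockDimX; ++thread)
            addOneElement(thread, block, blockDimX, a, b, result);
    return result;
}
```

With 100 elements, simulateOnePerThread(4, 32, a, b) uses 128 simulated threads and fills every element, while simulateOnePerThread(1, 10, a, b) fills only the first 10, because each thread handles a single index.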

while (index < NbElements) {
    result[index] = a[index] + b[index];
    index += gridDim.x * blockDim.x;
}

And here? What is going on?

Also, when do we use one option or the other (each thread computing one element, or several)?
Is it when we are out of resources?
Can you give me an example of all these?

Thank you very much!

In the second case each thread computes about NbElements / (gridDim.x * blockDim.x) elements (some threads compute one more when the division is not exact). Each thread's loop visits elements that are gridDim.x * blockDim.x apart from each other.
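The count per thread can be checked with a small host-side sketch in plain C++ that replays the loop for a single thread. The function name indicesForThread is hypothetical; startIndex stands in for threadIdx.x + blockDim.x * blockIdx.x and totalThreads for gridDim.x * blockDim.x.

```cpp
#include <cassert>
#include <vector>

// Returns the indices one thread visits in the strided loop.
// startIndex   ~ threadIdx.x + blockDim.x * blockIdx.x
// totalThreads ~ gridDim.x * blockDim.x
std::vector<int> indicesForThread(int startIndex, int totalThreads, int nbElements) {
    assert(totalThreads > 0);
    std::vector<int> visited;
    int index = startIndex;          // private to this thread
    while (index < nbElements) {
        visited.push_back(index);    // the kernel would compute result[index] here
        index += totalThreads;       // jump ahead by the size of the whole grid
    }
    return visited;
}
```

For 100 elements and 10 threads, thread 0 visits 0, 10, 20, and so on up to 90, which is 10 elements. For 25 elements and 10 threads, thread 3 visits 3, 13, 23 (three elements) while thread 7 visits only 7 and 17 (two elements).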

Let's say that we have 100 elements and 10 threads.

I can’t understand this:

We start from thread 0

result[0] = a[0] + b[0]
index = 10

So, in the next iteration:

result[10] = a[10] + b[10]
index = 20

...

result[90] = a[90] + b[90]
index = 100

How does each thread compute 10 elements? And also, we have 10 threads, so what about the 90?

Thanks

The variable index is local to each thread. If you have 10 threads, then gridDim.x * blockDim.x = 10 and you have:

First iteration:
thread 0 result[0] = a[0] + b[0]; index=10
thread 1 result[1] = a[1] + b[1]; index=11
thread 2 result[2] = a[2] + b[2]; index=12
thread 3 result[3] = a[3] + b[3]
thread 4 result[4] = a[4] + b[4]
thread 5 result[5] = a[5] + b[5]
thread 6 result[6] = a[6] + b[6]
thread 7 result[7] = a[7] + b[7]
thread 8 result[8] = a[8] + b[8]
thread 9 result[9] = a[9] + b[9]; index=19
Second iteration:
thread 0 result[10] = a[10] + b[10]; index=20
thread 1 result[11] = a[11] + b[11]; index=21
thread 2 result[12] = a[12] + b[12]; index=22
thread 3 result[13] = a[13] + b[13]
thread 4 result[14] = a[14] + b[14]
thread 5 result[15] = a[15] + b[15]
thread 6 result[16] = a[16] + b[16]
thread 7 result[17] = a[17] + b[17]
thread 8 result[18] = a[18] + b[18]
thread 9 result[19] = a[19] + b[19]; index=29

Late edit, fixed!
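The same trace can be reproduced for all threads at once with a host-side sketch in plain C++. It runs the simulated threads one after another on the CPU, which is an assumption made for illustration only (on the GPU they run in parallel), and the function name ownerOfEachElement is hypothetical. It records which thread computes each element.

```cpp
#include <cassert>
#include <vector>

// For each element, records the global id of the simulated thread that computes it.
std::vector<int> ownerOfEachElement(int gridDimX, int blockDimX, int nbElements) {
    assert(gridDimX > 0 && blockDimX > 0);
    std::vector<int> owner(nbElements, -1);          // -1 = not computed by any thread
    const int stride = gridDimX * blockDimX;         // total number of threads
    for (int block = 0; block < gridDimX; ++block) {
        for (int thread = 0; thread < blockDimX; ++thread) {
            const int globalId = thread + blockDimX * block;
            int index = globalId;                    // each thread has its own copy
            while (index < nbElements) {
                owner[index] = globalId;             // result[index] is computed here
                index += stride;
            }
        }
    }
    return owner;
}
```

With ownerOfEachElement(1, 10, 100), elements 0, 10, 20, ..., 90 all belong to thread 0 and element 19 belongs to thread 9, so there are still only 10 threads, and each one ends up with 10 elements.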

  1. I think in the second iteration the indices must be:

index = 20
index = 21
index = 22
...
index = 29

Also, we have 10 threads, but I can see thread[19], for example. I am confused.

  2. Finally, I can’t see how each thread computes 10 elements. I can see each thread computing only one element!

Perhaps you could modify this array-based example to help understand how the threads end up doing work:

https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialMultidimensionalKernelLaunch

The CUDA by Example book has an example that uses that approach in the context of calculating dot products in Chapter 5.

Edit: here are some examples of that modality:
http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-shared-sync.pptx

OK, thanks for these links, but I still can’t understand what I wrote above.

You are right about the thread indices. I changed them by mistake. My example works for 10 threads and 100 elements.

OK, thanks, I understand now.

Just one thing: when do you use one case and when the other?

Is it a matter of resources?

In practice, more threads usually means better performance, but each problem (application) is different, so you would have to test both cases.