An example: each thread accesses one or more elements

ggeo · April 25, 2014, 12:17pm

Hello, I am a little confused right now and I was wondering if someone could give me a code example where each thread processes one element, and one where each thread processes several elements.

For example ,

int index = threadIdx.x + blockDim.x * blockIdx.x;
result[index] = a[index] + b[index];

Here each thread performs one addition and computes one element, right?

while (index < NbElements) {
    result[index] = a[index] + b[index];
    index += gridDim.x * blockDim.x;
}

And here? What is going on?

Also, when do we use one option or the other (each thread handling one element, or several)?
When we are out of resources?
Can you give me an example of all this?

Thank you very much!

pasoleatis · April 25, 2014, 12:52pm

In the second case, each thread computes about NbElements / (gridDim.x * blockDim.x) elements (some threads do one more when the division is not exact). The loop in each thread visits elements that are gridDim.x * blockDim.x apart from each other.
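To make the two options concrete, here is a minimal sketch of both kernels side by side. The kernel names, the bounds check in the first kernel, and the host helper runAdd are my own additions for illustration and are not from the posts above. Error checking of the CUDA calls is mostly omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Option 1: each thread computes at most one element.
// The bounds check (an addition for this sketch) protects against
// launches that have more threads than elements.
__global__ void addOnePerThread(const float* a, const float* b,
                                float* result, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n) {
        result[index] = a[index] + b[index];
    }
}

// Option 2: each thread starts at its own global index and then jumps
// ahead by the total number of launched threads until it passes n.
__global__ void addGridStride(const float* a, const float* b,
                              float* result, int n)
{
    int index  = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = gridDim.x * blockDim.x;
    while (index < n) {
        result[index] = a[index] + b[index];
        index += stride;
    }
}

// Hypothetical host helper: runs one of the two kernels and returns true
// only if every one of the n elements holds the expected sum.
bool runAdd(bool gridStride, int n, int blocks, int threadsPerBlock)
{
    std::vector<float> a(n), b(n), result(n, -1.0f);
    for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }

    size_t bytes = size_t(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dR = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dR, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    // Copy the -1 sentinel so untouched elements are detectable afterwards.
    cudaMemcpy(dR, result.data(), bytes, cudaMemcpyHostToDevice);

    if (gridStride) {
        addGridStride<<<blocks, threadsPerBlock>>>(dA, dB, dR, n);
    } else {
        addOnePerThread<<<blocks, threadsPerBlock>>>(dA, dB, dR, n);
    }
    bool ok = (cudaGetLastError() == cudaSuccess);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dR, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dR);

    for (int i = 0; i < n && ok; ++i) {
        ok = (result[i] == a[i] + b[i]);
    }
    return ok;
}
```

With this sketch, runAdd(true, 100, 1, 10) should cover all 100 elements with only 10 threads, while runAdd(false, 100, 1, 10) should leave elements 10 to 99 untouched, because the first kernel needs at least as many threads as elements.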

ggeo · April 25, 2014, 1:51pm

Let’s say that we have 100 elements and 10 threads.

I can’t understand this:

We start from thread 0

result[0] = a[0] + b[0]
index = 10

So, in the next loop :

result[10] = a[10] + b[10]
index = 20

…

result[90] = a[90] + b[90]
index = 100

How does each thread compute 10 elements? And also, we have 10 threads, so where does index 90 come from?

Thanks

pasoleatis · April 25, 2014, 2:19pm

The variable index is local to each thread. If you have 10 threads, then gridDim.x * blockDim.x = 10 and you have:

First iteration
thread 0 result[0] = a[0] + b[0]; index=10
thread 1 result[1] = a[1] + b[1]; index=11
thread 2 result[2] = a[2] + b[2]; index=12
thread 3 result[3] = a[3] + b[3]
thread 4 result[4] = a[4] + b[4]
thread 5 result[5] = a[5] + b[5]
thread 6 result[6] = a[6] + b[6]
thread 7 result[7] = a[7] + b[7]
thread 8 result[8] = a[8] + b[8]
thread 9 result[9] = a[9] + b[9]; index =19
Second iteration:
thread 0 result[10] = a[10] + b[10]; index=10
thread 1 result[11] = a[11] + b[11]; index=11
thread 2 result[12] = a[12] + b[12]; index =12
thread 3 result[13] = a[13] + b[13]
thread 4 result[14] = a[14] + b[14]
thread 5 result[15] = a[15] + b[15]
thread 6 result[16] = a[16] + b[16]
thread 7 result[17] = a[17] + b[17]
thread 8 result[18] = a[18] + b[18]
thread 9 result[19] = a[19] + b[19]; index =19

Late edit, fixed!

ggeo · April 25, 2014, 2:51pm

I think in the second loop the indices must be :

index = 20
index = 21
index = 22
…
index = 29

Also, we have 10 threads but I can see thread[19], for example… I am confused.

Finally, I can’t see how each thread computes 10 elements. I can see each thread computing one element!

vacaloca · April 26, 2014, 9:53pm

Perhaps you could modify this array-based example to help understand how the threads end up doing work:

https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialMultidimensionalKernelLaunch

The CUDA by Example book has an example that uses that approach in the context of calculating dot products in Chapter 5.

Edit: here are some examples of that modality:
http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-shared-sync.pptx

ggeo · April 28, 2014, 7:23am

OK, thanks for these links, but I still can’t properly understand what I wrote above.

pasoleatis · April 28, 2014, 9:17pm

You are right about the thread indices. I changed them by mistake. My example works for 10 threads and 100 elements.
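One way to check the 10 threads / 100 elements case directly is to have each thread record its own id in every element it visits. The sketch below does that; the kernel recordOwner and the helpers traceOwners and printTrace are hypothetical names introduced only for this illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Same loop structure as the while loop discussed above, but instead of
// adding two arrays, each thread writes its own global id into every
// element it visits.
__global__ void recordOwner(int* owner, int n)
{
    int tid    = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int index = tid; index < n; index += stride) {
        owner[index] = tid;
    }
}

// Hypothetical helper: returns, for each element, the id of the thread
// that processed it (-1 if no thread touched it).
std::vector<int> traceOwners(int n, int blocks, int threadsPerBlock)
{
    std::vector<int> owner(n, -1);
    size_t bytes = size_t(n) * sizeof(int);
    int* dOwner = nullptr;
    cudaMalloc(&dOwner, bytes);
    cudaMemcpy(dOwner, owner.data(), bytes, cudaMemcpyHostToDevice);
    recordOwner<<<blocks, threadsPerBlock>>>(dOwner, n);
    // The blocking copy waits for the kernel before reading the result.
    cudaMemcpy(owner.data(), dOwner, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOwner);
    return owner;
}

// Hypothetical helper: prints one line per thread listing its elements.
void printTrace(const std::vector<int>& owner, int totalThreads)
{
    for (int t = 0; t < totalThreads; ++t) {
        std::printf("thread %d:", t);
        for (int i = 0; i < (int)owner.size(); ++i) {
            if (owner[i] == t) std::printf(" %d", i);
        }
        std::printf("\n");
    }
}
```

Calling printTrace(traceOwners(100, 1, 10), 10) should list ten indices per thread, with thread 0 handling 0, 10, 20 and so on up to 90. There is never a thread 19: element 19 is handled by thread 9 in its second iteration.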

ggeo · April 29, 2014, 1:26pm

OK, thanks, I understand now.

Just one more thing: when do you use one case and when the other?

Is it a matter of resources?

pasoleatis · April 29, 2014, 11:47pm

In practice, more threads usually means better performance, but each problem (application) is unique in its own way, so one would have to test both cases.
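A rough way to test both cases is to time each launch with CUDA events. The sketch below is only an illustration: the helper names, the block size of 256, and the choice of 4 blocks per multiprocessor for the looping kernel are assumptions of mine, not recommendations from this thread, and the timings will vary by GPU and problem size.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOnePerThread(const float* a, const float* b,
                                float* r, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n) r[index] = a[index] + b[index];
}

__global__ void addGridStride(const float* a, const float* b,
                              float* r, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int index = threadIdx.x + blockDim.x * blockIdx.x;
         index < n; index += stride) {
        r[index] = a[index] + b[index];
    }
}

// Hypothetical helper: times a single launch in milliseconds.
float timeAdd(bool gridStride, const float* a, const float* b, float* r,
              int n, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    if (gridStride) addGridStride<<<blocks, threads>>>(a, b, r, n);
    else            addOnePerThread<<<blocks, threads>>>(a, b, r, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical driver: runs both options on n elements and prints timings.
bool compareOptions(int n)
{
    size_t bytes = size_t(n) * sizeof(float);
    float *a = nullptr, *b = nullptr, *r = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&r, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    int threads = 256;                                  // assumed block size
    int blocksOne = (n + threads - 1) / threads;        // cover every element
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksStride = 4 * prop.multiProcessorCount;    // assumed heuristic

    // Warm-up launch so one-time startup cost is not counted.
    addOnePerThread<<<blocksOne, threads>>>(a, b, r, n);
    cudaDeviceSynchronize();

    float msOne    = timeAdd(false, a, b, r, n, blocksOne, threads);
    float msStride = timeAdd(true,  a, b, r, n, blocksStride, threads);
    std::printf("one element per thread: %d blocks, %.3f ms\n", blocksOne, msOne);
    std::printf("loop in each thread:    %d blocks, %.3f ms\n", blocksStride, msStride);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaFree(a); cudaFree(b); cudaFree(r);
    return ok;
}
```

For example, compareOptions(1 << 20) prints one timing per option. A single measurement like this is noisy, so averaging over several launches would give a fairer comparison.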