ggeo
April 25, 2014, 12:17pm
1
Hello, I am a little confused right now, and I was wondering if someone could give me a code example where each thread processes one element, and one where each thread processes more than one element.
For example,
int index = threadIdx.x + blockDim.x * blockIdx.x;
result[index] = a[index] + b[index];
Here each thread performs one addition and computes one element, right?
while (index < NbElements) {
result[index] = a[index] + b[index];
index += gridDim.x * blockDim.x;
}
And here? What is going on?
Also, when do we use one option or the other (one element per thread, or more)?
When we run out of resources?
Can you give me an example of all this?
Thank you very much!
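In the second case each thread computes about NbElements/(gridDim.x * blockDim.x) elements (some threads handle one more when the division is not exact). The loop visits elements that are gridDim.x * blockDim.x apart from each other, which is the total number of threads in the grid.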
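To make the second case concrete, here is a minimal sketch of a complete program built around that loop. The kernel name `addStrided`, the host helper `runAddStrided`, the test data and the use of `cudaMallocManaged` are my own assumptions for illustration, not something from the original question.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Each thread starts at its own global index and then jumps ahead by the
// total number of threads in the grid until it runs past the end.
__global__ void addStrided(const float *a, const float *b, float *result, int NbElements)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    while (index < NbElements) {
        result[index] = a[index] + b[index];
        index += gridDim.x * blockDim.x;
    }
}

// Hypothetical host helper: runs the kernel on test data and returns the
// number of wrong elements, or -1 if a CUDA call failed.
int runAddStrided(int n, int blocks, int threadsPerBlock)
{
    assert(n > 0 && blocks > 0 && threadsPerBlock > 0);
    float *a = nullptr, *b = nullptr, *result = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&result, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(result);
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        a[i] = float(i);
        b[i] = 2.0f * float(i);
        result[i] = -1.0f;   // sentinel: shows up if an element is never written
    }

    addStrided<<<blocks, threadsPerBlock>>>(a, b, result, n);
    int mismatches = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;

    for (int i = 0; mismatches >= 0 && i < n; ++i)
        if (result[i] != a[i] + b[i]) ++mismatches;

    cudaFree(a); cudaFree(b); cudaFree(result);
    return mismatches;
}

int main()
{
    // 100 elements, 1 block of 10 threads: each thread handles 10 elements.
    printf("mismatches: %d\n", runAddStrided(100, 1, 10));
    // 1000 elements, 2 blocks of 64 threads (128 threads): 7 or 8 elements each.
    printf("mismatches: %d\n", runAddStrided(1000, 2, 64));
    return 0;
}
```

On a working CUDA device both lines should report zero mismatches, even though the second launch has far fewer threads than elements.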
ggeo
April 25, 2014, 1:51pm
3
Let's say that we have 100 elements and 10 threads.
I can’t understand this:
We start from thread 0
result[0] = a[0] + b[0]
index = 10
So, in the next iteration:
result[10] = a[10] + b[10]
index = 20
…
result[90] = a[90] + b[90]
index = 100
How does each thread compute 10 elements? And also, we have 10 threads, so where does the 90 come from?
Thanks
The variable index is local to each thread. If you have 10 threads, then gridDim.x * blockDim.x = 10 and you have:
First iteration
thread 0 result[0] = a[0] + b[0]; index=10
thread 1 result[1] = a[1] + b[1]; index=11
thread 2 result[2] = a[2] + b[2]; index=12
thread 3 result[3] = a[3] + b[3]
thread 4 result[4] = a[4] + b[4]
thread 5 result[5] = a[5] + b[5]
thread 6 result[6] = a[6] + b[6]
thread 7 result[7] = a[7] + b[7]
thread 8 result[8] = a[8] + b[8]
thread 9 result[9] = a[9] + b[9]; index =19
Second iteration:
thread 0 result[10] = a[10] + b[10]; index=20
thread 1 result[11] = a[11] + b[11]; index=21
thread 2 result[12] = a[12] + b[12]; index=22
thread 3 result[13] = a[13] + b[13]
thread 4 result[14] = a[14] + b[14]
thread 5 result[15] = a[15] + b[15]
thread 6 result[16] = a[16] + b[16]
thread 7 result[17] = a[17] + b[17]
thread 8 result[18] = a[18] + b[18]
thread 9 result[19] = a[19] + b[19]; index=29
Late edit, fixed!
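If it helps, the table can be checked on the device itself. The sketch below keeps the same loop but, instead of adding, records which thread touched each element and on which pass through the loop. The names `recordOwner` and `countTraceErrors` and the two bookkeeping arrays are hypothetical, made up for this illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Same loop shape as the vector addition, but it stores the thread id and
// the loop pass for every element visited.
__global__ void recordOwner(int *owner, int *pass, int NbElements)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int index = tid;        // local to this thread
    int iteration = 0;      // 0 = first pass, 1 = second pass, ...
    while (index < NbElements) {
        owner[index] = tid;
        pass[index] = iteration;
        index += gridDim.x * blockDim.x;
        ++iteration;
    }
}

// Hypothetical helper: prints the first printFirst elements and returns how
// many elements were NOT handled by thread (i % totalThreads) on pass
// (i / totalThreads), or -1 if a CUDA call failed.
int countTraceErrors(int n, int blocks, int threadsPerBlock, int printFirst)
{
    assert(n > 0 && blocks > 0 && threadsPerBlock > 0);
    int *owner = nullptr, *pass = nullptr;
    if (cudaMallocManaged(&owner, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&pass, n * sizeof(int)) != cudaSuccess) {
        cudaFree(owner); cudaFree(pass);
        return -1;
    }
    for (int i = 0; i < n; ++i) { owner[i] = -1; pass[i] = -1; }

    recordOwner<<<blocks, threadsPerBlock>>>(owner, pass, n);
    int errors = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;

    int totalThreads = blocks * threadsPerBlock;
    for (int i = 0; errors >= 0 && i < n; ++i) {
        if (i < printFirst)
            printf("result[%d] <- thread %d, iteration %d\n", i, owner[i], pass[i]);
        if (owner[i] != i % totalThreads || pass[i] != i / totalThreads)
            ++errors;
    }
    cudaFree(owner); cudaFree(pass);
    return errors;
}

int main()
{
    // 10 threads, 100 elements; print the first two passes (elements 0..19).
    int errors = countTraceErrors(100, 1, 10, 20);
    printf("errors: %d\n", errors);
    return 0;
}
```

With 10 threads and 100 elements, thread t ends up owning elements t, t + 10, ..., t + 90, which is where the 10 elements per thread come from.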
ggeo
April 25, 2014, 2:51pm
5
I think in the second loop the indices must be :
index = 20
index = 21
index = 22
…
index = 29
Also, we have 10 threads, but I can see thread[19], for example… I am confused.
Finally, I can’t see how each thread computes 10 elements. I can see each thread computing only one element!
Perhaps you could modify this array-based example to help understand how the threads end up doing work:
[url]https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialMultidimensionalKernelLaunch[/url]
The CUDA by Example book has an example that uses that approach in the context of calculating dot products in Chapter 5.
Edit: here are some examples of that modality:
[url]http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-shared-sync.pptx[/url]
ggeo
April 28, 2014, 7:23am
7
OK, thanks for the links, but I still can’t properly understand what I wrote above.
ggeo:
I think in the second loop the indices must be :
index = 20
index = 21
index = 22
…
index = 29
Also, we have 10 threads, but I can see thread[19], for example… I am confused.
Finally, I can’t see how each thread computes 10 elements. I can see each thread computing only one element!
You are right about the thread indices. I changed them by mistake. My example works for 10 threads and 100 elements.
ggeo
April 29, 2014, 1:26pm
9
OK, thanks, I understand now.
Just one thing: when do you use one case and when the other?
Is it a matter of resources?
In practice, more threads usually means better performance, but each problem (application) is unique, so one would have to test both cases.
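A rough way to test both cases is to time them side by side. The sketch below is only a starting point under my own assumptions: the names `addOnePerThread`, `addStrided` and `runVariant`, the problem size, and the block counts are all made up, and the timings will differ from one GPU to another.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// One element per thread: needs enough threads to cover every element.
__global__ void addOnePerThread(const float *a, const float *b, float *result, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n)   // guard: the last block may have surplus threads
        result[index] = a[index] + b[index];
}

// Several elements per thread: works with any grid size.
__global__ void addStrided(const float *a, const float *b, float *result, int n)
{
    for (int index = threadIdx.x + blockDim.x * blockIdx.x; index < n;
         index += gridDim.x * blockDim.x)
        result[index] = a[index] + b[index];
}

// Hypothetical helper: times one variant, writes the elapsed milliseconds to
// *ms and returns the number of wrong elements (-1 if a CUDA call failed).
int runVariant(bool onePerThread, int n, int threadsPerBlock, int stridedBlocks, float *ms)
{
    assert(n > 0 && threadsPerBlock > 0 && stridedBlocks > 0 && ms != nullptr);
    float *a = nullptr, *b = nullptr, *result = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&result, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(result);
        return -1;
    }
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = float(i % 1000); result[i] = -1.0f; }

    // One thread per element rounds the block count up to cover all of n.
    int blocks = onePerThread ? (n + threadsPerBlock - 1) / threadsPerBlock : stridedBlocks;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Pass 0 is a warm-up so one-time costs (context setup, managed-memory
    // migration) are not part of the measurement; pass 1 is timed.
    for (int rep = 0; rep < 2; ++rep) {
        cudaDeviceSynchronize();
        cudaEventRecord(start);
        if (onePerThread) addOnePerThread<<<blocks, threadsPerBlock>>>(a, b, result, n);
        else              addStrided<<<blocks, threadsPerBlock>>>(a, b, result, n);
        cudaEventRecord(stop);
    }
    bool ok = cudaDeviceSynchronize() == cudaSuccess &&
              cudaEventElapsedTime(ms, start, stop) == cudaSuccess;

    int mismatches = ok ? 0 : -1;
    for (int i = 0; mismatches >= 0 && i < n; ++i)
        if (result[i] != a[i] + b[i]) ++mismatches;

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(result);
    return mismatches;
}

int main()
{
    const int n = 1 << 20;
    float msOne = 0.0f, msStrided = 0.0f;
    int badOne = runVariant(true, n, 256, 1, &msOne);
    int badStrided = runVariant(false, n, 256, 32, &msStrided);
    printf("one per thread: %d mismatches, %.3f ms\n", badOne, msOne);
    printf("strided:        %d mismatches, %.3f ms\n", badStrided, msStrided);
    return 0;
}
```

Varying the hypothetical `stridedBlocks` value (32 here) and the problem size is probably the quickest way to see which variant suits a particular application and GPU.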