 # Help! I don't understand the threading model

Hi everyone,

I’m going to admit I’m quite confused by all of this. I understand the premise of parallel programming, and do all of mine via OpenMP (so far). I’ve been going through the examples and I haven’t really found an answer to most of my questions. For instance, let’s say I have the following loop

``````
for(int i = 0; i < numpts; i++)
{
    float dist = distarray[i]*2;
    for(int j = 0; j < numpts2; j++)
    {
        nk[j].x = cos(5.0f * j)*dist;
    }
}
``````

Now, I would think to split this up as follows (based on openmp principles)

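``````
__global__ void
EDCalc( float2* dnk, float* distarray, int ptsperthread, int numpts2 )
{
    const unsigned int tid = threadIdx.x;
    // Each thread starts at its own offset into the output array
    float2* nk = dnk + tid*ptsperthread;
    for(int i = 0; i < ptsperthread; i++)
    {
        // Assumed: dist comes from this thread's slice of distarray,
        // mirroring distarray[i]*2 in the serial loop
        float dist = distarray[tid*ptsperthread + i]*2;
        for(int j = 0; j < numpts2; j++)
        {
            nk[j].x = cos(5.0f * j)*dist;
        }
    }
}
``````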

where numpts2 >> ptsperthread, and I would call it by

``````
//Memory allocation not shown
const unsigned int num_threads = 256;
dim3 grid(1, 1, 1);
``````

Please ignore any optimizations; I’m just trying to get a proof of concept down. In OpenMP, you would always parallelize the outer loop and leave the inner loop alone. My gut feeling here is that you want to somehow split the inner loop off into threads as well, but I have no idea how to go about doing that. Also, apart from setting num_threads, how do you decide the values of the other parameters in grid and threads? For the purposes of this question, please assume the arrays are very large and numpts2 is large. Thanks for any help.

Can you clarify the goal of the original version of the example?

``````
for(int i = 0; i < numpts; i++)
{
    float dist = distarray[i]*2;
    for(int j = 0; j < numpts2; j++)
    {
        nk[j].x = cos(5.0f * j)*dist;
    }
}
``````

This appears to overwrite the values nk[j].x on every pass of the loop. It looks like only the final iteration in i has any effect after this function exits. (This seems picky, but I don’t want to give some crazy advice regarding this example if I don’t understand the output you want from it.)

Hi seibert, in the actual program, there is more that depends on the value of nk[j] but I’ve left that out to try and keep it simple. Thanks for any help. I did find some NVIDIA presentations which were fairly illuminating and helped quite a bit, but if you could offer your advice on this, I would very much appreciate it. Thanks

Would it be easier if I just posted a complete (small) example?

Yes, that is always smart to do. A general observation:

• You want to use lots of blocks, so maybe compute tid = blockIdx.x * blockDim.x + threadIdx.x and get rid of the first for loop.

But as said, you look to be overwriting values, so it’s best to post a complete example, so we can give advice on how to spread it over the grid and blocks.
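The tid suggestion above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the kernel name scaleDist, the output array dist2, and the host helper scaleDistOnGpu are all made up, and the kernel only does the distarray[i]*2 part of the loop so that every thread writes to its own element.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per value of i, replacing the outer for loop.
__global__ void scaleDist(const float* distarray, float* dist2, int numpts)
{
    // Global index across all blocks, as suggested above.
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid is rounded up to whole blocks, so guard the tail.
    if (tid < numpts)
        dist2[tid] = distarray[tid] * 2.0f;
}

// Hypothetical host helper: copies the input over, launches, copies the result back.
void scaleDistOnGpu(const float* h_in, float* h_out, int numpts)
{
    float* d_in = nullptr;
    float* d_out = nullptr;
    const size_t bytes = numpts * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                              // threads per block
    const int blocks = (numpts + threads - 1) / threads;  // round up
    scaleDist<<<blocks, threads>>>(d_in, d_out, numpts);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Note that there is no loop over i left in the kernel; the launch configuration supplies the iteration space instead. Error checking is left out to keep the sketch short.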

Regardless of your specific example: the general threading model of CUDA is called “data parallel”. You write one piece of code (the kernel) which is executed tens of thousands of times (or many more), each time on a separate piece of data.

So, the general threading model is to assign each thread to a single array element. The simplest situation is where each output can be computed independently of all the others, but there are cases where dependencies among the outputs can be handled (e.g. reduce, scan).

It is very different from OpenMP, where you break a loop up into a number of chunks equal to the number of CPU cores on the system. In CUDA you split the loop in a similar way, except that each chunk shrinks to a single array element handled by its own thread.
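As a hedged sketch of "one thread per array element" applied to the nested loop from the first post: the version below gives every (i, j) pair its own thread. It assumes a hypothetical output array out with numpts * numpts2 elements, so that nothing is overwritten; that layout, the kernel name edCalcPerElement, and the helper edCalcOnGpu are not from the original post.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per (i, j) pair. Row i of out holds the
// values computed from distarray[i] (an assumed layout, see the text above).
__global__ void edCalcPerElement(float2* out, const float* distarray,
                                 int numpts, int numpts2)
{
    const size_t idx   = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    const size_t total = (size_t)numpts * numpts2;

    if (idx < total) {
        const int i = (int)(idx / numpts2);   // former outer-loop index
        const int j = (int)(idx % numpts2);   // former inner-loop index
        const float dist = distarray[i] * 2.0f;
        out[idx].x = cosf(5.0f * j) * dist;
    }
}

// Hypothetical host helper; error checking omitted for brevity.
void edCalcOnGpu(float2* h_out, const float* h_dist, int numpts, int numpts2)
{
    const size_t total = (size_t)numpts * numpts2;
    float2* d_out = nullptr;
    float* d_dist = nullptr;
    cudaMalloc(&d_out, total * sizeof(float2));
    cudaMalloc(&d_dist, numpts * sizeof(float));
    cudaMemset(d_out, 0, total * sizeof(float2));  // .y is never written by the kernel
    cudaMemcpy(d_dist, h_dist, numpts * sizeof(float), cudaMemcpyHostToDevice);

    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((total + threads - 1) / threads);
    edCalcPerElement<<<blocks, threads>>>(d_out, d_dist, numpts, numpts2);

    // Blocking copy; it also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, total * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_dist);
}
```

Both loops disappear from the kernel. Whether this is the right decomposition depends on what the real program does with nk[j] afterwards, which the thread does not show.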

And the appropriate number of threads often has nothing to do with the number of computational elements on the GPU. Operating system threads have non-trivial overhead, so on a CPU you generally want # of threads ~= # of CPUs, unless most of the threads are going to be blocked on I/O. In CUDA, the ideal number of threads is usually tens or hundreds of times larger than the number of stream processors. Threads have very nearly zero overhead in CUDA, which is why the number of threads scales with the amount of data, not the number of processors in your GPU.
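For the earlier question about how to pick the grid and threads values, a common pattern consistent with the paragraph above is to fix the block size and derive the number of blocks from the data size. The sketch below is only an illustration: blocksFor and describeLaunch are made-up helper names, and 256 threads per block is just a typical starting value, not a recommendation from this thread.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: smallest block count with blocks * threadsPerBlock >= n.
inline unsigned int blocksFor(size_t n, unsigned int threadsPerBlock)
{
    return (unsigned int)((n + threadsPerBlock - 1) / threadsPerBlock);
}

// Prints how a data-sized launch compares with the hardware on device 0.
void describeLaunch(size_t n, unsigned int threadsPerBlock)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return;
    }
    const unsigned int blocks = blocksFor(n, threadsPerBlock);
    printf("%zu elements -> %u blocks of %u threads; device has %d multiprocessors\n",
           n, blocks, threadsPerBlock, prop.multiProcessorCount);
}
```

Calling describeLaunch(1 << 20, 256) would report 4096 blocks for about a million elements, which is normally far more blocks than the device has multiprocessors. That oversubscription is intended: the launch is sized by the data, and the hardware schedules the blocks as resources free up.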

(This is mostly because a thread in CUDA is a lot simpler than an operating system thread. A thread in CUDA consists of a “register context” and not much else. Moreover, the register allocation for a thread is persistent for the lifetime of the block, so there is no context-switching overhead like on a CPU. There is no stack, and the program counter lives at the warp level.)