Hi everyone,
I’ll admit I’m quite confused by all of this. I understand the premise of parallel programming, and so far I do all of mine via OpenMP. I’ve been going through the examples and haven’t really found an answer to most of my questions. For instance, let’s say I have a loop like this:
for(int i = 0; i < numpts; i++)
{
    float dist = distarray[i]*2;
    for(int j = 0; j < numpts2; j++)
    {
        nk[j].x = cos(5.0f * j)*dist;
    }
}
Now, I would think to split this up as follows (based on OpenMP principles):
__global__ void
EDCalc( float2* dnk, float* distarray, int ptsperthread, int numpts2 )
{
    // One thread per chunk of ptsperthread outer iterations
    const unsigned int tid = threadIdx.x;
    // Parameter renamed to dnk so it no longer clashes with the local nk
    float2* nk = dnk + tid*ptsperthread;
    for(int i = 0; i < ptsperthread; i++)
    {
        float dist = distarray[tid*ptsperthread + i];
        for(int j = 0; j < numpts2; j++)
        {
            nk[j].x = cos(5.0f * j)*dist;
        }
    }
}
where numpts2 >> ptsperthread, and I would call it like this:
//Memory allocation not shown
const unsigned int num_threads = 256;
dim3 grid(1, 1, 1);
dim3 threads(num_threads, 1, 1);
EDCalc<<<grid, threads >>> (nk, distarray, ptsperthread, numpts2);
Please ignore any optimizations, I’m just trying to get a proof of concept down. In OpenMP, you would generally parallelize the outer loop and leave the inner loop alone. My gut feeling here is that you want to somehow split the inner loop across threads as well, but I have no idea how to go about doing that. Also, apart from setting num_threads, how do you decide the values of the other parameters in grid and threads? For the purposes of this question, please assume the arrays are very large and numpts2 is large. Thanks for any help.
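For reference, one common way to spread the inner loop across threads as well is to give each thread one or more values of j and let a second grid dimension cover i. The sketch below is only an illustration under stated assumptions, not the code from this post: the kernel name EDCalc2D and the row-major numpts x numpts2 output layout are hypothetical, chosen so that no two threads write the same element, and the sizes in main are placeholders.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the post). Assumes a row-major
// numpts x numpts2 output so that no two threads write the same element.
__global__ void EDCalc2D(float2* out, const float* distarray,
                         int numpts, int numpts2)
{
    // Outer loop: values of i are spread over blockIdx.y.
    for (int i = blockIdx.y; i < numpts; i += gridDim.y)
    {
        float dist = distarray[i] * 2.0f;
        // Inner loop: values of j are spread over all threads along x.
        for (int j = blockIdx.x * blockDim.x + threadIdx.x;
             j < numpts2; j += gridDim.x * blockDim.x)
        {
            out[(size_t)i * numpts2 + j].x = cosf(5.0f * j) * dist;
        }
    }
}

int main()
{
    const int numpts  = 64;      // illustrative sizes only
    const int numpts2 = 10000;

    std::vector<float> h_dist(numpts, 1.0f);
    float*  d_dist = nullptr;
    float2* d_out  = nullptr;
    const size_t outBytes = (size_t)numpts * numpts2 * sizeof(float2);
    cudaMalloc(&d_dist, numpts * sizeof(float));
    cudaMalloc(&d_out, outBytes);
    cudaMemcpy(d_dist, h_dist.data(), numpts * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, outBytes);  // .y is never written by the kernel

    const unsigned int num_threads = 256;
    dim3 threads(num_threads, 1, 1);
    // Enough blocks along x to cover numpts2 (rounded up), and one block
    // row per i, capped at 65535 (the grid.y limit on current hardware).
    unsigned int blocksX = (numpts2 + num_threads - 1) / num_threads;
    unsigned int blocksY = numpts < 65535 ? numpts : 65535;
    dim3 grid(blocksX, blocksY, 1);

    EDCalc2D<<<grid, threads>>>(d_out, d_dist, numpts, numpts2);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // run errors
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_dist);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

In this sketch the grid size follows from the problem size rather than being fixed at one block: the number of blocks along x is numpts2 divided by the block size, rounded up, and the strided loops keep the kernel correct even if the grid is smaller than the data. The block size of 256 is just the value from the post; the best choice depends on the device and the kernel.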