Hi all…

I have a problem which I am trying to solve on CUDA… its related to calculating space trajectories for interplanetary stuff…

The problem is as follows:

*any bold alphabet represents a float4 vector.*[b]

[/b]

Say, I have 30720 (just for this example) pairs of two position vectors **r1 , r2** + I have bunch of other float 30720 values which go along with each of these pairs.

Now there are two function calls…

the first call takes in a pair of **r1,r2** and its associated other values and calculates a parameter called n. If this parameter is greater than namx is made equal to nmax (set by the user stored in constant memory).

Now we define a parameter *i *which ranges from* 0 to 2*n+1*

the second function takes in the value of current **r1 and r2** and the *i* parameter and calculates **V1 and V2** which are basically the velocities of the space craft. Note that i ranges from 0 to 2*n+1 hence for given r1 and r2 we can have at maximum 2*nmax+1 values of **v1 and v2**. These can be done in parallel.

In a dummy code it looks like this…

```
30720 times:
FUNCTIONA(r1,r2,bunch of other values,n)
if(n>nmax)n=nmax;
for(i=0;i<2*n+1;i++){
FUNCTIONB(<b>r1,r2</b>,bunch of other values,i,<b>V1,V2)</b>
}
```

The above code repeats for each of the 30720 pair and hence its fully parallel. But as we can notice there is one more level of parallelism we an exploit that … given the n value after functiona we can evaluate n calls to functionb also in parallel. So at maximum the parallelism is 30720*nmax. But as the n value is differnet for each pair of r1 and r2 the actual parallelism will be less but we cannot know that before we execute functiona on each pair.

Now another solution strategy I came up with was instead doing the whole above process for 30720 threads in oner kernel launch. I can do two kernel launches…

kernel a = launches 30720 threads and evalutes the functionA… stores the delta = 0 or 1 in a 30720*namx lentght of array corresposding to n<nmax

kernel b = launches 30720*namx threads and does the function B but now not all threads will compute… some threads will be left ideal when the delta[tid]=0… I suspect upto 20 % of threads mite be idle in some cases but now we do have much more parrelism to exploit. But there is a catch witht his approach… now each 5 theads have to read the same value of r1 and r2 from the gloabal memory which will destroy th coalscing as I see it… I cant find a way to preserve coalscing and still achieve this much parallelism…

So I was wondering if any of you super brains can help me out here… on making this algorithm port in a best way to the gpu ? I would really appreaciate any imptu from any one :-)

Fell free to ask if you have confusuion on the algorithm…

Thanks,

Nitin