Hi guys. I’m new to CUDA, having first tried out OpenCL. I have a polygon intersection routine that tests whether two polygons intersect, touch (from the outside), or whether one lies inside the other. I also run these tests for a large number of offsets, so the total number of calculations can get very large (hence my idea of using the GPU). It’s meant to be a relatively easy introduction to GPU computing, to get my head around the concepts.

The pseudo c++ code is along the lines of:

```
// where polygon and offsets is a class/struct containing an array of Point2ds
struct Point2d
{
    double x, y;
};

for (int i = 0; i < polygon1.NumPoints()-1; i++)
{
    Point2d p1 = polygon1[i];
    Point2d p2 = polygon1[i+1];
    for (int j = 0; j < polygon2.NumPoints()-1; j++)
    {
        for (int k = 0; k < offsets.NumPoints(); k++)
        {
            // take fresh copies each time so the offsets don't accumulate
            Point2d q1 = polygon2[j];
            Point2d q2 = polygon2[j+1];
            Point2d offset = offsets[k];
            q1.TranslateBy(offset);
            q2.TranslateBy(offset);
            bool intersects = IntersectTest(p1, p2, q1, q2);
            // etc
        }
    }
}
```

and my CUDA code is something along the lines of:

```
__global__ void IntersectionKernel(double2 *polygon1, double2 *polygon2, double2 *offsets,
                                   int numPoints1, int numPoints2, int numOffsets,
                                   int *results)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < numPoints1-1 && j < numPoints2-1 && k < numOffsets)
    {
        double2 p1 = polygon1[i];
        double2 p2 = polygon1[i+1];
        double2 q1 = polygon2[j];
        double2 q2 = polygon2[j+1];
        double2 offset = offsets[k];
        q1.x += offset.x; q1.y += offset.y;
        q2.x += offset.x; q2.y += offset.y;
        bool intersects = IntersectTest(p1, p2, q1, q2);
        // blah blah
        // each (i, j, k) triple needs its own slot, so flatten the three indices
        int idx = (i * (numPoints2-1) + j) * numOffsets + k;
        results[idx] = some_result;
    }
}
```

(please ignore any typos - without a preview button, it’s difficult to make sure I haven’t made any silly mistakes)

So - one of the problems I face is that these polygons and offsets can be of any size. That is, polygon1 may contain 57 points, polygon2 may have 611 points, and there may be 113 offsets, say. Whenever I look at examples, they always use nice round numbers: 512, 64, 1024, etc.

I have compute capability 2.1

I’m not necessarily looking for the absolute optimum method (although please feel free to show me!), but for a good way of feeding values like these into CUDA code and executing it.

So, my question is: how do I launch this kernel? I can initialise, copy, and free memory; it’s the kernel launch itself, i.e. choosing the block/grid sizes, that I need help with.

Let’s take my above example:

numPoints1 = 57

numPoints2 = 611

numOffsets = 113

```
// now launch the kernel
// what numbers do I use here?
dim3 grid, block;
IntersectionKernel
    <<< grid, block /* <<----- i originally had these the other way round - grid comes first */
    >>> (polygon1, polygon2, offsets, numPoints1, numPoints2, numOffsets, results);
```

Of course I don’t want to run each calculation one at a time, but would like to run a reasonable number in parallel.

Any help on this would be really appreciated.