Threads, blocks and grids. A beginner's guide!

Hi guys. I'm new to CUDA, having first tried out OpenCL. I have a polygon intersection routine that tests whether two polygons intersect each other, touch (from the outside), or whether one polygon lies inside the other. I also run these tests for a large number of offsets, so the number of calculations can get very high (hence my idea of using the GPU). It's meant to be a relatively easy introduction to GPU computing, to get my head around the concepts.

The pseudo C++ code is along the lines of:

// where polygon1, polygon2 and offsets are classes/structs containing an array of Point2ds

struct Point2d
{
  double x, y;
};

for (int i = 0; i < polygon1.NumPoints()-1; i++)
{
  Point2d p1 = polygon1[i];
  Point2d p2 = polygon1[i+1];

  for (int j = 0; j < polygon2.NumPoints()-1; j++)
  {
    Point2d q1 = polygon2[j];
    Point2d q2 = polygon2[j+1];

    for (int k = 0; k < offsets.NumPoints(); k++)
    {
      Point2d offset = offsets[k];

      // offset is applied to the q points before testing
      Point2d oq1 = { q1.x + offset.x, q1.y + offset.y };
      Point2d oq2 = { q2.x + offset.x, q2.y + offset.y };

      bool intersects = IntersectTest(p1, p2, oq1, oq2);

      // etc.
    }
  }
}

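For completeness, `IntersectTest` itself isn't shown above. A minimal, self-contained sketch of a segment-segment intersection test (the standard orientation/cross-product method; the naming and exact semantics here are my own guesses, not necessarily the actual routine) could look like:

```cpp
#include <algorithm>

struct Point2d { double x, y; };

// Cross product of (b-a) and (c-a): >0 = left turn, <0 = right turn, 0 = collinear.
static double Cross(const Point2d& a, const Point2d& b, const Point2d& c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// True if c lies within the bounding box of segment ab (assumes c is collinear with ab).
static bool OnSegment(const Point2d& a, const Point2d& b, const Point2d& c)
{
    return std::min(a.x, b.x) <= c.x && c.x <= std::max(a.x, b.x) &&
           std::min(a.y, b.y) <= c.y && c.y <= std::max(a.y, b.y);
}

// Do segments p1-p2 and q1-q2 intersect (including touching endpoints)?
bool IntersectTest(const Point2d& p1, const Point2d& p2,
                   const Point2d& q1, const Point2d& q2)
{
    double d1 = Cross(q1, q2, p1);
    double d2 = Cross(q1, q2, p2);
    double d3 = Cross(p1, p2, q1);
    double d4 = Cross(p1, p2, q2);

    // Proper crossing: each segment straddles the other's supporting line.
    if (((d1 > 0 && d2 < 0) || (d1 < 0 && d2 > 0)) &&
        ((d3 > 0 && d4 < 0) || (d3 < 0 && d4 > 0)))
        return true;

    // Collinear / touching cases.
    if (d1 == 0 && OnSegment(q1, q2, p1)) return true;
    if (d2 == 0 && OnSegment(q1, q2, p2)) return true;
    if (d3 == 0 && OnSegment(p1, p2, q1)) return true;
    if (d4 == 0 && OnSegment(p1, p2, q2)) return true;

    return false;
}
```

This particular version treats touching endpoints as an intersection; distinguishing "touching" from "crossing" and "one inside the other" (as described in the question) would need extra checks on top of this.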

and my CUDA code is something along the lines of:

__global__ void IntersectionKernel(double2 *polygon1, double2 *polygon2, double2 *offsets,
                                   int numPoints1, int numPoints2, int numOffsets,
                                   int *results)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.z * blockDim.z + threadIdx.z;

  if (i < numPoints1-1 && j < numPoints2-1 && k < numOffsets)
  {
    double2 p1 = polygon1[i];
    double2 p2 = polygon1[i+1];

    double2 q1 = polygon2[j];
    double2 q2 = polygon2[j+1];

    double2 offset = offsets[k];

    q1.x += offset.x;     q1.y += offset.y;
    q2.x += offset.x;     q2.y += offset.y;

    bool intersects = IntersectTest(p1, p2, q1, q2);

    // blah blah

    results[k] = some_result;  // careful: every (i, j) thread writes to the same results[k]
  }
}


(Please ignore any typos; without a preview button it's difficult to make sure I haven't made any silly mistakes.)

So, one of the problems I face is that these polygons and offsets can be of any size. That is, polygon1 may contain 57 points, polygon2 611 points, and there might be 113 offsets, say. Whenever I look at examples, they always use nice round numbers: 512, 64, 1024, etc.

I have compute capability 2.1

I'm not necessarily looking for the absolute optimum method (although please feel free to show me!), just a good way of feeding values like these into CUDA code and executing it.

So, my question is: how do I execute this code? I can initialise, copy and free memory; it's launching the kernel, i.e. choosing the block/grid size, that I need help with.

Let’s take my above example:

numPoints1 = 57
numPoints2 = 611
numOffsets = 113

// now execute the kernel

// what numbers do I use here?
dim3 block, grid;

IntersectionKernel
  <<< block, grid   /* <<----- is this even correct???? */
  >>> (polygon1, polygon2, offsets, numPoints1, numPoints2, numOffsets, results);

Of course I don't want to run each calculation one at a time, but would like to run a reasonable number in parallel.
Any help on this would be really appreciated.

With block(8,8,8) you can launch 512 threads per block in a 3D grid (3D grids require compute capability >= 2.x, which you have).
You then need to compute the correct number of blocks in each dimension, rounding up so the whole range is covered:

grid.x = numPoints1/block.x;  // +1 if (numPoints1 % block.x != 0)

and likewise for grid.y (from numPoints2) and grid.z (from numOffsets).

In your example the configuration will be:

IntersectionKernel<<< grid, block >>>(polygon1, polygon2, offsets, numPoints1, numPoints2, numOffsets, results);

(Note the order: grid first, then block, not the other way round.)

If I were doing this, I think I would make each block do one line of polygon1, each thread do one line of polygon2, and have the threads loop through the offsets. Try with the number of threads per block fixed at 512, but also try it fixed at, say, 32 and 128. If polygon2 has more lines than there are threads in a block, just have several blocks work on the same line of polygon1 but different ranges of lines from polygon2.
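A sketch of that mapping (my naming, untested; I've used a stride loop within the block as one simple way to handle polygon2 having more lines than threads, rather than the extra-blocks variant):

```cuda
// One block per line of polygon1, one thread per line of polygon2,
// each thread loops over all offsets.  Launch as e.g.
//   IntersectionKernel2<<<numPoints1 - 1, 128>>>(...)
__global__ void IntersectionKernel2(const double2 *polygon1, const double2 *polygon2,
                                    const double2 *offsets,
                                    int numPoints1, int numPoints2, int numOffsets,
                                    int *results)
{
    int i = blockIdx.x;               // line of polygon1 handled by this block
    double2 p1 = polygon1[i];
    double2 p2 = polygon1[i+1];

    // stride over lines of polygon2, in case there are more lines than threads
    for (int j = threadIdx.x; j < numPoints2 - 1; j += blockDim.x)
    {
        double2 q1 = polygon2[j];
        double2 q2 = polygon2[j+1];

        for (int k = 0; k < numOffsets; k++)
        {
            double2 o = offsets[k];
            double2 oq1 = make_double2(q1.x + o.x, q1.y + o.y);
            double2 oq2 = make_double2(q2.x + o.x, q2.y + o.y);

            if (IntersectTest(p1, p2, oq1, oq2))
                atomicOr(&results[k], 1);  // many threads may hit the same slot
        }
    }
}
```

Note the atomicOr: with this layout many (i, j) pairs share the same results[k], so plain assignment would race, just like in the original kernel.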

All the best