How to quickly identify if the boxes of a triangle and a circle are overlapped

Here what I want to do is to identify the box of which triangle in a set of 2D triangles is overlapped with the box of a circle. The triangle that is not overlapped will be assigned to -1, otherwise to its own index.

The triangles’ vertices are stored in a thrust::device_vector. the d_pTriXys is the raw pointer to the vector. Each triangles’ 3 indices pointing to index of vertex array are stored in another device vector. d_pTriIndices is a raw pointer to this vector. triNum is the count of triangle.

The code is done in visual studio C++ and it works. But the performance seems to be an issue. It takes about 3.5 seconds for the case that has 2225 circles being checked with 5.5 million triangles, 2.8M vertices. It is not as fast as I wished. Does anybody have any idea to improve the algorithm for better performance?

I tried to change the register count, threadsPerBlock. right now the setting seems to be the best.

// RangeAlgo.h

  struct tagXy_f
     float x;
     float y;

CUDA_HOST_DEV __forceinline bool IsCircleAndTriangleBoxesSeparated(const tagXy_f triVertices[3],
     const tagXy_f &circleCen, float circleRad)

     float t = min(triVertices[0].x, min(triVertices[1].x, triVertices[2].x));
     if (circleCen.x <= t - circleRad)
        return true;

     t = max(triVertices[0].x, max(triVertices[1].x, triVertices[2].x));
     if (circleCen.x >= t + circleRad)
        return true;

     t = min(triVertices[0].y, min(triVertices[1].y, triVertices[2].y));
     if (circleCen.y <= t - circleRad)
        return true;

     t = max(triVertices[0].y, max(triVertices[1].y, triVertices[2].y));
     return (circleCen.y >= t + circleRad);


  __global__ void MarkOutTrianglesKernel(const uint3 *d_pTriIndices, unsigned triNum,
     const float2 *d_pTriXys, const tagXy_f &d_circleCen, float circleRad,
     int *d_pValidTriIndices)
     unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;

     if (tid < triNum)
        const uint3 &triInds = d_pTriIndices[tid];
        // to local variable/registers, make it faster
        const float2 triVertices[3] = { d_pTriXys[triInds.x],
        d_pTriXys[triInds.y], d_pTriXys[triInds.z] };

        bool b = IsCircleAndTriangleBoxesSeparated(reinterpret_cast<const tagXy_f *>(triVertices),
            d_circleCen, circleRad);

        d_pValidTriIndices[tid] = b ? -1 : (int)tid;

void main()
     // here is the sample code to check one circle with all triangles
     uint3 *d_pTriIndices;      // indices of all triangles
     unsigned triNum;           // count of triangles
     const float2 *d_pTriXys;   // all vertices of triangles
     tagXy_f d_circleCen;
     float circleRad;
     int *d_pValidTriIndices;


     int threadsPerBlock = 256;
     int blocksPerGrid = (int)((triNum + threadsPerBlock - 1) / threadsPerBlock);

     MarkOutTrianglesKernel<<<blocksPerGrid, threadsPerBlock>>>(d_pTriIndices, triNum,
        d_pTriXys, d_circleCen, circleRad, d_pValidTriIndices);


Is this algorithm intended to be approximate? It doesn’t appear to be exactly correct in all cases.

Hi Robert, you are right. it is intended to be approximate. I just want a quick method to identify each triangle whose CONTAINING BOX is overlapped with circle. I’ll correct the function name and the title.

Since all vertices and triangle indices are stored in global memory, I strongly doubt the bad performance is because randomly access global memory. But don’t how to avoid the access or if there is other problems in the code.

If containing box is anchored on any of the triangle vertices, I don’t think it does that either. For example, your algorithm would declare overlap in this setting, when there is none either for the triangle or its bounding box:

External Media

The reason for my persistence is that I claim that there is no point optimizing an algorithm that does not give you the correct answer. If you want to arbitrarily say “this is correct”, then proceed. But it’s not clear to me that the code does what you say it does.

you are right again! but here I just need a super faster method to identify the triangles that are nearby or overlapped with the circle. so any function that use multiplication, sqrt(…) are not used. There will be further processing for these triangles.

The triangles that need further processing is down from 5.5M to 2K level after calling this function in my case. I profiles the code, this triangle marking function takes 3.5" and the further processing takes only 0.6". So I made this marking function as simple as I can to speed up the whole process. But unfortunately, it is still not faster enough.

Without a complete case to test, it’s hard to proceed further. I would focus next on this line:

const float2 triVertices[3] = { d_pTriXys[triInds.x],
        d_pTriXys[triInds.y], d_pTriXys[triInds.z] };

The indirect loading process you are using implies that triInds.x,.y,.z may be referring to 3 different places in memory. In the general case that will result in quite inefficient loading.

You might want to learn how to use a profiler, this is an easy thing to confirm.

If you can rearrange your triangle indices so that the 3 indices associated with a triangle are grouped together/adjacent, you can probably improve that load process. Ideally you would want a list of triangles such that the indices per triangle are all adjacent, and adjacent threads are handling adjacent triangles (in the list). After that, you might want to recast your vertex storage so that instead of using AoS storage, you are using SoA storage for the x and y components of the vertices.

I wouldn’t be surprised if that one line of code is the biggest consumer of time.

here the triangle is a unit of STL model (a geometry model used in geometry modeling system). if all vertices of triangle are stored in one or multiple array(s) and all triangle’s indices to their vertices are stored in another array. even the triangles are stored in the order that they shares common edge. it is still avoidable that most of triangle’s vertices are stored in different places. And my triangle data is created by “third” part, it also will consume time to sort it in my side.

the vertices are 3D data, I already changed the code to use two arrays (XY & Z) to store them a while ago because the XY data is always accessed together.

Today I tried what you said to store the XY data in two arrays. it does reduce the running time by 0.6 second (the total time was 5"). but it is still not faster enough!

I do use code profiling, but still in the learning phase. do you know any good book or online documents about CUDA profiling in Visual Studio C++ environment?

when the routine is running, the Windows Task Manager shows the GPU usage is less than 10%. does it mean I still have long way to go to optimize this piece of code?

I’ll just repeat 2 things I said already, since I have nothing else to say here:

"Without a complete case to test, it’s hard to proceed further. "


“I wouldn’t be surprised if that one line of code is the biggest consumer of time.”

If you do a google search on “gtc cuda optimization visual studio” you will find other high-quality resources like that one. (Please, try it right now. The top 5 hits at least are all good.) People like Christoph Angerer, Julien Demouth, and Norbert Juffa are like wizards, geniuses, ninjas. It makes sense to climb their mountain, sit at their feet, and learn.

I personally would not attempt to judge the effectiveness or efficiency of a code based on a crude tool such as windows task manager or nvidia-smi. IMO the right way to make such judgements is with a profiler.

There is a mathematical geometry library called Wykobi and I believe it has a function for this. It has them for all kinds of different things. Linkage :