Hello guys,

My latest CUDA implementation counts the number of triangles that overlap each cell of a regular 3D grid. Each thread iterates through all triangles and keeps a counter of how many overlap its cell. I’m getting really low performance: about 12 ms for 2k triangles with 77 blocks of size 192x1. If I increase to 9k triangles and 286 blocks, it drops to 125 ms (!). I suspect I may be memory-bound, but who knows.

Since we don’t have 3D textures just yet, I decided to work with 1D textures, blocks and grids. Here is the complete kernel body:

```
// Step 1: compute this thread's cell index and its 3D coordinates
int cellIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Skip padding threads
if( cellIdx >= grid.gridSize.x * grid.gridSize.y * grid.gridSize.z )
    return;

float3 cellCoords = grid.to3dCoord( cellIdx );

// Step 2: iterate through all triangles and count how many intersect this cell
float3 cellMinv = grid.boxMin + cellCoords * grid.cellSize;
float3 cellMaxv = cellMinv + grid.cellSize;

// Output counter
float numOverlapTriangles = 0.0f;

__shared__ float3 s_triVertices[g_blockWidth];
#ifdef __DEVICE_EMULATION__
#define s_vertex(i) CUT_BANK_CHECKER( s_triVertices, i )
#else
#define s_vertex(i) s_triVertices[i]
#endif

// For each batch of vertices, count the triangles that overlap the current cell
for( unsigned int v = 0; v < vertexCount; v += g_blockWidth )
{
    // Each thread loads a single vertex into shared memory
    s_vertex( threadIdx.x ) = tex1Dfetch3( texVertices, v + threadIdx.x );

    // Sync before reading shared memory
    __syncthreads();

    // Loop over the shared vertices, three at a time (one triangle each)
    for( unsigned int i = 0; i < g_blockWidth; i += 3 )
    {
        float3 v0 = s_vertex( i );
        float3 v1 = s_vertex( i + 1 );
        float3 v2 = s_vertex( i + 2 );
        if( overlaps( cellMinv, cellMaxv, v0, v1, v2 ) )
            ++numOverlapTriangles;
    }

    // Sync before loading the next batch of vertices into shared memory
    __syncthreads();
}

// Step 3: write the number of overlapping triangles to the output
output[cellIdx] = numOverlapTriangles;
```
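The `overlaps()` routine isn’t shown above. For reference, here is a hypothetical host-side sketch of the cheap *conservative* variant: the triangle’s axis-aligned bounding box against the cell box. An exact version would use a full separating-axis triangle-box test, which is considerably more expensive; this one over-counts triangles that pass near a cell without touching it.

```cpp
// Conservative triangle-vs-cell overlap: triangle AABB vs. cell AABB.
// float3h is a hypothetical host-side stand-in for CUDA's float3.
#include <algorithm>

struct float3h { float x, y, z; };

bool overlaps( const float3h& bmin, const float3h& bmax,
               const float3h& v0, const float3h& v1, const float3h& v2 )
{
    using std::min; using std::max;

    // Triangle bounding box
    float tminx = min( v0.x, min( v1.x, v2.x ) ), tmaxx = max( v0.x, max( v1.x, v2.x ) );
    float tminy = min( v0.y, min( v1.y, v2.y ) ), tmaxy = max( v0.y, max( v1.y, v2.y ) );
    float tminz = min( v0.z, min( v1.z, v2.z ) ), tmaxz = max( v0.z, max( v1.z, v2.z ) );

    // Boxes overlap iff their intervals overlap on every axis
    return tminx <= bmax.x && tmaxx >= bmin.x
        && tminy <= bmax.y && tmaxy >= bmin.y
        && tminz <= bmax.z && tmaxz >= bmin.z;
}
```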

I am aware that this version is unsafe in the last block: the divergent early return at the beginning means some threads never reach the `__syncthreads()` calls.
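One way around that hazard would be to keep every thread alive through the loop and predicate only the per-cell work and the final store, so all threads reach every `__syncthreads()`. A sketch reusing the names above (`totalCells` is a stand-in for the grid-size product; the tail batch would still need its inner loop bounded by the remaining vertex count to avoid counting stale triangles):

```
bool valid = cellIdx < totalCells;
float numOverlapTriangles = 0.0f;
for( unsigned int v = 0; v < vertexCount; v += g_blockWidth )
{
    // All threads participate in the load; clamp the index instead of skipping
    unsigned int src = min( v + threadIdx.x, vertexCount - 1 );
    s_vertex( threadIdx.x ) = tex1Dfetch3( texVertices, src );
    __syncthreads();

    if( valid )
    {
        for( unsigned int i = 0; i < g_blockWidth; i += 3 )
            if( overlaps( cellMinv, cellMaxv, s_vertex(i), s_vertex(i+1), s_vertex(i+2) ) )
                ++numOverlapTriangles;
    }
    __syncthreads();
}
if( valid )
    output[cellIdx] = numOverlapTriangles;
```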

But my main question is: am I doing something dumb here, or are these performance figures about right? Is there a faster way to loop over hundreds of thousands of floats in texture (or global) memory? Or is it just too much for CUDA to handle?