Fast texture reading fast memory reading would suffice!

Hello guys,

My latest CUDA implementation is trying to count the number of triangles that overlap each cell of a regular grid in 3D. Each thread iterates through all triangles and store a counter with the number of overlaps. I’m getting really low performance, like 12ms for 2k triangles with about 77 blocks of size 192x1. If I increase to 9k triangles and 286 blocks, performance drops to 125ms (!). I suspect I may be memory-bound, but who knows.

Since we don’t have 3D textures just yet, I have decided to work with 1d textures, blocks and grids. Here is the complete kernel:

// Step 1: compute cell 1D and 3D coordinates

int cellIdx = blockIdx.x * blockDim.x + threadIdx.x;

// Skip padding threads

if( cellIdx >= grid.gridSize.x * grid.gridSize.y * grid.gridSize.z )


float3 cellCoords = grid.to3dCoord( cellIdx );

// Step 2: iterate through all triangles and count how many intersect this cell

float3 cellMinv = grid.boxMin + cellCoords * grid.cellSize;

float3 cellMaxv = cellMinv + grid.cellSize;

// Output counter

float numOverlapTriangles = 0.0f;

__shared__ float3 s_triVertices[g_blockWidth];


    #define s_vertex(i) CUT_BANK_CHECKER( s_triVertices, i )


    #define s_vertex(i) s_triVertices[i]


// For each triangle, check if it overlaps current cell

for( unsigned int v = 0; v < vertexCount; v+=g_blockWidth )


    // Each thread loads a single vertex to shared memory

    s_vertex(threadIdx.x) = tex1Dfetch3( texVertices, v + threadIdx.x );

   // Sync before processing shared memory


   // Loop over shared vertices

    for( unsigned int i = 0; i < g_blockWidth; i+=3 )


        float3 v0 = s_vertex(i);

        float3 v1 = s_vertex(i+1);

        float3 v2 = s_vertex(i+2);

       if( overlaps( cellMinv, cellMaxv, v0, v1, v2 ) )



   // Sync before loading another batch of vertices to shared memory



// Step 3: write number of overlapping triangles in output

output[cellIdx] = numOverlapTriangles;

I am aware that this version may skip the entire last block due to _syncthreads() + divergent branch at the beginning.

But my main question is: am I doing something dumb here? Or the performance figures are right? Is there a faster way to loop over hundreds of thousands of floats in texture memory (or global)? Or is it just too much for CUDA to handle?

Ah, the old partitioning problem. Impossible to do efficiently in parallel, to my knowledge, and I’ve been trying to think of a solution for ~1 year now. My current code still does it on the CPU because it is faster (though I’m partitioning points, not triangles).

You’re looping over all the data in the most efficient method I am aware of. I think the problem you are running into is that the device only has so much memory bandwidth to give, and your algorithm, while parallel, is extremely memory intensive and scales as O(N_cells * N_triangles). Compare this to a simple non-parallel CPU version that loops over each triangle and adds the count to the corresponding, it only scales as O(N_triangles).

Despite the O(N)->O(N^2) algorithmic problem, it is always good to check and see how optimally you are using the device. Compute the effective GiB/s of memory transfer you are pulling from global memory and the effective GFLOP/s rate of your inner loop. The 8800 GTX should attain performances of 70 GiB/s and ~100 GFLOP/s.