Thread processing overhead


I have a problem with allocating memory and writing to it from threads.
My situation:

I want to write a CUDA acceleration structure.
I start with a bunch of triangles, say 17 tris.
The task is to compute the bounding boxes of these tris.
I allocate tri_count * 2 * sizeof(float3) bytes of memory, which I want the kernel threads to write into.
Now I launch 4 blocks with 5 threads each, every thread handling one tri (the configuration is just a simple example).

The problem is that the allocated memory does not match the thread count: there are more threads (20) than tris (17), so the writes of the last 3 threads are out of bounds.
What is the best way to avoid this?


Similar to this (from the CUDA C Programming Guide, Version 3.2):

__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

where you pass N (17 in your example) to the kernel like this:

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
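Applied to your bounding-box case, a minimal sketch could look like the following. The Triangle struct, the names tri_count / d_tris / d_boxes, and the min/max box layout are assumptions for illustration, not your actual code:

```cuda
// Hypothetical triangle type; your real data layout may differ.
struct Triangle { float3 v0, v1, v2; };

__global__ void ComputeBounds(const Triangle* tris, float3* boxes, int tri_count)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= tri_count)   // guard: the surplus threads simply return
        return;

    const Triangle t = tris[i];
    float3 lo, hi;
    lo.x = fminf(t.v0.x, fminf(t.v1.x, t.v2.x));
    lo.y = fminf(t.v0.y, fminf(t.v1.y, t.v2.y));
    lo.z = fminf(t.v0.z, fminf(t.v1.z, t.v2.z));
    hi.x = fmaxf(t.v0.x, fmaxf(t.v1.x, t.v2.x));
    hi.y = fmaxf(t.v0.y, fmaxf(t.v1.y, t.v2.y));
    hi.z = fmaxf(t.v0.z, fmaxf(t.v1.z, t.v2.z));

    boxes[2 * i]     = lo;  // min corner of tri i's bounding box
    boxes[2 * i + 1] = hi;  // max corner
}
```

On the host side you would typically round the block count up with a ceiling division, so the guard covers the leftover threads:

```cuda
int threadsPerBlock = 5;
int blocksPerGrid = (tri_count + threadsPerBlock - 1) / threadsPerBlock; // 17 tris -> 4 blocks
ComputeBounds<<<blocksPerGrid, threadsPerBlock>>>(d_tris, d_boxes, tri_count);
```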

Thanks, I did it this way.
I thought there would be another solution (built into CUDA), since it is such a common problem, but this one works fine too.
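For the record, there is a common CUDA idiom that decouples the launch configuration from the data size entirely: the grid-stride loop. A sketch under the same VecAdd setup (this is a general pattern, not something specific to your code):

```cuda
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    // Each thread strides through the data by the total thread count,
    // so any grid size handles any N without out-of-bounds writes.
    for (int i = blockDim.x * blockIdx.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
    {
        C[i] = A[i] + B[i];
    }
}
```

With this pattern the loop condition plays the role of the `if (i < N)` guard, and the same kernel works whether you launch fewer or more threads than elements.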