I have a problem with allocating memory and writing to it from multiple threads.
I want to build an acceleration structure in CUDA.
I start with a bunch of triangles; let's assume 17 tris.
The task is to compute the bounding boxes of these tris.
I allocate space for tri_count * 2 float3 values, which the kernel threads should write into.
Now I launch 4 blocks with 5 threads each, with every thread handling one tri (this configuration is just a simple example).
The problem is that the allocated memory does not match the thread count.
There are 20 threads but only 17 tris, so the writes of the last 3 threads are invalid (they fall outside the allocated memory).
What is the best way to avoid this?
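
To make the setup concrete, here is a minimal sketch of what is described above. It assumes the two float3 values per tri are the min and max corners of the box. The names `Tri`, `tri_bounds`, `blocks_for`, `f3_min` and `f3_max` are made up for illustration and the triangle data is dummy data. It also shows one commonly used way to deal with the surplus threads: pass tri_count to the kernel and let every thread whose index is >= tri_count return before it writes anything. This is only a sketch under those assumptions, not necessarily the best answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical triangle layout: three vertices per tri.
struct Tri { float3 v0, v1, v2; };

// Host helper: smallest block count such that blocks * threads >= n.
inline int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Component-wise min/max of two float3 values.
__device__ float3 f3_min(float3 a, float3 b) {
    return make_float3(fminf(a.x, b.x), fminf(a.y, b.y), fminf(a.z, b.z));
}
__device__ float3 f3_max(float3 a, float3 b) {
    return make_float3(fmaxf(a.x, b.x), fmaxf(a.y, b.y), fmaxf(a.z, b.z));
}

// One thread per tri; boxes holds 2 float3 per tri (assumed: min, then max).
__global__ void tri_bounds(const Tri* tris, float3* boxes, int tri_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= tri_count) return;  // surplus threads (17, 18, 19 here) do nothing
    Tri t = tris[i];
    boxes[2 * i]     = f3_min(f3_min(t.v0, t.v1), t.v2);
    boxes[2 * i + 1] = f3_max(f3_max(t.v0, t.v1), t.v2);
}

int main() {
    const int tri_count = 17;
    const int threads   = 5;
    const int blocks    = blocks_for(tri_count, threads);  // 4 blocks, 20 threads

    // Dummy input data, only for illustration.
    Tri h_tris[tri_count];
    for (int i = 0; i < tri_count; ++i) {
        float f = (float)i;
        h_tris[i].v0 = make_float3(f,        0.0f, 0.0f);
        h_tris[i].v1 = make_float3(f + 1.0f, 1.0f, 0.0f);
        h_tris[i].v2 = make_float3(f,        0.0f, 1.0f);
    }

    Tri*    d_tris  = nullptr;
    float3* d_boxes = nullptr;
    cudaMalloc(&d_tris,  tri_count * sizeof(Tri));
    cudaMalloc(&d_boxes, tri_count * 2 * sizeof(float3));
    cudaMemcpy(d_tris, h_tris, tri_count * sizeof(Tri), cudaMemcpyHostToDevice);

    tri_bounds<<<blocks, threads>>>(d_tris, d_boxes, tri_count);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float3 h_boxes[2 * tri_count];
    cudaMemcpy(h_boxes, d_boxes, sizeof(h_boxes), cudaMemcpyDeviceToHost);
    printf("last tri: min (%.1f %.1f %.1f) max (%.1f %.1f %.1f)\n",
           h_boxes[32].x, h_boxes[32].y, h_boxes[32].z,
           h_boxes[33].x, h_boxes[33].y, h_boxes[33].z);

    cudaFree(d_tris);
    cudaFree(d_boxes);
    return 0;
}
```

With the guard in place the allocation can stay at exactly tri_count * 2 float3 values. The alternative of padding the buffer up to the thread count would also prevent invalid writes, but it wastes memory and leaves unused entries at the end.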