I have a question regarding the best practices for declaring fixed-size structs in a .cuh file for GPU programming. Specifically, I am working with a medium/small-sized struct, IntersectedTriangles, but I also need to handle larger structs.
Would it be more efficient to declare these structs with direct allocation or as pointers in terms of GPU performance and memory management?
Thank you for your time
Rafael Scatena
My .cuh file:
static const int pilhaPart = 3072; // 48*64
typedef struct {
    int64_t id[30][pilhaPart];
    int Body[30][pilhaPart];
    int Material[30][pilhaPart];
    Point intersection_point[30][pilhaPart];
    double distance[30][pilhaPart];
    int intersectCount[pilhaPart];
} IntersectedTriangles;
extern __device__ IntersectedTriangles* d_intersectedTriangles; // As a pointer
or:
extern __device__ IntersectedTriangles d_intersectedTriangles; // For writable memory
Optimize so that threads within a warp access consecutive data.
In your case, if they access consecutive triangles along the pilhaPart index, the layout is fine as it is.
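As a sketch of why this layout coalesces well (the kernel and its parameters are hypothetical, not from your code): if each thread handles one slot along the pilhaPart dimension, neighbouring threads in a warp touch adjacent addresses.

```cuda
// Hypothetical kernel: consecutive threads read consecutive pilhaPart
// slots, so each warp's load covers one contiguous run of memory.
__global__ void processLevel(const IntersectedTriangles* tris,
                             double* out, int level)
{
    int slot = blockIdx.x * blockDim.x + threadIdx.x; // 0 .. pilhaPart-1
    if (slot < pilhaPart) {
        // tris->distance[level][slot] and tris->distance[level][slot + 1]
        // are adjacent in memory, so the warp access is coalesced.
        out[slot] = tris->distance[level][slot];
    }
}
```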
Advanced
If the access pattern is more random, then the optimum in terms of performance would probably be to make each triangle a nice size (32 bytes), keep the member variables of a triangle together, and move the arrays outside of the triangle (i.e. a 2D array of triangles).
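A sketch of such a 32-byte record (field widths are illustrative; hitting exactly 32 bytes means narrowing some members, e.g. float instead of double, which may or may not be acceptable for your precision needs):

```cuda
// Hypothetical 32-byte per-triangle record: all members of one triangle
// sit together, and the arrays move outside the struct.
struct __align__(32) TriangleHit {
    int64_t id;          //  8 bytes
    int     body;        //  4 bytes
    int     material;    //  4 bytes
    float   px, py, pz;  // 12 bytes - intersection point, single precision
    float   distance;    //  4 bytes -> 32 bytes total
};

// 2D array of triangles instead of a struct of 2D arrays:
__device__ TriangleHit d_hits[30][pilhaPart];
```

The power-of-two stride is what makes the cooperative loading below work cleanly: a group of threads can split one record into equal 4-, 8-, or 16-byte pieces.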
Then groups of 8 threads would cooperate: each one chooses a random triangle, then they loop from 0 to 7, and in each iteration one triangle is cooperatively loaded and stored into shared memory. Afterwards each thread fetches back its own triangle from shared memory.
You could also (perhaps better) use groups of 4 or 2 threads instead of 8 and load the data with 64-bit or 128-bit accesses (e.g. uint2, uint4).
With groups of two threads you do not need shared memory; you can just use shuffle instructions to exchange data with the paired thread.
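A sketch of the two-thread variant (the record type `Tri` and the index array are stand-ins, not your types): each pair cooperatively loads its two randomly indexed 32-byte records with 128-bit accesses, then swaps halves with shuffles, with no shared memory involved.

```cuda
// Hypothetical 32-byte record; __align__(16) allows uint4 (128-bit) loads.
struct __align__(16) Tri { double d[4]; };

__global__ void pairGather(const Tri* tris, const int* indices, Tri* out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 1;                   // position within the pair
    int myIdx      = indices[tid];
    int partnerIdx = __shfl_xor_sync(0xFFFFFFFFu, myIdx, 1);
    int idx0 = lane ? partnerIdx : myIdx;         // thread 0's record
    int idx1 = lane ? myIdx      : partnerIdx;    // thread 1's record

    // Each thread reads half `lane` of both records: per record the pair
    // issues one contiguous 32-byte (2 x 128-bit) load.
    const uint4* r0 = reinterpret_cast<const uint4*>(&tris[idx0]);
    const uint4* r1 = reinterpret_cast<const uint4*>(&tris[idx1]);
    uint4 keep = lane ? r1[lane] : r0[lane];      // half of my own record
    uint4 give = lane ? r0[lane] : r1[lane];      // half owed to my partner

    uint4 recv;                                   // swap halves via shuffle
    recv.x = __shfl_xor_sync(0xFFFFFFFFu, give.x, 1);
    recv.y = __shfl_xor_sync(0xFFFFFFFFu, give.y, 1);
    recv.z = __shfl_xor_sync(0xFFFFFFFFu, give.z, 1);
    recv.w = __shfl_xor_sync(0xFFFFFFFFu, give.w, 1);

    uint4 halves[2];
    halves[lane]     = keep;                      // the half I loaded myself
    halves[lane ^ 1] = recv;                      // the half I received
    out[tid] = *reinterpret_cast<Tri*>(halves);
}
```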
If each pointed-to struct is 453 MB and you do not have lots of them, loading the pointers is probably L1-cached, since every thread needs the same pointers.
So that would be okay-ish.
But it would be simpler if it is just contiguous memory and the address can be calculated.
Or you could pass all the pointers by value (e.g. as a struct of pointers, or individually) as kernel parameters.
With that, CUDA can use the constant memory path to transfer them to each thread.
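A sketch of that approach (the struct name, field set, and kernel are hypothetical): bundling the device pointers into a struct passed by value means they arrive via the kernel-parameter constant path, so no thread has to load them from global memory first.

```cuda
// Hypothetical bundle of device pointers, passed by value to the kernel.
struct KernelArgs {
    int64_t* id;
    int*     body;
    int*     material;
    double*  distance;
    int*     intersectCount;
};

__global__ void traceKernel(KernelArgs args)  // by-value: constant path
{
    int slot = blockIdx.x * blockDim.x + threadIdx.x;
    if (slot < pilhaPart)
        args.intersectCount[slot] = 0;  // pointer came in as a parameter
}

// Host side (sketch): cudaMalloc each array, fill a KernelArgs,
// then launch: traceKernel<<<grid, block>>>(args);
```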