I was just wondering about the best way to replace STL types like std::vector and std::list.
To be a little more specific, I have something like
std::vector<std::vector<std::list<struct>>>
Is there some “golden rule” for transforming such constructs into other data types, so that I can have them in global memory and work on the data with CUDA?
The closest thing to a golden rule is: “CUDA works best with contiguous reads of flat arrays of simple data types.” Applying this rule leads to several data structure decisions:
N-dimensional arrays should be represented as linear arrays with known strides for each dimension. Whether row-major or column-major order is better depends on your access pattern.
A struct of arrays is better than an array of structs. This ensures that when threads all access a particular struct member, the access is coalesced by the memory controller.
For Fermi-class devices with an L1 and L2 cache, violating the “struct of arrays” guideline is not as big of a deal because the cache will read contiguous chunks of memory even if the threads do not.
Unfortunately, it sounds like you have a 2D array of variable-length lists where the element is a struct. This is going to be hard to reconcile with the golden rule. What is the data access pattern like? How will threads in the same block vs. different blocks interact with this data structure?
The first two std::vectors are like the x/y-coordinates of a point,
the std::list is a list of neighbors of the current point,
and in the struct there are several properties of each neighbor.
At the moment, the calculation is done for each point, one after another. I want to do that in parallel with CUDA.
So… each thread should access a point and do some calculations for all the neighbors of that particular point.
I hope this helps.
But how should I represent the data in CUDA?
Should I have an array of structs with an array of structs and so on?
I can’t see another option at this point. Good to hear that the “golden rule” isn’t that important on Fermi, because I’m on that platform.
Something like this, for example:
struct tableEntry
{
    unsigned int index;
    float weight;
};

struct list
{
    tableEntry *tE;
};

struct vectorY
{
    list *l;
};

struct vectorX
{
    vectorY *vY;
} *vX;
Doesn’t look very satisfying…
The other thing is, I have to figure out how much memory to allocate for each struct, because not every point has the exact same number of neighbors… well, I think I’ll have to pad here…