I suggest you take a look at Thrust [1], which provides STL-like containers and algorithms for CUDA. Note that you can’t send a vector<Node *> over to the device unless those pointers correspond to Nodes stored on the device. In other words, you will need to transfer the nodes to the device first.
You don’t need to start from scratch; however, for best performance you might need to restructure your data layout. Specifically, instead of a so-called “array of structures” (AoS) (or equivalently, array of classes) it is generally better to have a “structure of arrays” (SoA). For example, instead of
[codebox]
struct Node
{
    float foo;
    int bar;
};
std::vector<Node> nodes;
[/codebox]
you would want to store the member variables in separate arrays, one array per member.
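A minimal host-side sketch of what that separate-array layout might look like, assuming the two-member Node from the post. The names NodesSoA and to_soa are hypothetical, chosen only for illustration; on the device the same idea would apply with one device array per member.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS element, as in the post.
struct Node
{
    float foo;
    int bar;
};

// SoA layout: one contiguous array per member variable.
// NodesSoA is a hypothetical name, not part of CUDA or Thrust.
struct NodesSoA
{
    std::vector<float> foo;
    std::vector<int> bar;

    // Both arrays must always have the same length.
    std::size_t size() const
    {
        assert(foo.size() == bar.size());
        return foo.size();
    }
};

// Convert an existing AoS container into the SoA layout.
NodesSoA to_soa(const std::vector<Node>& nodes)
{
    NodesSoA out;
    out.foo.reserve(nodes.size());
    out.bar.reserve(nodes.size());
    for (const Node& n : nodes)
    {
        out.foo.push_back(n.foo);
        out.bar.push_back(n.bar);
    }
    return out;
}
```

With this layout, a kernel that only needs foo can be handed the foo array alone and never touches bar.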
SoA is generally faster because consecutive threads then access consecutive elements of the same array, so memory operations are properly coalesced (uncoalesced memory accesses are much slower!). Also, if you only access some of the member variables in a given kernel, then it is wasteful to load the entire struct from memory.
Note that Thrust provides some support to make SoA look more like AoS. Here’s an example [2] of how thrust::zip_iterator can combine 3 separate float vectors into a virtual vector.
Some C++ features such as class methods aren’t officially supported in device code, but they are known to work.
In order to call the Node’s getX() method, you must declare it as a __device__ (or both __host__ and __device__) function in the class definition like so:
[codebox]
class Node
{
public:
    __device__
    float getX();
};
[/codebox]
This may require some rewriting of your code. Keep in mind that CUDA currently requires all device methods to be inlined, so you’ll need to declare the class methods you wish to call inline as well.
Thanks, I’ll see what I can achieve with this tip. :)
About Thrust v1.1: amazing high-level API you guys made! I was looking at the examples, and it’s simple and clean GPU programming!
I’m probably going to use it in some modules of this project, and in the next one for sure.
By the way, the software I’m working on is an application that makes 3D meshes grow as if you were filling them with air. There is no problem displacing the vertices with CUDA, but some of the implemented algorithms do extra work on the nodes (vertices) and sometimes delete the unnecessary ones to keep the mesh consistent, for example a vertex that ended up inside the mesh because of the applied deformation. All of this is already working and was programmed in OOP style, so I have the “mesh” class; inside it I have vector<Node *> nodes; inside the Node class I have class Point for the position and Vector for the normal, which finally end in simple attributes such as x, y, z. That’s why I should use whatever CUDA offers in terms of compatibility with classes.
Looking at the future work of this project, I might need to face the problem of working with dynamic arrays (vector< > or, in the worst case, a custom class) inside kernels. Is this bad for CUDA? I ask because all the examples I have read in the guides and the best practices doc use fixed-size memory. I need to be able to dynamically delete an element of the vector<> when necessary, or add one in other cases, which means memory dynamically added or freed using the vector class functions (add, remove) inside the kernel.
But rethinking now… the vector class is already implemented, so I guess its add and delete functions are only available on the host.
What do you guys think is the best approach?
Edit: I found this, and I think it is an implementation of a vector class for CUDA, i.e. a dynamic array class. What do you guys think of it?
Truly dynamic data structures are a pain in the butt in all kinds of parallel processing. Consider deleting an element in a vector - what happens to the indices of the following elements? This gets even nastier on GPUs since you can’t allocate or free memory from within kernels, only from the host. The way you can remove elements from a static vector is by flagging them as invalid within the kernel and then pruning them on the host side. Adding elements can be trickier: you’d probably need some fixed-size buffer allocated on the device where the threads would write, which would then be merged with the vector on the host. All of this requires post- or pre-processing of the data structure on the host, which means frequent memcpys, and it is exactly as slow as it sounds.
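The flag-then-prune idea can be sketched in plain host C++ as below. This is only an illustration under stated assumptions: the Vertex struct, the function names, and the "z < 0 means invalid" rule are all hypothetical, and the flagging loop merely stands in for what a kernel would do with one thread per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex type, for illustration only.
struct Vertex
{
    float x, y, z;
};

// Stand-in for the kernel: each "thread" i writes one flag and never
// touches the size of the vector. The validity rule (z >= 0) is made up.
std::vector<char> flag_valid(const std::vector<Vertex>& verts)
{
    std::vector<char> valid(verts.size());
    for (std::size_t i = 0; i < verts.size(); ++i)
        valid[i] = (verts[i].z >= 0.0f) ? 1 : 0;
    return valid;
}

// Host-side prune: copy only the flagged elements, preserving their order.
std::vector<Vertex> prune(const std::vector<Vertex>& verts,
                          const std::vector<char>& valid)
{
    assert(verts.size() == valid.size());
    std::vector<Vertex> kept;
    kept.reserve(verts.size());
    for (std::size_t i = 0; i < verts.size(); ++i)
        if (valid[i])
            kept.push_back(verts[i]);
    return kept;
}
```

Because the kernel step only writes flags, no thread ever has to shift indices or resize anything; all of that happens once, in the prune.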
You wouldn’t want to keep allocating and deallocating memory for every invalid/new element on any platform, even a CPU, and you should weigh it even more carefully on the GPU. Mallocs, reallocs and memcpys are always expensive.
Thrust has similar vector containers named thrust::host_vector and thrust::device_vector. Unfortunately, there is no way to resize() a vector in a kernel right now, so you’ll have to do that from the host.
The operation you want to perform is usually called “stream compaction” and Thrust provides several functions to do it. Here’s a simple example of stream compaction with Thrust.