Integrate CUDA into an existing OOP application: re-design everything?

neoideo · September 14, 2009, 8:11pm

im having difficulties integrating CUDA into an already existing application which was written using object-oriented programming (OOP).

what is your experience integrating CUDA with this type of paradigm?

is there an easy way, or do we just need to re-design everything from scratch in the form of structs and functions?

for example i have the Node class, and wanted the kernel to receive an array of Nodes in the form of

kernel( vector<Node*> nodes, float displace ){
    ...
    ...
    float x = nodes->getX();
}

but the access to the method getX() threw errors in the nvcc compiler saying that the function is on the host side.

any feedback on this experience and the best way to solve it would be very useful.

thanks in advance

cristobal

nbell · September 14, 2009, 9:08pm

Hi Cristobal,

I suggest you take a look at Thrust [1], which provides STL-like containers and algorithms for CUDA. Note that you can’t send over a vector<Node *> to the device unless those pointers correspond to Nodes stored on the device. IOW you will need to transfer the nodes to the device first.

You don’t need to start from scratch, however for best performance you might need to restructure your data layout. Specifically, instead of a so-called “array of structures” (AoS) (or equivalently, array of classes) it is generally better to have a “structure of arrays” (SoA). For example, instead of

struct Node
{
   float foo;
   int bar;
};

std::vector<Node> nodes;

you would want to store the member variables in separate arrays

struct NodeVector
{
   std::vector<float> foo;
   std::vector<int> bar;
};

SoA is faster because memory operations are properly coalesced (uncoalesced memory accesses are much slower!). Also, if you only access some of the member variables in a given kernel, then it is wasteful to load the entire struct from memory.
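As a rough illustration of the layout change described above, here is a minimal host-side sketch. The helper names `to_soa` and `node_at` are made up for this example and are not part of Thrust or CUDA; the structs are the ones from the snippets above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures layout: each element carries all of its members.
struct Node
{
    float foo;
    int bar;
};

// Structure-of-arrays layout: one contiguous array per member.
struct NodeVector
{
    std::vector<float> foo;
    std::vector<int> bar;
};

// Hypothetical helper: split an AoS vector into SoA form on the host.
NodeVector to_soa(const std::vector<Node>& nodes)
{
    NodeVector out;
    out.foo.reserve(nodes.size());
    out.bar.reserve(nodes.size());
    for (const Node& n : nodes)
    {
        out.foo.push_back(n.foo);
        out.bar.push_back(n.bar);
    }
    return out;
}

// Hypothetical helper: rebuild a single Node from the SoA arrays,
// for code that still wants to see one element at a time.
Node node_at(const NodeVector& v, std::size_t i)
{
    assert(i < v.foo.size() && i < v.bar.size());
    return Node{v.foo[i], v.bar[i]};
}
```

After the split, `foo.data()` and `bar.data()` are plain contiguous arrays that can be copied to the device one member at a time, so a kernel that only needs `foo` never has to load `bar`.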

Note that Thrust provides some support to make SoA look more like AoS. Here’s an example [2] of how thrust::zip_iterator can combine 3 separate float vectors into a virtual vector.

[1] http://code.google.com/p/thrust/

[2] http://code.google.com/p/thrust/source/bro…cts_with_zip.cu

JaredHoberock · September 14, 2009, 9:10pm

Some C++ features such as class methods aren’t officially supported in device code, but they are known to work.

In order to call the Node’s getX() method, you must declare it as a __device__ (or both __host__ and __device__) function in the class definition like so:

[codebox]
class Node
{
public:
    __device__ float getX();
};
[/codebox]

This may require some rewriting of your code. Keep in mind that CUDA currently requires all device methods to be inlined, so you’ll need to declare the class methods you wish to call inline as well.
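A slightly fuller sketch of that idea, under some assumptions: the `HOST_DEVICE` macro, the `displaceX` method, and the `displaceKernel` / `displaceHost` functions are hypothetical names invented for this example. The macro makes the qualifiers disappear when the file is built by a plain C++ compiler, so the same class can be shared by host-only and CUDA translation units.

```cpp
#include <cassert>

// nvcc defines __CUDACC__; a plain C++ compiler does not, so there the
// qualifiers expand to nothing and the class still builds for the host.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

class Node
{
public:
    HOST_DEVICE Node(float x, float y, float z) : x_(x), y_(y), z_(z) {}

    // Defined inside the class body, so these are implicitly inline.
    HOST_DEVICE float getX() const { return x_; }
    HOST_DEVICE float getY() const { return y_; }
    HOST_DEVICE float getZ() const { return z_; }

    // Hypothetical mutator standing in for the "displace" step.
    HOST_DEVICE void displaceX(float d) { x_ += d; }

private:
    float x_, y_, z_;
};

#ifdef __CUDACC__
// Only compiled by nvcc. "nodes" must point to Node objects in device memory.
__global__ void displaceKernel(Node* nodes, int n, float displace)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        nodes[i].displaceX(displace);
}
#endif

// Host-side reference of the same per-element update, useful for checking results.
inline void displaceHost(Node* nodes, int n, float displace)
{
    assert(nodes != nullptr || n == 0);
    for (int i = 0; i < n; ++i)
        nodes[i].displaceX(displace);
}
```

Note that the kernel takes a flat `Node*` array rather than a `vector<Node*>`, since the objects themselves have to live in device memory.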

neoideo · September 15, 2009, 1:27pm

thanks, ill see what i can achieve with this tip. :)

about thrust v1.1: amazing high-level API you guys made! i was looking at the examples and it is simple and clean gpu programming!

im probably going to use it in some modules of this project, and in the next one for sure.

by the way, the software im working on is an application that makes 3D meshes grow as if you were filling them with air. there is no problem in displacing the vertices with CUDA, but some of the algorithms do extra work on the nodes (vertices) and sometimes delete the unnecessary ones to keep the mesh consistent, for example a vertex that ended up inside the mesh because of the deformation applied. all of this is already working and was programmed with OOP: i have the “mesh” class, inside it there is a vector<Node *> nodes, inside the Node class there is a class Point for the position and a Vector for the normal, which finally end in simple attributes such as x, y, z. so that's why i should use what i can of the compatibility between CUDA and classes.

looking at the future work of this project, i might need to face the problem of working with dynamic arrays (vector< > or in the worst case a custom class) inside kernels. is this bad for CUDA? i ask because all the examples i read in the guides and the best practices doc use fixed-size memory. i need to be able to dynamically delete an element of the vector<> when necessary, or add one in other cases, which means memory being added or freed inside the kernel through the vector class functions (add, remove).

but rethinking now… the vector class is already implemented, so i guess its add and delete functions are only available on the host.

what do you guys think of this? what is the best approach?

edit: i found this and i think it is an implementation of a vector class for CUDA, i think it's a dynamic array class. what do you guys think of it?

_Big_Mac · September 15, 2009, 3:40pm

Truly dynamic data structures are a pain in the butt in all kinds of parallel processing. Consider deleting an element in a vector - what happens to the indices of the following elements? This gets even nastier in GPUs since you can’t allocate or free memory from within kernels, only from host. The way you can remove elements from a static vector is by flagging them as invalid within the kernel and then pruning them on the host side. Adding elements can be trickier, you’d probably need some fixed-size buffer allocated on the device where the threads would write, which would be merged with the vector on the host. This all requires post- or pre-processing of the data structure on the host, requiring frequent memcpys and is exactly as slow as it sounds.

You wouldn’t want to keep allocating and deallocating memory for every invalid/new element on any platform, even a CPU, and you should be even more careful about it on the GPU. Mallocs, reallocs and memcpys are always expensive.
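The flag-then-prune approach can be sketched roughly as follows. This is host-only C++ under stated assumptions: the `Vertex` struct, the `valid` field, and the "negative z is invalid" rule are all invented for the example, and `flag_invalid` is an ordinary loop standing in for what a kernel would do with one thread per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat vertex record. "valid" is the flag a kernel would clear
// instead of erasing the element.
struct Vertex
{
    float x, y, z;
    int valid;  // 1 = keep, 0 = marked for removal
};

// Stand-in for the kernel: every index is handled independently and only
// writes its own flag, so no element ever moves during this step.
void flag_invalid(std::vector<Vertex>& verts)
{
    for (std::size_t i = 0; i < verts.size(); ++i)
        if (verts[i].z < 0.0f)
            verts[i].valid = 0;
}

// Host-side pruning: one pass with erase-remove instead of one erase per
// element. Returns how many vertices were removed.
std::size_t prune(std::vector<Vertex>& verts)
{
    const std::size_t before = verts.size();
    verts.erase(std::remove_if(verts.begin(), verts.end(),
                               [](const Vertex& v) { return v.valid == 0; }),
                verts.end());
    assert(verts.size() <= before);
    return before - verts.size();
}
```

Pruning once per batch keeps the cost to a single pass over the data; the surviving vertices keep their relative order, but their indices change, so anything that refers to vertices by index has to be updated afterwards.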

neoideo · September 16, 2009, 5:07pm

yeah, ill try to do what i can.

i was rechecking the algorithms and i think it is not necessary to add new vertices, so that is good.

did anyone check that vector class i linked in the previous post?

nbell · September 16, 2009, 11:51pm

Thrust has similar vector containers named thrust::host_vector and thrust::device_vector. Unfortunately, there is no way to resize() a vector in a kernel right now, so you’ll have to do that from the host.

The operation you want to perform is usually called “stream compaction” and Thrust provides several functions to do it. Here’s a simple example of stream compaction with Thrust.
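The linked Thrust example is not reproduced here. As a hedged illustration of the idea behind stream compaction, the following is a sequential C++ sketch of the scan-and-scatter scheme that parallel compaction is commonly built on; it is not Thrust's implementation, and the names `exclusive_scan_flags` and `compact` are made up for this example. With Thrust, functions such as thrust::copy_if or thrust::remove_if would do this work on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum over 0/1 keep-flags: offsets[i] is the output slot
// element i would occupy if it is kept.
std::vector<std::size_t> exclusive_scan_flags(const std::vector<int>& keep)
{
    std::vector<std::size_t> offsets(keep.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < keep.size(); ++i)
    {
        offsets[i] = running;
        running += (keep[i] != 0) ? 1 : 0;
    }
    return offsets;
}

// Scatter step: every kept element writes itself to its precomputed slot.
// On a GPU each iteration of this loop could be its own thread, because no
// two kept elements share an output slot.
std::vector<float> compact(const std::vector<float>& values,
                           const std::vector<int>& keep)
{
    assert(values.size() == keep.size());
    const std::vector<std::size_t> offsets = exclusive_scan_flags(keep);

    std::size_t kept = 0;
    if (!keep.empty())
        kept = offsets.back() + ((keep.back() != 0) ? 1 : 0);

    std::vector<float> out(kept);
    for (std::size_t i = 0; i < values.size(); ++i)
        if (keep[i] != 0)
            out[offsets[i]] = values[i];
    return out;
}
```

The output size is known right after the scan, so the destination buffer can be allocated once from the host before the scatter runs.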
