Hi,
I am new to CUDA and am trying to run part of my current algorithm on the GPU through CUDAfy.NET; the current implementation is in C#.
The algorithm is based on a tree structure: each tree node has multiple arrays of double values and a varying number of children.
The first step of the algorithm is the same calculation on every tree node with no dependencies, which is rather straightforward: I can pass 2D arrays and use the tree-node id (a zero-based index) to reference them.
The next few steps need to traverse the trees (I have multiple trees, so I can fire up one thread per tree root), so the parent-child relationship needs to live on the device (so the recursive calculation can happen there). I found it challenging to allocate a tree on the device. I found an example that creates a device struct instance containing multiple non-fixed-length arrays, but I would then have to create an array of such device struct instances so the zero-based child indices in each node can be used…
My instinct tells me this is too complicated to be the right approach. I would like to get some advice, examples, suggestions…
Out of curiosity, is your tree static from the GPU’s perspective? If your use-case is rapid insertion/deletion in equal parts, I’m not sure how well the GPU can solve that kind of problem.
Otherwise, if you’re only mutating the data inside the nodes, the GPU can probably do what you want.
Thanks MutantJohn,
The answer is “yes”: these trees are static from the GPU’s point of view. I parse the trees out of a DB table when the algorithm starts, and all calculations are based on them; I won’t insert, update, or delete tree nodes within the algorithm. Next time the algorithm starts, or gets pointed at a different table, the tree structure or the number of nodes might change, though.
Sounds like you think it’s doable. Could you give me some suggestions?
My current implementation (C#) has TreeNode as a class with multiple hashmaps as fields (the length of each map comes from a config file or that table, so it is not a constant). Each node also has a hashmap of children (the same TreeNode type). All tree nodes from all trees live in one big hashmap, so I can either iterate through all nodes in a flat manner or traverse each tree in a certain order. Every node has a zero-based index field for storing its result in a wide array.
1. Should I use a struct to mimic the class I have, and make the children hashmap an array of pointers to other tree structs in device memory, with all the other hashmaps converted to arrays? This is the closest translation of my current code. How hard is this approach?
2. Or should I use an array of indices for the children hashmap and copy all nodes into one array, with each node at its own index? This way I save the pointer assignment (or not?).
3. Or use a 2D array, with the node index on the first dimension, to store each node’s children along the second dimension? Because the number of children varies, I would have to allocate the maximum size, basically [num_of_nodes, num_of_nodes], plus one more 2D array for all the other fields. This is easier to implement, but the code will be harder to read.
4. Or anything else?
Okay, so I do have some ideas about this.
In general in CUDA, you will almost always want a structure of arrays over anything else.
This is because of how the GPU accesses memory: neighboring threads reading neighboring addresses get coalesced into few memory transactions.
So I’d recommend a data structure like this:
#include <cstddef> // size_t, ptrdiff_t
#include <type_traits>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

namespace T = thrust;

size_t const static tpb = 256; // threads per block
size_t const static bpg = 512; // blocks per grid

__device__ __inline__
auto get_tid(void) -> size_t
{
    return blockIdx.x * blockDim.x + threadIdx.x;
}

constexpr
__device__
auto grid_stride(void) -> size_t
{
    return tpb * bpg;
}

struct tree_t
{
    // put your data fields here
    int* field1;
    int* field2;
    int* field3;

    // children of node i live at indices
    // [child_offsets[i], child_offsets[i] + num_children[i])
    size_t* num_children;
    ptrdiff_t* child_offsets;
};

__global__
void some_kernel(tree_t tree, size_t const tree_size)
{
    for (auto tid = get_tid(); tid < tree_size; tid += grid_stride()) {
        // we want to traverse this node's children
        auto const child_idx = tree.child_offsets[tid];
        auto const num_children = tree.num_children[tid];

        using index_t = typename std::remove_const<decltype(child_idx)>::type;

        auto fields = T::make_zip_iterator(
            T::make_tuple(
                tree.field1 + child_idx,
                tree.field2 + child_idx,
                tree.field3 + child_idx));

        // this will likely thrash coalescence
        for (index_t idx = child_idx;
             idx < child_idx + static_cast<index_t>(num_children);
             ++idx) {
            // read/write this child's fields through *fields here
            ++fields;
        }
    }
}

int main(void)
{
    return 0;
}
I wound up using a zip iterator on the device just for the convenience of traversing multiple arrays at once. Not sure about the quality or whether it’s valid, but it at least compiles :P