CUDA C++ - how to define an array of unknown size in cuda kernel (not with extern __shared__ )?

I am trying to parallelize the computation of a metric on the nodes of a graph.

As an approach I have made each thread calculate the metric on a node (since the calculation on a node is independent).

Each thread will have to calculate the neighbors at distance one of the considered node and store them in an array of initially unknown size (and different for each node).

I can’t use an extern __shared__ array because each thread has to compute its own array, which can’t be shared.

I can’t declare a fixed (maximum) array size because that would be very inefficient for my task.

Is there any other way to handle this array, or some other dynamic data structure I could use?

This is an extract of the kernel function:

__global__ void expectedForce(int* IR_vec, int* IC_vec, int n_IR)
{
    double ExF = 0;
    int seed = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (seed < n_IR) {
        int valRiga = IR_vec[seed];
        int distOne[]; // that's the array I have to handle
        ...
    }
}

You can call malloc() inside a CUDA kernel. Based on that, it would be possible to build a std::vector-like container that can dynamically grow.

Note that the resulting pointer can only be freed by calling free() inside a CUDA kernel.
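A minimal sketch of that approach, assuming (hypothetically) that your graph is stored in CSR form so that IR_vec holds row offsets and IC_vec holds column indices; the kernel and variable names are illustrative, not your actual code:

```cuda
__global__ void neighborsWithDeviceMalloc(const int* IR_vec, const int* IC_vec,
                                          int n_nodes)
{
    int node = blockDim.x * blockIdx.x + threadIdx.x;
    if (node >= n_nodes) return;

    // Degree from CSR row offsets (assumption about your data layout).
    int degree = IR_vec[node + 1] - IR_vec[node];

    // Per-thread buffer on the device heap.
    int* distOne = (int*)malloc(degree * sizeof(int));
    if (distOne == nullptr) return; // device heap exhausted

    for (int i = 0; i < degree; ++i)
        distOne[i] = IC_vec[IR_vec[node] + i];

    // ... compute the metric using distOne ...

    free(distOne); // must be freed in-kernel, not with cudaFree()
}
```

The device heap defaults to a small size (8 MB), so you may need to enlarge it on the host before launching, e.g. cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes).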

Christian


Here are two principles to bear in mind when writing GPU code:

  • The most efficient way to do something difficult is to avoid having to do it in the first place.
  • Feed the GPU the kind of work it likes to do.

Now, for your specific question: First, if you can establish a moderate upper bound Δ on the degree of each node in V, pre-allocate an array of |V| * Δ elements, think of it as |V| arrays of Δ elements each, and have each thread store its node’s neighbor indices in its respective small array.
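The upper-bound variant can be sketched like this; MAX_DEG and the CSR layout (IR_vec as row offsets, IC_vec as column indices) are assumptions for illustration:

```cuda
#define MAX_DEG 32 // hypothetical degree bound Δ

// d_neighbors is one cudaMalloc'd buffer of n_nodes * MAX_DEG ints;
// thread t privately uses the slice starting at t * MAX_DEG.
__global__ void collectNeighborsBounded(const int* IR_vec, const int* IC_vec,
                                        int n_nodes, int* d_neighbors)
{
    int node = blockDim.x * blockIdx.x + threadIdx.x;
    if (node >= n_nodes) return;

    int* distOne = d_neighbors + node * MAX_DEG; // this thread's slice
    int degree = IR_vec[node + 1] - IR_vec[node]; // CSR offsets assumed

    for (int i = 0; i < degree && i < MAX_DEG; ++i)
        distOne[i] = IC_vec[IR_vec[node] + i];

    // ... compute the metric on distOne ...
}
```

On the host you allocate once, cudaMalloc(&d_neighbors, n_nodes * MAX_DEG * sizeof(int)), instead of performing one allocation per thread.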

Otherwise:

  • Allocate an array of |V| elements, to store the prefix-sum of node degrees.
  • Have each thread count its degree.
  • Compute the prefix sum of the degrees and their overall sum; store the prefix sum in the array (you could forgo the storage, but then you’d need to recompute the prefix sum later).
  • This overall sum is the amount of memory to allocate (times the size of a node index of course); do it either outside the kernel, or alternatively you could use device-wide synchronization and device-side malloc (I’d try to avoid that).
  • With the memory allocated, you now perform the metric computations again, and you use the prefix sum array to determine where to store the neighbor indices.
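The steps above can be sketched as two kernels plus a scan; I use Thrust's exclusive_scan for the prefix sum, and again assume a CSR layout purely for illustration:

```cuda
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Step 1: each thread counts its node's degree.
__global__ void countDegrees(const int* IR_vec, int n_nodes, int* degrees)
{
    int node = blockDim.x * blockIdx.x + threadIdx.x;
    if (node < n_nodes)
        degrees[node] = IR_vec[node + 1] - IR_vec[node]; // CSR offsets assumed
}

// Step 2 (after the scan): offsets[node] tells each thread where
// its slice of the shared neighbor array begins.
__global__ void fillNeighbors(const int* IR_vec, const int* IC_vec,
                              int n_nodes, const int* offsets, int* neighbors)
{
    int node = blockDim.x * blockIdx.x + threadIdx.x;
    if (node >= n_nodes) return;

    int base = offsets[node];
    int degree = IR_vec[node + 1] - IR_vec[node];
    for (int i = 0; i < degree; ++i)
        neighbors[base + i] = IC_vec[IR_vec[node] + i];
}

// Host side (error checking omitted):
//   countDegrees<<<blocks, threads>>>(d_IR, n_nodes, d_degrees);
//   thrust::exclusive_scan(thrust::device, d_degrees, d_degrees + n_nodes,
//                          d_offsets);
//   total = last offset + last degree (copied back to the host);
//   cudaMalloc(&d_neighbors, total * sizeof(int));
//   fillNeighbors<<<blocks, threads>>>(d_IR, d_IC, n_nodes, d_offsets,
//                                      d_neighbors);
```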

This may seem like more work, but just try making millions of malloc() calls and see how well that plays out. :-P Every step of the algorithm above is highly parallelizable, except for the prefix sum, which takes O(log |V|) time even with unbounded parallelism; fortunately, effective implementations of that are readily available.


To emphasize @epk’s last point: In general, to achieve high performance, frequent dynamic memory allocation (and de-allocation) should be avoided in GPU programming as well as in multithreaded programming on CPUs. Corollary: Where possible, re-use dynamically allocated memory.

Consider a hybrid approach with preallocated arrays that fit, say, 95% of all cases; only the remaining 5% of threads allocate larger buffers as needed, e.g. with malloc() or new.
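A hybrid sketch along those lines, with SMALL_CAP and the CSR layout again as illustrative assumptions: fixed slices serve the common case, and only high-degree nodes take the slow device-malloc path.

```cuda
#define SMALL_CAP 16 // hypothetical capacity covering most nodes

__global__ void hybridNeighbors(const int* IR_vec, const int* IC_vec,
                                int n_nodes, int* small_buf)
{
    int node = blockDim.x * blockIdx.x + threadIdx.x;
    if (node >= n_nodes) return;

    int degree = IR_vec[node + 1] - IR_vec[node]; // CSR offsets assumed
    bool onHeap = degree > SMALL_CAP;

    int* distOne = onHeap
        ? (int*)malloc(degree * sizeof(int))   // rare, slow path
        : small_buf + node * SMALL_CAP;        // common, fast path
    if (distOne == nullptr) return;            // device heap exhausted

    for (int i = 0; i < degree; ++i)
        distOne[i] = IC_vec[IR_vec[node] + i];

    // ... compute the metric ...

    if (onHeap) free(distOne); // only heap buffers are freed
}
```

small_buf is one pre-allocated buffer of n_nodes * SMALL_CAP ints, so only the few oversized nodes ever touch the device heap.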

