CUDA: parallel BFSs over a C++ vector-implemented graph

Hi,

I'm new to GPGPU programming and I'm thinking of using it to simulate a model that requires running a BFS from each node of the graph during each step of the simulation, in order to determine the shortest path from each node to a set of destinations (other nodes).

The current model is written in C++ using the std::vector library: the entire adjacency list is a vector<vector>, and more importantly, there is a vector<vector> that holds information about the path participation of each node, meaning:
if node i is on the path from node j to one of its destinations, j is added to the i-th vector in the vector of vectors. The reason for this layout is that the number of participating node-paths isn't constant.

I was hoping to speed up the simulation (which runs a BFS from each node on each iteration) by implementing it with CUDA, using a kernel where each BFS runs in a different thread and updates the participation vector accordingly. (I'm not trying to parallelize a single BFS; I'm trying to run many BFSs, each on a different GPU thread.)

As far as I know, __global__ functions cannot handle 2D vectors, and I wish to avoid rewriting the entire model using 2D arrays (which would require fixed sizes and would also force enlarging the participation container into a size x size matrix that would need to be copied from the device to the host on each step).
So I wanted to ask: is there a time-efficient way to make such a conversion (from a vector-of-vectors graph to something a device function can work with) on each step, or is there no better solution than to convert the entire model to use arrays?

Thank you very much.

I think you have a couple of options.

  1. There is no reason you can't flatten and copy the data from your 2D vectors and access the information using a 2D grid and block. Check out http://www.mathcs.emory.edu/~cheung/Courses/355/Syllabus/94-CUDA/2D-grids.html.

After you flatten the vector, you can call .data() to get a raw pointer to its contents and pass that to cudaMemcpy.

  2. Check out Thrust https://thrust.github.io/. I don't believe it does 2D vectors, but again you should be able to flatten it. See the sketch after this list.
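
To make that concrete, here is a minimal sketch of the flattening step, assuming the adjacency list is a std::vector<std::vector<int>> (the CSR-style offsets/neighbors layout and all names below are just for illustration, not your code):

#include <vector>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

// Flatten a vector-of-vectors adjacency list into CSR-style arrays:
// the neighbors of node i live in neighbors[offsets[i] .. offsets[i+1]).
void flatten_graph(const std::vector<std::vector<int>>& adj,
                   thrust::host_vector<int>& offsets,
                   thrust::host_vector<int>& neighbors)
{
    offsets.resize(adj.size() + 1);
    offsets[0] = 0;
    for (size_t i = 0; i < adj.size(); ++i) {
        offsets[i + 1] = offsets[i] + static_cast<int>(adj[i].size());
        for (int v : adj[i]) neighbors.push_back(v);
    }
}

// Usage (host -> device copy each step, or once if the graph is static):
// thrust::host_vector<int> h_offsets, h_neighbors;
// flatten_graph(adj, h_offsets, h_neighbors);
// thrust::device_vector<int> d_offsets   = h_offsets;
// thrust::device_vector<int> d_neighbors = h_neighbors;
// const int* raw_offsets   = thrust::raw_pointer_cast(d_offsets.data());
// const int* raw_neighbors = thrust::raw_pointer_cast(d_neighbors.data());

The raw pointers can then be handed straight to a __global__ kernel.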

One thing you want to keep in mind is how to get the adjacency matrix to each independent thread. Depending on the size of the matrix, there aren't enough registers to store the entire matrix for each thread.
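
For example, a one-BFS-per-thread kernel can simply read the flattened offsets/neighbors arrays from global memory. Something along these lines (a rough sketch with made-up names; the per-thread dist/queue scratch lives in global memory, not registers):

// One BFS per thread: thread 'src' computes hop distances from node 'src'
// to every other node. dist and queue are num_nodes * num_nodes scratch
// buffers in global memory (one row per source), allocated with cudaMalloc.
__global__ void bfs_per_thread(const int* offsets, const int* neighbors,
                               int num_nodes, int* dist, int* queue)
{
    int src = blockIdx.x * blockDim.x + threadIdx.x;
    if (src >= num_nodes) return;

    int* my_dist  = dist  + (size_t)src * num_nodes;   // this thread's distances
    int* my_queue = queue + (size_t)src * num_nodes;   // this thread's FIFO

    for (int i = 0; i < num_nodes; ++i) my_dist[i] = -1;  // -1 == unvisited
    my_dist[src] = 0;

    int head = 0, tail = 0;
    my_queue[tail++] = src;

    while (head < tail) {
        int u = my_queue[head++];
        for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
            int v = neighbors[e];
            if (my_dist[v] == -1) {
                my_dist[v] = my_dist[u] + 1;
                my_queue[tail++] = v;
            }
        }
    }
}

// Launch, e.g.: bfs_per_thread<<<(num_nodes + 255) / 256, 256>>>(
//                   raw_offsets, raw_neighbors, num_nodes, d_dist, d_queue);

Note the scratch is O(num_nodes^2) ints in total; that's the price of keeping every BFS fully independent, and memory traffic rather than registers becomes the constraint.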

Lastly, while it’s not the same as your problem, check out https://github.com/mnicely/cublasLt_examples/blob/master/cublasLt_sgemm.cu. It shows how to map a 2D matrix to a 1D thrust::host_vector and copy it to device memory.

First of all, thank you for the reply.

At what matrix sizes does copying the adjacency matrix to each thread stop being possible/efficient? And is there an efficient way to store large data structures in shared GPU memory or something like that?

And concerning a device function changing the size of the data structure it was given (for example, changing the participation list): are there any problems with that, or is it just memcpy'ing all the same?

Thanks for the help, it is much appreciated.

At what matrix sizes does copying the adjacency matrix to each thread stop being possible/efficient? And is there an efficient way to store large data structures in shared GPU memory or something like that?

The number of registers available to individual threads in a block depends on the number of threads in the block. If you check the compute capability table in the Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, for some architectures there is a maximum of 64K registers per block and 255 registers per thread.

Let’s say you use the maximum number of threads per block, 1024; that means you only have 64 registers per thread (64K / 1024). But shrinking the block to 32 threads doesn’t raise the per-thread limit above 255. Make sense?
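
If you want to steer this at compile time, one mechanism (just a sketch, not something you necessarily need) is __launch_bounds__, which tells the compiler what block size to budget registers for; you can also cap registers for a whole compilation unit with nvcc's -maxrregcount flag.

// Tell the compiler this kernel is launched with at most 256 threads per block
// and that we'd like at least 2 resident blocks per SM; it budgets registers
// per thread accordingly (spilling to local memory if it has to).
__global__ void __launch_bounds__(256, 2)
bounded_kernel(const int* offsets, int num_nodes, int* dist)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_nodes) dist[tid] = offsets[tid];   // placeholder body
}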

Profiling your code with Nsight Compute will give you much greater insight! https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

As far as efficiency, if a thread uses too many registers, there will be spilling into local memory. Local memory actually lives in global memory and suffers from the same latency and bandwidth limitations.

One solution can be to use some shared memory. Keep in mind that shared memory is a limited on-chip resource; if one block uses too much, it will prevent other blocks from launching on an SM and the kernel will run sub-optimally.
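
For reference, the basic pattern for dynamically sized shared memory looks like this (a sketch only, with illustrative names; whether staging part of the graph in shared memory actually helps depends on your access pattern):

// Dynamically sized shared memory: the per-block size in bytes is supplied as
// the third <<<>>> launch parameter, so the same kernel works for any tile size.
__global__ void staged_copy(const int* in, int count, int* out)
{
    extern __shared__ int tile[];                      // sized at launch time

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count) tile[threadIdx.x] = in[tid];      // stage into shared memory
    __syncthreads();

    if (tid < count) out[tid] = tile[threadIdx.x];     // ...then use the staged data
}

// Launch: staged_copy<<<blocks, threads, threads * sizeof(int)>>>(d_in, n, d_out);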

Depending on how large your adjacency matrix is, you might want to allot more than one thread per BFS?

And concerning a device function changing the size of the data structure it was given (for example, changing the participation list): are there any problems with that, or is it just memcpy'ing all the same?

Sorry, I’m not sure what you’re asking.

And concerning a device function changing the size of the data structure it was given (for example, changing the participation list): are there any problems with that, or is it just memcpy'ing all the same?

Sorry, I’m not sure what you’re asking.

I meant to ask: is there any problem (in terms of both possibility and efficiency) with re-allocating memory for a data structure inside a kernel function (for example, when a device function grows a certain array) and then copying the updated data back to the host? Or is it handled the same as a normal C realloc and can be used freely?

Thank you.

I’ve never tried to use realloc in CUDA device code, but this is not the first time the question has been asked. Take a look at the following links; you might have to code it up and profile the performance to get a concrete answer. There’s also a rough sketch of the device-side malloc/free approach after the links.

https://stackoverflow.com/questions/18780709/cuda-dynamic-memory-allocation-in-kernel
https://stackoverflow.com/questions/5322583/cuda-using-realloc-inside-kernel
https://stackoverflow.com/questions/25436198/implementing-realloc-in-cuda-without-moving-data
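
For what it’s worth, device-side malloc and free do exist (allocations come from a fixed-size device heap that you can enlarge with cudaDeviceSetLimit), so a device-side "realloc" is usually done by hand: allocate a bigger buffer, copy, free the old one. A minimal sketch of that pattern, with made-up names (not from the links above):

#include <cstdio>

// Device-heap "realloc" by hand: malloc a larger buffer, copy, free the old.
// Allocations come from the device heap, can be slow, and the resulting
// pointers are NOT usable with cudaMemcpy from the host.
__device__ int* grow_buffer(int* old_buf, int old_count, int new_count)
{
    int* new_buf = (int*)malloc(new_count * sizeof(int));
    if (new_buf != nullptr) {
        for (int i = 0; i < old_count; ++i) new_buf[i] = old_buf[i];
        free(old_buf);
    }
    return new_buf;
}

__global__ void demo_grow()
{
    int* buf = (int*)malloc(4 * sizeof(int));
    if (buf == nullptr) return;
    for (int i = 0; i < 4; ++i) buf[i] = i;

    buf = grow_buffer(buf, 4, 8);                      // grow from 4 to 8 ints
    if (buf == nullptr) return;
    for (int i = 4; i < 8; ++i) buf[i] = i;

    printf("thread %u: buf[7] = %d\n", threadIdx.x, buf[7]);
    free(buf);
}

// Host side: raise the device heap limit before launching if the default
// (typically 8 MB) isn't enough, e.g.
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
// demo_grow<<<1, 1>>>();
// cudaDeviceSynchronize();

Because device-heap memory can’t be handed straight to cudaMemcpy, for your participation lists it may be simpler to pre-allocate a worst-case buffer (plus a per-node count) with cudaMalloc on the host and skip in-kernel allocation entirely.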

Thank you mnicely, it is much appreciated.