I want to implement bfs in CUDA. I found a promising paper :“An Effective GPU Implementation of Breadth-First Search”.
Does anyone know where to get pseudocode or an implementiation example?
I already used the frontier Algorithm by P. Harish and P. J. Narayanan but it is slower than a fast CPU Implementation on graphs with big diameter. The algorithm is not work efficient.
I also found the paper: “Accelerating CUDA Graph Algorithms at Maximum Warp”. Did anyone implement the bfs Algorithm described in it?
Do you know more papers about efficient bfs CUDA implementations? Do you know which performs the best?
I have implemented breadth first search to replace a traditional (CPU) depth first recursive search.
Results were not promising. Even running with a beam width of 1.8 million the GPU was
no faster than the orginal CPU.
@ mfatica did you implement the algorithm described yourself? It seems very promising - thanks again! But it is hard to understand for me.
I got the source from svn. Its very much to read.
Is there a minimal example that implements just one of the shown strategies?
I have not much time (about 1 week). I need it for my thesis.
Do you know any easier algorithms that perform better than the frontier based?
@ wlangdon which algorithm did you use? Was it a frontier based one? They are slower than the CPU version on graphs with a big diameter and on uneven distributed graphs.
Did you read the paper mfatica posted? This seems to be fast. But I can’t tell yet.
It would really help me if I had a minimal code example. The source code available at google code is to complex for me to understand in a short time, because there is much code to be as generic as possible…
And the real action is scattered over many files.
I am emulating a recursive depth first search by fully expanding each level of the tree as it is encountered. When a node in the tree is processed we get information on it but also we know if there is another level hanging below it that needs expanding. Typically after 2 or 3 levels the beam width becomes too big to deal with all of it. I have an implementation defined limit of 1.8 million (it could be increased). Only 1.8 items are passed to the GPU at a time. The rest are queued on the PC. New items to search (from the next level down) are appended to the host queue of work to be done.
I would be very interested to learn why frontier based algs are slow.
Please excuse the typing, I am away fron my desk at present.