breadth first search (bfs)

I want to implement bfs in CUDA. I found a promising paper :“An Effective GPU Implementation of Breadth-First Search”.
Does anyone know where to get pseudocode or an implementiation example?
I already used the frontier Algorithm by P. Harish and P. J. Narayanan but it is slower than a fast CPU Implementation on graphs with big diameter. The algorithm is not work efficient.
I also found the paper: “Accelerating CUDA Graph Algorithms at Maximum Warp”. Did anyone implement the bfs Algorithm described in it?

Do you know more papers about efficient bfs CUDA implementations? Do you know which performs the best?

kind regards.

Code is available at, if you grab the source with svn

thank you very very much. That helps a lot!
I will read the paper and if I have further questions I will post them here.

The abstract is really promising. Thanks again!

I have implemented breadth first search to replace a traditional (CPU) depth first recursive search.
Results were not promising. Even running with a beam width of 1.8 million the GPU was
no faster than the orginal CPU.

Formal Concept Analysis on Graphics Hardware, W. B. Langdon, Shin Yoo and M. Harman, To be presented at CLA 2011


@ mfatica did you implement the algorithm described yourself? It seems very promising - thanks again! But it is hard to understand for me.
I got the source from svn. Its very much to read.
Is there a minimal example that implements just one of the shown strategies?
I have not much time (about 1 week). I need it for my thesis.
Do you know any easier algorithms that perform better than the frontier based?

@ wlangdon which algorithm did you use? Was it a frontier based one? They are slower than the CPU version on graphs with a big diameter and on uneven distributed graphs.
Did you read the paper mfatica posted? This seems to be fast. But I can’t tell yet.

It would really help me if I had a minimal code example. The source code available at google code is to complex for me to understand in a short time, because there is much code to be as generic as possible…
And the real action is scattered over many files.

I am emulating a recursive depth first search by fully expanding each level of the tree as it is encountered. When a node in the tree is processed we get information on it but also we know if there is another level hanging below it that needs expanding. Typically after 2 or 3 levels the beam width becomes too big to deal with all of it. I have an implementation defined limit of 1.8 million (it could be increased). Only 1.8 items are passed to the GPU at a time. The rest are queued on the PC. New items to search (from the next level down) are appended to the host queue of work to be done.

I would be very interested to learn why frontier based algs are slow.

Please excuse the typing, I am away fron my desk at present.


I had some problems getting hold of tech report CS-2011-05.
There is now a copy at
(its 18 pages and quite dense)

Dr. W. B. Langdon,
Department of Computer Science,
University College London
Gower Street, London WC1E 6BT, UK

EvoPAR 2012
EuroGP 2012 30 Nov
A Field Guide to Genetic Programming
GP Bibliography

Hi All,

We have implemented and open sourced one GPU based BFS which is order of magnitude faster than PPoPP '12 Scalable GPU Graph Traversal.

The source can be found here:
And the paper is “[SC '15] Enterprise: Breadth-First Graph Traversal on GPUs”