Problems with BFS implementation based on atomicCAS and atomicAdd

I am trying to implement a parallel version of Breadth-first search using Nvidia CUDA.

My algorithm gives the correct result for any small graph that I have made, and for an undirected graph with about 1.9 million nodes. The problems occur when I try larger graphs (more than 2 million nodes): the result varies from run to run, and I get a lot of errors.

My question is: is there some obvious synchronization issue here that I am missing? Or something else? Is my use of atomicCAS or atomicAdd naive in some way?

A little explanation of my algorithm, and the most important code:

I use two queues: a read_queue containing the vertices whose neighbors I want to explore, and a write_queue which will contain the vertices for the next iteration.

The goal is to calculate the cost array: the minimum number of edges from the source to each node. Each cost starts at -1 and is updated as soon as the vertex is discovered for the first time. I compare my resulting cost array with one from a sequential BFS.
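For comparison, the sequential BFS I check against can be sketched like this in plain C (this is a sketch of the idea, not my exact code; `offset`/`adj` are the CSR arrays described above):

```c
#include <stdlib.h>

/* Sequential BFS over a CSR graph (offset/adj as in the question),
   filling cost[] with the minimum number of edges from source.
   Unreached vertices keep -1. Sketch of the reference implementation,
   not the actual code from the post. */
void bfs_seq(int vertices, const int *offset, const int *adj,
             int source, int *cost) {
    int *queue = malloc(vertices * sizeof(int));
    int head = 0, tail = 0;
    for (int i = 0; i < vertices; i++)
        cost[i] = -1;
    cost[source] = 0;
    queue[tail++] = source;
    while (head < tail) {
        int u = queue[head++];
        /* offset[u] .. offset[u+1] spans u's neighbors in adj. */
        for (int j = offset[u]; j < offset[u+1]; j++) {
            int v = adj[j];
            if (cost[v] == -1) {          /* first discovery fixes the cost */
                cost[v] = cost[u] + 1;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
}
```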

I have made sure that I have enough threads for each vertex, and I have tried block_size values from 32 to 1024. My GPU is an Nvidia GeForce MX-150 with Compute Capability 6.1.

__global__ void bfs(int vertices, int *offset, int *adj, int *cost,
    int *read_queue, int rqueue_size, int *write_queue, int *wqueue_size) {

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    if (tid < rqueue_size) {

        int u = read_queue[tid];
        int next_cost = cost[u] + 1;

        for (int j = offset[u]; j < offset[u+1]; j++) {
            int v = adj[j];

            // Only the thread whose CAS swaps -1 -> next_cost enqueues v.
            if (cost[v] == -1 && atomicCAS(&cost[v], -1, next_cost) == -1) {

                int pos = atomicAdd(wqueue_size, 1);
                write_queue[pos] = v;
            }
        }
    }
}
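The intent of the atomicCAS line is a "claim": the plain `cost[v] == -1` read is only a cheap pre-filter, and only the single thread whose compare-and-swap actually observes -1 gets to enqueue v. A host-side analogue with C11 atomics (just an illustration of the pattern, not part of the CUDA code):

```c
#include <stdatomic.h>

/* Host-side analogue of the kernel's atomicCAS claim pattern.
   Returns 1 only for the single caller that swaps -1 -> next_cost;
   every later caller sees a non-(-1) value and gets 0. */
int try_claim(atomic_int *cost_v, int next_cost) {
    int expected = -1;
    return atomic_compare_exchange_strong(cost_v, &expected, next_cost);
}
```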

And the initialization code

int vertices = graph->vertices;
int bytes = vertices * sizeof(int);
int block_size = 256;

int *d_cost;
int *adj;
int *offset;

int *read_queue;
int *write_queue;
int rqueue_size;
int *wqueue_size;

cudaMalloc(&offset, bytes + sizeof(int));
cudaMemcpy(offset, graph->offset, bytes + sizeof(int), cudaMemcpyHostToDevice);

cudaMalloc(&adj, graph->edges * sizeof(int));
cudaMemcpy(adj, graph->adj, graph->edges * sizeof(int), cudaMemcpyHostToDevice);

int *h_cost = malloc(bytes);
fill(h_cost, vertices, -1);
h_cost[source] = 0;

cudaMalloc(&d_cost, bytes);
cudaMemcpy(d_cost, h_cost, bytes, cudaMemcpyHostToDevice);

//Allocate the read and write queue, and insert the source vertex.
cudaMalloc(&read_queue, bytes);
cudaMemcpy(read_queue, &source, sizeof(int), cudaMemcpyHostToDevice);

cudaMalloc(&write_queue, bytes);
cudaMalloc(&wqueue_size, sizeof(int));
cudaMemset(wqueue_size, 0, sizeof(int));

int level = 0;
rqueue_size = 1; 
int blocks;

while (rqueue_size > 0) {

    blocks = (rqueue_size + block_size - 1) / block_size;

    bfs<<<blocks, block_size>>>(vertices, offset, adj, d_cost, read_queue, rqueue_size, write_queue, wqueue_size);

    cudaMemcpy(&rqueue_size, wqueue_size, sizeof(int), cudaMemcpyDeviceToHost);

    cudaMemset(wqueue_size, 0, sizeof(int));

    //previous write queue is now the read queue.
    int *tmp = write_queue;
    read_queue = write_queue;
    write_queue = tmp;

    level++;
}

Is it possible to provide enough code that we can compile and run it?

Remember that the largest value an int can hold is 2147483647. Can you check that you’re not hitting that limit anywhere?
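For example (hypothetical edge count, assuming a 32-bit int), a byte count like edges * sizeof(int) wraps around once the edge count passes INT_MAX / 4:

```c
#include <limits.h>

/* Returns 1 if counting `edges` ints in bytes would overflow a 32-bit int.
   With 4-byte ints, the threshold INT_MAX / 4 is about 536 million edges. */
int edge_bytes_overflow(long long edges) {
    return edges * (long long)sizeof(int) > (long long)INT_MAX;
}
```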

Thank you.

I have checked for integer overflows, but it does not seem to be the problem.

Someone suggested that it might be because of a GPU watchdog timer. I turned it off without success.

I am currently using 3 graphs for testing:

Interestingly, running with the California network on my laptop (GeForce MX-150) works fine, but the other two fail.

I am also testing on a university computer with a Tesla K40m; here both California and RoadNet-USA work fine, but NOT the wikitalk graph (the only directed graph of the three, with a completely different structure), even though it is not the largest.

Here is the code (apologies if it's a mess, I am a beginner at C and CUDA):

Appreciate any help or advice!

EDIT: I just tried running on another university computer with a Tesla V100-SXM3-32GB.

This runs successfully on all graphs! So does anyone have a clue why my code doesn't work on the other GPUs?

You might be hitting a kernel timeout on the other GPUs. To me, anyway, it seems possible that your experiment of disabling the watchdog timer may have failed simply because you didn’t correctly disable it.

On Windows I turned off the WDDM TDR as administrator in Nsight Monitor and restarted.

I also tried running with the Nvidia Nsight Systems profiler; it doesn't report any errors as far as I can tell, though I don't really know how to read the report.

Edit: result of nvprof on the computer with the Tesla K40m:

~/cuda-bfs $ nvprof ./bfs_queue graph_wikitalk
==1363== NVPROF is profiling process 1363, command: ./bfs_queue graph_wikitalk
parallel reached 7 levels
terminated with 401072 errors
==1363== Profiling application: ./bfs_queue graph_wikitalk
==1363== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.82%  816.50ms         8  102.06ms  2.2720us  473.99ms  bfs(int, int*, int*, int*, int*, int, int*, int*)
                   26.99%  302.59ms         4  75.649ms  2.1760us  258.14ms  [CUDA memcpy HtoD]
                    0.19%  2.1497ms         9  238.86us  2.0160us  2.1216ms  [CUDA memcpy DtoH]
                    0.00%  31.136us         9  3.4590us  3.0720us  4.2240us  [CUDA memset]
      API calls:   76.82%  1.12323s        13  86.402ms  20.000us  474.03ms  cudaMemcpy
                   23.09%  337.60ms         6  56.267ms  179.15us  334.71ms  cudaMalloc
                    0.04%  578.63us        96  6.0270us     202ns  227.68us  cuDeviceGetAttribute
                    0.03%  425.45us         8  53.181us  9.1040us  337.54us  cudaLaunchKernel
                    0.02%  271.97us         1  271.97us  271.97us  271.97us  cuDeviceTotalMem
                    0.00%  72.643us         9  8.0710us  3.8330us  29.015us  cudaMemset
                    0.00%  56.803us         1  56.803us  56.803us  56.803us  cuDeviceGetName
                    0.00%  9.1650us         1  9.1650us  9.1650us  9.1650us  cuDeviceGetPCIBusId
                    0.00%  2.2520us         3     750ns     290ns  1.1270us  cuDeviceGetCount
                    0.00%  1.2920us         2     646ns     332ns     960ns  cuDeviceGet

Have you confirmed you don’t have any memory issues using cuda-memcheck tools?

Yes, seems ok.

~/cuda-bfs $ cuda-memcheck bfs_queue graph_wikitalk
parallel reached 7 levels
terminated with 423370 errors
========= ERROR SUMMARY: 0 errors

~/cuda-bfs $ cuda-memcheck bfs_queue graph_roadnetUSA
parallel reached 6261 levels
========= ERROR SUMMARY: 0 errors

Another observation is that changing the block size heavily influences the number of errors. A block size of 32 gives more than twice as many errors as a block size of 1024.

~/cuda-bfs $ cuda-memcheck bfs_queue graph_wikitalk 32
parallel reached 8 levels
terminated with 1167599 errors
========= ERROR SUMMARY: 0 errors

~/cuda-bfs $ cuda-memcheck bfs_queue graph_wikitalk 1024
parallel reached 7 levels
terminated with 431218 errors
========= ERROR SUMMARY: 0 errors

The Tesla V100 is the only GPU with Compute Capability 7.0, while the Tesla K40m has 3.5, and MX-150 has 6.1. Can this be related?

I have tried compiling with explicit -arch=sm_35 on K40m, and -arch=sm_61 on MX-150, but it didn’t help.

Okay, I was able to narrow it down to your read_queue and write_queue arrays. Where you do your asserts, print all the values that don't match, and then print an arbitrary index (u) from both the sequential and the parallel results: for the values that don't match you should not see anything in the parallel output, and you should see the opposite for the values that match. Therefore, I believe the issue is in the logic that stores v in your CUDA kernel.

I hope that helps you out.

I'm sorry, I didn't quite get that. What do you mean by not seeing anything in the parallel?

When I compare the results for the wiki graph, h_cost differs from seq_cost by just 1 in a lot of cases, or the parallel version simply has not visited some nodes.

On the US roadnet graph, however, there doesn't seem to be any obvious pattern; the costs can be either close or quite far off.

I'm just very confused by the fact that the code runs on one GPU but not the others.

Thanks for your help

If I do this:

if (level % 2 == 0) {
    bfs<<<blocks, block_size>>>(vertices, offset, adj, d_cost, read_queue, rqueue_size, write_queue, wqueue_size);
} else {
    bfs<<<blocks, block_size>>>(vertices, offset, adj, d_cost, write_queue, rqueue_size, read_queue, wqueue_size);
}

instead of

//previous write queue is now the read queue.
int *tmp = write_queue;
read_queue = write_queue;
write_queue = tmp;

then it all seems to work. Any idea why? Is it not possible to reliably swap pointers that refer to device memory on the host?

thanks guys

Your swap code is wrong.

You want something like:

int *tmp = write_queue;
write_queue = read_queue;
read_queue = tmp;
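Swapping device pointers on the host is perfectly fine, by the way; the problem is only the assignment order in your version. A standalone plain-C sketch of the difference:

```c
/* The swap from the question: tmp saves write_queue, but read_queue is
   then assigned write_queue instead of the other way around, so both
   pointers end up aliasing the old write queue and nothing is swapped. */
void broken_swap(int **read_queue, int **write_queue) {
    int *tmp = *write_queue;
    *read_queue = *write_queue;   /* bug: should be *write_queue = *read_queue */
    *write_queue = tmp;
}

/* The corrected swap: the two pointers genuinely exchange targets. */
void correct_swap(int **read_queue, int **write_queue) {
    int *tmp = *write_queue;
    *write_queue = *read_queue;
    *read_queue = tmp;
}
```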

Holy shit, I can't believe it. Thanks, I'll be back again with more for sure!