[Code Review] Basic Bellman Ford. Curious about errors in large sizes

Here is the code:
https://gist.github.com/CraigglesO/bd1c6413e435fef891d8f35f51f61f4e

As long as the number of edges is less than 400,000, the algorithm works flawlessly. It’s pretty fast as is, but three things bother me:

  1. Larger edge counts cause a slight decrease in performance, but shouldn’t that not matter? Shouldn’t every edge be reviewed, and updated if necessary, once per cycle?

  2. Say about 350,000 vertices and 1,000,000 edges: this should take roughly 350,000 cycles. I assumed GPUs were almost as fast as CPUs, so ~350,000 cycles should be a cakewalk… but this algorithm takes 28 seconds to compute. Although this is an incredible speedup over the CPU, I guess I don’t understand what is taking the longest?

  3. This is the most important one: above 400,000 edges, the CPU and GPU results are not exactly the same. Is there something in my code that is causing this?

Thanks,
Connor.

My guess would be that your algorithm requires a global sync between step 1 and step 2.

You might think that all threads complete step 1 before any thread starts on step 2, but there is nothing in the CUDA programming model that guarantees that.

Also, __syncthreads() is not a global sync. So if you were expecting that where you are using it in the kernel, it won’t work the way you expect.

Allow me to quibble on a technicality. Global synchronization between steps 1 and 2 can be achieved by sticking step 1 into a first kernel and step 2 into a second kernel, then launching kernels 1 and 2 in sequence in the same stream.

Whether this kind of separation is possible (and if possible, practical and well-performing) here, I cannot say. In general, the only global (GPU-wide) synchronization mechanism consistent with the CUDA programming model is a kernel launch.
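As a sketch of what that separation looks like (hypothetical kernel and variable names, not the poster’s actual code): two kernels launched back-to-back on the same stream execute in issue order, so the launch boundary acts as the grid-wide sync point.

```
// Hypothetical kernels: step1_kernel does the distance initialization,
// step2_kernel does the edge relaxation. Kernels issued to the same
// stream run in order, so every thread of step1_kernel is guaranteed
// to have finished before any thread of step2_kernel starts.
step1_kernel<<<numBlocks, blockSize>>>(V, d_dist, src);
step2_kernel<<<numBlocks, blockSize>>>(V, E, d_edges, d_dist);
cudaDeviceSynchronize();  // host waits for both kernels to complete
```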

Perhaps I should have said:

“as your code is written, there is nothing in the CUDA programming model that guarantees that”

Yes, the kernel launch is a grid-wide sync point.

Cooperative groups also provides another possible grid-wide sync mechanism.
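For completeness, a minimal sketch of the cooperative-groups route (this assumes a device that supports cooperative launch, and the kernel must be started with cudaLaunchCooperativeKernel rather than the plain <<<>>> syntax; `Edge` is the same struct used elsewhere in this thread):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void BellmanFordCoop(int V, int E, struct Edge* edges, int* dist) {
  cg::grid_group grid = cg::this_grid();
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = 1; i <= V - 1; i++) {
    for (int j = index; j < E; j += stride) {
      int u = edges[j].src, v = edges[j].dest, w = edges[j].weight;
      if (dist[u] != INT_MAX && dist[u] + w < dist[v])
        atomicMin(&dist[v], dist[u] + w);
    }
    grid.sync();  // grid-wide barrier between relaxation passes
  }
}
```

Note that grid.sync() only works when all blocks are resident on the GPU at once, which limits the grid size you can launch.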

I expected that to be the problem as well after doing some reading on the situation.

// I set up a decent size:
int V = 100000;  // Number of vertices in graph
int E = 300000;  // Number of edges in graph

// setup dist linearly
  for (int i = 0; i < V; i++)
    dist[i] = INT_MAX;
  dist[0] = 0;

  // copy over to device
  cudaMemcpy(d_edges, edges, E*sizeof(Edge), cudaMemcpyHostToDevice);
  cudaMemcpy(d_dist, dist, V*sizeof(int), cudaMemcpyHostToDevice);

No dice:
CPU time: 262911
number of differences: 93778
Time for the kernel: 1290.504150 ms

Fixed. The problem is obvious now that I think about it, but it reduces the speed of the algorithm -_- Still MUCH faster than the CPU, though.

PREVIOUS:

__global__ void BellmanFord(int V, int E, struct Edge* edges, int *dist, int src) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  // Step 1: Initialize distances from src to all other vertices
  // as INFINITE. While a little wasteful, V will always be smaller than E
  for (int i = index; i < V; i += stride)
    dist[i] = INT_MAX;
  dist[src] = 0;

  // Step 2: Relax all edges |V| - 1 times. A simple shortest
  // path from src to any other vertex can have at-most |V| - 1
  // edges
  for (int i = 1; i <= V - 1; i++) {
    for (int j = index; j < E; j += stride) {
      int u      = edges[j].src;
      int v      = edges[j].dest;
      int weight = edges[j].weight;
      // NOTE: Two threads might try to write to the same memory
      // location at the same time, so we use atomicMin for the update.
      if (dist[u] != INT_MAX && dist[u] + weight < dist[v])
        atomicMin(&dist[v], dist[u] + weight);
    }
    __syncthreads();
  }

  return;
}

NEW:

SetupDist<<<numBlocksV, blockSize>>>(V, d_dist, 0);
cudaDeviceSynchronize();
for (int i = 1; i <= V - 1; i++) {
  BellmanFord<<<numBlocksE, blockSize>>>(V, E, d_edges, d_dist);
  cudaDeviceSynchronize();
}

__global__ void BellmanFord(int V, int E, struct Edge* edges, int *dist) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  // Step 2: Relax all edges |V| - 1 times. A simple shortest
  // path from src to any other vertex can have at-most |V| - 1
  // edges
  for (int j = index; j < E; j += stride) {
    int u      = edges[j].src;
    int v      = edges[j].dest;
    int weight = edges[j].weight;
    // NOTE: Two threads might try to write to the same memory
    // location at the same time, so we use atomicMin for the update.
    if (dist[u] != INT_MAX && dist[u] + weight < dist[v])
      atomicMin(&dist[v], dist[u] + weight);
  }

  return;
}

Just like you said for step 1, step 2 had the same issue: I was running the V-1 relaxation passes inside the device code, with no global sync between passes.
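For anyone comparing results the same way (the "number of differences" check above), a plain serial reference version is handy. This is a sketch with a hypothetical function name, mirroring the Edge struct from the gist:

```cpp
#include <vector>
#include <climits>

// Hypothetical mirror of the Edge struct used in the gist.
struct Edge { int src, dest, weight; };

// Plain serial Bellman-Ford: relax every edge V-1 times.
// INT_MAX in the result means "unreachable from src".
std::vector<int> bellman_ford_cpu(int V, const std::vector<Edge>& edges, int src) {
  std::vector<int> dist(V, INT_MAX);
  dist[src] = 0;
  for (int i = 1; i <= V - 1; i++) {
    for (const Edge& e : edges) {
      if (dist[e.src] != INT_MAX && dist[e.src] + e.weight < dist[e.dest])
        dist[e.dest] = dist[e.src] + e.weight;
    }
  }
  return dist;
}
```

Counting mismatches between this and the device output after cudaMemcpy back to the host gives the same "number of differences" figure quoted earlier.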