As long as the number of edges is less than 400,000, the algorithm works flawlessly. It’s pretty fast as is, but three things bother me:
1. Larger edge counts cause a slight decrease in performance, but shouldn’t that make little difference? In every cycle, every edge is reviewed and updated if necessary, isn’t it?
2. Say about 350,000 vertices and 1,000,000 edges: this should take roughly 350,000 cycles. I assumed GPUs were almost as fast as CPUs, so ~350,000 cycles should be a cakewalk… but this algorithm takes 28 seconds to compute. Although that is an incredible speedup over the CPU, I guess I don’t understand what is taking the longest?
3. This is the most important one: above 400,000 edges, the CPU and GPU versions no longer produce exactly the same results. Is there something in my code that is causing this?
my guess would be that your algorithm requires a global sync between step 1 and step 2.
You might think that all threads complete step 1 before any thread starts on step 2, but there is nothing in the CUDA programming model that guarantees that.
Also, __syncthreads() is not a global sync; it only synchronizes the threads within a single block. So if you were relying on it for global synchronization where you are using it in the kernel, it won’t work the way you expect.
Allow me to quibble on a technicality. Global synchronization between steps 1 and 2 can be achieved by sticking step 1 into a first kernel and step 2 into a second kernel, then launching kernels 1 and 2 in sequence in the same stream.
Whether this kind of separation is possible (and if possible, practical and well-performing) here, I cannot say. In general, the only global (GPU-wide) synchronization mechanism consistent with the CUDA programming model is a kernel launch.
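A minimal sketch of that pattern, with placeholder kernel and variable names (nothing here is taken from your code):

#include <cuda_runtime.h>

// Two trivial kernels standing in for "step 1" and "step 2".
__global__ void step1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0;        // e.g. initialize
}

__global__ void step2(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;       // e.g. update; sees all of step1's writes
}

void run(int *d_data, int n) {
    int block = 256, grid = (n + block - 1) / block;
    // Both launches go into the default stream, so step2 cannot begin
    // until every thread of step1 has completed: an implicit global sync.
    step1<<<grid, block>>>(d_data, n);
    step2<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();       // only needed before the host reads results
}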
I expected that to be the problem as well after doing some reading on the situation.
// I set up a decent size:
int V = 100000; // Number of vertices in graph
int E = 300000; // Number of edges in graph

// Set up dist on the host
for (int i = 0; i < V; i++)
    dist[i] = INT_MAX;
dist[0] = 0;

// Copy over to the device
cudaMemcpy(d_edges, edges, E*sizeof(Edge), cudaMemcpyHostToDevice);
cudaMemcpy(d_dist, dist, V*sizeof(int), cudaMemcpyHostToDevice);
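(For context: the snippet doesn’t show the Edge type or the buffer allocations. Assuming a plain array-of-structs layout, the missing pieces would look roughly like this inside the host setup code; the field names simply mirror how the kernel accesses edges[j]:)

#include <cuda_runtime.h>
#include <cstdlib>

// Assumed edge layout; not shown in the original post.
struct Edge {
    int src;    // source vertex
    int dest;   // destination vertex
    int weight; // edge weight
};

// Host buffers of size E and V
Edge *edges = (Edge*)malloc(E * sizeof(Edge));
int  *dist  = (int*)malloc(V * sizeof(int));

// Device buffers matching the cudaMemcpy calls above
Edge *d_edges;
int  *d_dist;
cudaMalloc(&d_edges, E * sizeof(Edge));
cudaMalloc(&d_dist,  V * sizeof(int));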
No dice:
CPU time: 262911
number of differences: 93778
Time for the kernel: 1290.504150 ms
Fixed. The problem is obvious now that I think about it, but it reduces the speed of the algorithm -_- still MUCH faster than the CPU.
PREVIOUS:
__global__ void BellmanFord(int V, int E, struct Edge* edges, int *dist, int src) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Step 1: Initialize distances from src to all other vertices
    // as INFINITE. While a little wasteful, V will always be smaller than E.
    for (int i = index; i < V; i += stride)
        dist[i] = INT_MAX;
    dist[src] = 0;

    // Step 2: Relax all edges |V| - 1 times. A simple shortest
    // path from src to any other vertex can have at most |V| - 1 edges.
    for (int i = 1; i <= V - 1; i++) {
        for (int j = index; j < E; j += stride) {
            int u = edges[j].src;
            int v = edges[j].dest;
            int weight = edges[j].weight;
            // NOTE: Two threads might try to write to the same memory
            // location at the same time, so we use atomics to avoid that.
            if (dist[u] != INT_MAX && dist[u] + weight < dist[v])
                atomicMin(&dist[v], dist[u] + weight);
        }
        __syncthreads();
    }
    return;
}
NEW:
SetupDist<<<numBlocksV, blockSize>>>(V, d_dist, 0);
cudaDeviceSynchronize();

for (int i = 1; i <= V - 1; i++) {
    BellmanFord<<<numBlocksE, blockSize>>>(V, E, d_edges, d_dist);
    cudaDeviceSynchronize();
}
__global__ void BellmanFord(int V, int E, struct Edge* edges, int *dist) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Step 2: one relaxation pass over all edges. The |V| - 1 repetitions
    // (a simple shortest path can have at most |V| - 1 edges) are now
    // driven by the host loop above, one kernel launch per pass.
    for (int j = index; j < E; j += stride) {
        int u = edges[j].src;
        int v = edges[j].dest;
        int weight = edges[j].weight;
        // NOTE: Two threads might try to write to the same memory
        // location at the same time, so we use atomics to avoid that.
        if (dist[u] != INT_MAX && dist[u] + weight < dist[v])
            atomicMin(&dist[v], dist[u] + weight);
    }
    return;
}
Just like you said for step 1, there was also an issue with step 2: I was iterating the |V| - 1 relaxation passes linearly inside device code, with nothing guaranteeing that one pass finished everywhere before the next one started. Moving that loop to the host and launching a kernel per pass fixed it.
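For completeness, the SetupDist kernel isn’t shown above; judging from the launch SetupDist<<<numBlocksV, blockSize>>>(V, d_dist, 0) and step 1 of the old kernel, it is presumably something along these lines (a sketch, not the actual code):

#include <climits>

// Step 1 pulled out into its own kernel, so the BellmanFord launches
// that follow in the same stream see a fully initialized dist array.
__global__ void SetupDist(int V, int *dist, int src) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < V; i += stride)
        dist[i] = INT_MAX;
    if (index == 0)        // one thread writes the source distance
        dist[src] = 0;
}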