simple CUDA implementation of the Floyd-Warshall graph algorithm

I posted on GitHub a simple CUDA implementation of one of my favorite graph algorithms, the Floyd-Warshall all-pairs shortest path algorithm, with full path reconstruction.

https://github.com/OlegKonings/CUDA_Floyd_Warshall_

Definitely not an ‘embarrassingly parallel’ type of problem, but it still outperforms the standard CPU version by 48+ times (including all GPU memory allocations and copies), and scales well.
It makes no sense to use it on small datasets, but for N > 400 it provides a significant speedup.

If anybody has an idea of how to further improve the CUDA GPU implementation, I would love to hear about it. There is a version which uses matrix operations to get the result, but I believe that version does not provide full path reconstruction. There is also a CUDA BFS implementation out there which will outperform this algorithm if there is a low level of connectivity in the graph.
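For anyone curious, the work-horse relaxation step looks roughly like this (a simplified sketch, not the exact code in the repo):

// Simplified sketch: one thread per matrix element (i,j); the host loops
// over the pivot vertex k and launches the kernel N times. D is the N x N
// distance matrix and P the path-reconstruction matrix, both row-major int
// arrays in device memory. Assumes the "no edge" sentinel is large but
// small enough that the sum of two of them does not overflow an int.
__global__ void fw_relax(int *D, int *P, int N, int k)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // base + tid addressing
    if (idx >= N * N) return;
    int i = idx / N, j = idx % N;
    int via_k = D[i * N + k] + D[k * N + j];
    if (via_k < D[idx]) {
        D[idx] = via_k;        // found a shorter path through k
        P[idx] = P[k * N + j]; // record the detour for path reconstruction
    }
}

// Host side:
// for (int k = 0; k < N; k++)
//     fw_relax<<<(N * N + 255) / 256, 256>>>(D_gpu, P_gpu, N, k);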

The largest data set I have tested so far has been a 6000x6000 matrix (36,000,000 elements) with approximately 9,000,000 directed edges. Since I have had to create separate pairs of result matrices (to verify correctness) for the final shortest-path matrix and the path-reconstruction matrix, I am at the memory limit of my home machine.

Hopefully I will be testing on a K20 soon, and will update the project with those results at that time.

Any ideas for improving the speed are welcome!

I am not familiar with the algorithm, so I can’t comment on the port in algorithmic terms. The work-horse kernel in this code appears to be limited by global memory throughput (you can double-check that hypothesis with a profiler), and it already uses the base+tid addressing pattern, which makes for efficient global memory access.

The one thing I would suggest trying is supplying the maximum amount of information about the kernel arguments to the compiler, by decorating the pointer arguments to the kernel with the modifiers “const” and “__restrict__” (see the Best Practices Guide) as applicable. This may allow some re-ordering of loads for slightly better performance, but for a kernel this small it probably makes no difference. Since it’s a trivial change, it is worth a try.
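For example, on a made-up kernel (names are just for illustration):

// Read-only pointer arguments get "const" and "__restrict__"; output
// pointers get "__restrict__", promising the compiler the pointers do
// not alias so it may reorder or combine loads.
__global__ void relax(const int * __restrict__ weights,
                      int * __restrict__ dist,
                      int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dist[tid] = min(dist[tid], weights[tid]);
}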

Will take your advice about the kernel arguments, thanks!

In general I am interested in finding ways of adapting graph algorithms to work on GPUs, and hopefully will build on this project.

Despite the issues I have been having with the bandwidth of the K20c, I was still able to get a 30% speed increase for the CUDA version of this algorithm, tested on a dense 10,000 x 10,000 adjacency matrix. Here are my initial output results:

//////////////////////////////////////////////////////////////////////////////////////////////////

Success! The GPU Floyd-Warshall result and the CPU Floyd-Warshall results are identical (both final adjacency matrix and path matrix).

N = 10000, and the total number of elements (for the adjacency matrix and the path matrix) was 100000000.
Matrices are in full dense int format (row major) with a minimum of 25000000 valid directed edges.

The CPU timing for all was 3794.15 seconds, and the GPU timing (including all device memory operations: allocations, copies, etc.) for all was 77.8 seconds.

The GPU result was 48.56 times faster than the CPU version.

//////////////////////////////////////////////////////////////////////////////////////////////////////
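For reference, the timing above brackets the entire GPU round trip. A simplified sketch of that pattern (illustrative names; fw_relax is the kernel sketched earlier in the thread, D_host/P_host are the N x N host matrices, and error checking is omitted):

cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);
cudaEventRecord(t0, 0); // start the clock before any device work

int *D_gpu, *P_gpu; // device copies of the distance and path matrices
size_t bytes = (size_t)N * N * sizeof(int);
cudaMalloc((void **)&D_gpu, bytes);
cudaMalloc((void **)&P_gpu, bytes);
cudaMemcpy(D_gpu, D_host, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(P_gpu, P_host, bytes, cudaMemcpyHostToDevice);

for (int k = 0; k < N; k++) // one launch per pivot vertex
    fw_relax<<<(N * N + 255) / 256, 256>>>(D_gpu, P_gpu, N, k);

cudaMemcpy(D_host, D_gpu, bytes, cudaMemcpyDeviceToHost);
cudaMemcpy(P_host, P_gpu, bytes, cudaMemcpyDeviceToHost);
cudaFree(D_gpu);
cudaFree(P_gpu);

cudaEventRecord(t1, 0); // stop only after the frees, so everything is included
cudaEventSynchronize(t1);
float ms = 0.0f;
cudaEventElapsedTime(&ms, t0, t1);
printf("GPU Timing: %.0f ms\n", ms);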

I will be using a similar implementation for other dynamic-programming-type algorithms which have been assumed not to work well in a parallel model.

Can you please provide the code for CUDA Floyd-Warshall… asap…

Are you five years old?
There is a link at the top of the page.

When I run this code of Dijkstra’s algorithm, two errors come up:
1. no such file or directory exists…

2. CUDA build error… no tool found…

Below is my code…
#include <cuda.h>
#include <time.h>
#include "stdafx.h"
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <malloc.h>
#include <string.h>
#define MAXLINES 1000000

///////////////////////////////CODE FOR THRESHOLD BASED GPU SHORTEST PATH ALGO//////////////////////////
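// Overview: a threshold-based SSSP (Dijkstra-style) algorithm. Each
// iteration resets the threshold to "infinity" (MAKETHRESHOLDINFINITY),
// finds the minimum tentative distance among unsettled nodes (THRESHOLD),
// then settles every node at that distance and relaxes its outgoing edges
// (RELAX), until the minimum reaches "infinity". The device "infinity"
// pointer is named INF here rather than INFINITY, to avoid the macro of
// the same name in <math.h>.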
__global__ void MAKETHRESHOLDINFINITY(int *INF, int *minimum)
{
    int id = (blockIdx.x) * (blockDim.x) + threadIdx.x;
    // a single thread resets the threshold to "infinity" before each pass
    if (id == 0)
    {
        *minimum = *INF;
    }
}

__global__ void THRESHOLD(int *Mask_gpu, int *Node_gpu, int *Node_weight_gpu, int *Edge_gpu, int *Weight_gpu, int *INF, int *minimum)
{
    int id = (blockIdx.x) * (blockDim.x) + threadIdx.x;

    // every unsettled node with a finite tentative distance competes
    // for the new threshold
    if ((Mask_gpu[id] != 1) && (Node_weight_gpu[id] < (*INF)))
    {
        atomicMin(minimum, Node_weight_gpu[id]);
    }
}

__global__ void RELAX(int *Mask_gpu, int *Node_gpu, int *Node_weight_gpu, int *Edge_gpu, int *Weight_gpu, int *minimum)
{
    int id = (blockIdx.x) * (blockDim.x) + threadIdx.x;
    int z;
    // settle every node sitting exactly at the threshold distance and
    // relax its outgoing edges
    if ((Mask_gpu[id] != 1) && (Node_weight_gpu[id] == *minimum))
    {
        Mask_gpu[id] = 1;
        for (z = Node_gpu[id]; z < Node_gpu[id + 1]; z++)
            atomicMin(&Node_weight_gpu[Edge_gpu[z]], (Node_weight_gpu[id] + Weight_gpu[z]));
    }
}

///////////////////////////////////////////ADJACENCY LIST CODE PART/////////////////////////////////////////////

//READ ONE LINE OF FILE AT A TIME AND STORE IT IN AN ADJACENCY LINKED LIST.
//SORT THE ADJACENCY LINKED LIST AND STORE THE SORTED ELEMENTS IN THE "IIIT Hyderabad paper" FORMAT.

struct node
{
int destination_node_number;
int edge_weight;
struct node *link;
};
int count(struct node *sort_temp);

void selection_sort(struct node *sort_temp,int m);

void display_list(struct node *sort_temp);

int adjacency_list(struct node *node_list[100],int edge_start_node,int edge_end_node,int edge_weight);

int main()
{
int iNumLines,INFINITY1= 1000000;
char filename[100];
int number_of_node=18;
printf("Enter the file name:");
gets(filename);
char *pszFile;
pszFile=filename;

    int node_weight[18];
	int edge_start_node;
	int edge_end_node;
	int edge_weight;
	struct node *node_list[19];
     
    for(int n=0;n<=18;n++)
	   node_list[n]=NULL;
		
	struct node *sort_temp;
    int m;
		
    FILE *file;
    char sz[256], *p;
    int iLineCur=0, temp=1;
      if((pszFile==0)||strlen(pszFile)==0)
        {
          printf("Error in formate of filename\n");
          return 0;
        }

      file=fopen(pszFile,"r");
      if(file==0)
       {
	      printf("Error: can't oprn %s",pszFile);
	      return 0;
        }

      while(iLineCur<MAXLINES)
        {
	     if(!fgets(sz,256,file)) break;
	     if(strlen(sz)==0) continue;

	     p=sz;

	     int start=1, count=1;
	     char *a,*b;
	     while(p[count]!=',')
	        {
		   count++;
	        }
	     a= new char[count];
	     a[count-1]='\0';
	     memcpy(a,&p[start],count-1);
	     count++;
	     start=count;
	     edge_start_node= atoi(a);
	     delete[] a; // allocated with new[], so delete[] is required
	     while(p[count]!='\n')
	      {
		   count++;
	      }
	     b=new char[count-start+1];
	     b[count-start]='\0';
	     memcpy(b,&p[start],count-start);
	     edge_end_node=atoi(b);
	     delete[] b;

	     if(temp<=10)
	      { 
		   edge_weight=temp;
		   temp++;
	       }
	     else
	      {
		   temp=1;
		   edge_weight= temp;
		   temp++;
	      }
	   
	     printf("%d,%d,%d\n", edge_start_node, edge_end_node,edge_weight);

/////////////////READ ONE LINE OF FILE AND STORE IT IN ADJACENCY LINKED LIST////////////////////////////////////////////////

	     struct node *temp,*temp1, *r;
         if(node_list[edge_start_node]==NULL)
	        {
		     temp=(struct node*)malloc(sizeof(struct node));
		     temp->destination_node_number=edge_end_node;
		     temp->edge_weight=edge_weight;
		     temp->link=NULL;
		     node_list[edge_start_node]=temp;
		
		    //printf("%d,%d,%d,%u\n",edge_start_node[i],temp->destination_node_number,temp->edge_weight,temp);
	         }
	      else
	         {
		      temp1=node_list[edge_start_node];
		      while(temp1->link!=NULL)
			  temp1=temp1->link;

		      r=(struct node*)malloc(sizeof(struct node));
		      r->destination_node_number=edge_end_node;
		      r->edge_weight=edge_weight;
		      r->link=NULL;
		      temp1->link=r;
		   
	          }
	 		 
	      iLineCur++;
	  

        }//WHILE LOOP END


  
                     /////Define according to Graph size////

//////////////////////SORT THE ADJACENCY LINKED LIST AND STORE THE SORTED ELEMENTS IN THE "IIIT Hyderabad paper" FORMAT
int edge_index=0;
int Node[19];
int Edge[40];
int Weight[40];
int k;
for( k=0; k<18; k++)// here limit of k will be the number of nodes in graph.
{
sort_temp=node_list[k];
//printf("%u,%d\n", sort_temp,sort_temp->destination_node_number);
m=count(sort_temp);
selection_sort(sort_temp,m);
//display_list(sort_temp);
printf("%d=\n",k);
//store the list, indexed by the Node array into the Edge array, similar to the IIIT Hyderabad paper (but in sorted format)
if (node_list[k]!=NULL)
{
Node[k]=edge_index;
struct node *visit1=node_list[k];
while(visit1!=NULL)
{
Edge[edge_index]=visit1->destination_node_number;
Weight[edge_index]=visit1->edge_weight;
printf("%d–%d\n",Edge[edge_index],Weight[edge_index]);
visit1=visit1->link;
edge_index++;
}
}
else
{
Node[k]=edge_index;//if there is no outgoing edge for any node
}

       }
       
       Node[k]=edge_index; //an extra Node array entry to get the number of outgoing edges of the last node
       
       int dps;
      for(dps=0;dps<=18;dps++)
       printf("%d-",Node[dps]);
         printf("\n");
      for(dps=0;dps<40;dps++)
       printf("%d-",Edge[dps]);
        printf("\n");
       for(dps=0;dps<40;dps++)
       printf("%d-",Weight[dps]);
       printf("\n");

printf("successfully read the file and the number of lines is %d\n", iLineCur);

///////////////////////////SERIAL CODE FOR THRESHOLD BASED SSSP ALGORITHM////////////////////////////////////////

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

///////////////////////////DEFINE VARIABLES AND CALL CUDA FUNCTIONS FOR GPU EXECUTION//////////////////////
int Mask[18],xyz;
for(xyz=0;xyz<18;xyz++)
Mask[xyz]=0;

int Node_weight[18],xyz1,minimum1=0;
for(xyz1=0;xyz1<18;xyz1++)
Node_weight[xyz1]=1000000;

   int source_node=3;
   Node_weight[3]=0;
          
   ///RELAX THE EDGES OF THE SOURCE NODE FIRST////////
   int x,y,z;
   x=Node[3];
   y=Node[4];
   Mask[3]=1;
   for(z=x;z<y;z++)
        {
   if( Node_weight[Edge[z]]>(Node_weight[3] + Weight[z]))
          {
             Node_weight[Edge[z]]=(Node_weight[3] + Weight[z]);
           }
         }

   
   ////////////////////////////////////////////////////////  
   

int *Mask_gpu, *Node_gpu, *Node_weight_gpu, *Edge_gpu, *Weight_gpu, *minimum, *INF;

size_t size = (18) * sizeof(int);
size_t size3 = (19) * sizeof(int);
size_t size1 = sizeof(int);
size_t size2 = (40) * sizeof(int);
cudaMalloc((void **) &Mask_gpu, size);
cudaMalloc((void **) &Node_gpu, size3);
cudaMalloc((void **) &Node_weight_gpu, size);
cudaMalloc((void **) &minimum, size1);
cudaMalloc((void **) &Edge_gpu, size2);
cudaMalloc((void **) &Weight_gpu, size2);
cudaMalloc((void **) &INF, size1);

cudaMemcpy(Mask_gpu, Mask, size, cudaMemcpyHostToDevice);
cudaMemcpy(Node_gpu, Node, size3, cudaMemcpyHostToDevice);
cudaMemcpy(Node_weight_gpu, Node_weight, size, cudaMemcpyHostToDevice);
cudaMemcpy(minimum, &minimum1, size1, cudaMemcpyHostToDevice);
cudaMemcpy(INF, &INFINITY1, size1, cudaMemcpyHostToDevice);
cudaMemcpy(Edge_gpu, Edge, size2, cudaMemcpyHostToDevice);
cudaMemcpy(Weight_gpu, Weight, size2, cudaMemcpyHostToDevice);

int block_size = 18; // one thread per node; works only because this test graph has 18 nodes
int n_blocks = 1;

cudaEvent_t start,stop; 
cudaEventCreate(&start); 
cudaEventCreate(&stop); 

cudaEventRecord(start, 0); 

while(minimum1<INFINITY1)
{
MAKETHRESHOLDINFINITY<<<n_blocks, block_size>>> (INF, minimum);
THRESHOLD<<< n_blocks, block_size >>> (Mask_gpu, Node_gpu, Node_weight_gpu, Edge_gpu, Weight_gpu, INF, minimum);
cudaMemcpy(&minimum1,minimum, size1, cudaMemcpyDeviceToHost);
//printf("%d\n",minimum1);
RELAX<<< n_blocks, block_size >>> (Mask_gpu,Node_gpu,Node_weight_gpu,Edge_gpu,Weight_gpu,minimum);

 }
   
cudaEventRecord(stop, 0);  
cudaEventSynchronize(stop);  
float elapsedTime;  
cudaEventElapsedTime(&elapsedTime, start, stop);  
printf("time required : %f milliseconds\n", elapsedTime);  

cudaEventDestroy(start); 
cudaEventDestroy(stop); 

   
   
   
cudaMemcpy(Node_weight,Node_weight_gpu, size, cudaMemcpyDeviceToHost);
cudaMemcpy(Mask,Mask_gpu, size, cudaMemcpyDeviceToHost);

int spa2;
for(spa2=0;spa2<=17;spa2++)
   printf("%d\n",Node_weight[spa2]);
for(spa2=0;spa2<=17;spa2++)
   printf("%d\n",Mask[spa2]);

cudaFree(Mask_gpu);
cudaFree(Node_gpu);
cudaFree(Node_weight_gpu);
cudaFree(Edge_gpu);
cudaFree(Weight_gpu);
cudaFree(INF);
cudaFree(minimum);

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

getche();
return 0;

}

int count(struct node *sort_temp)
{
int c=0;
while(sort_temp!=NULL)
{
sort_temp=sort_temp->link;
c++;
}
return c;
}

void selection_sort(struct node *sort_temp,int m)
{
int x,y,z,temp2,temp3;
struct node *p, *q;
p=sort_temp;
if(m>1)
{
for(x=1;x<m;x++)
{
q=p->link;

  for(y=x+1;y<=m;y++)
  {
	  if(p->edge_weight> q->edge_weight)
	  {
		  temp2=p->edge_weight;
		  temp3=p->destination_node_number;
		  p->edge_weight=q->edge_weight;
		  p->destination_node_number=q->destination_node_number;
		  q->edge_weight=temp2;
		  q->destination_node_number=temp3;
	  }
	  q=q->link;
  }
  p=p->link;

}
}
//free(p);
//free(q);
}

void display_list(struct node *sort_temp)
{
struct node *visit;
visit=sort_temp;

while(visit!=NULL)
{
	printf("%d,%d\n",visit->destination_node_number,visit->edge_weight);
	visit=visit->link;
}

}

This is a different algorithm, and not a question about an implementation of Floyd-Warshall in CUDA.

Also use code blocks if you really want help, rather than pasting the above.

Hi, I implemented the blocked Floyd-Warshall in CUDA. It is more efficient than the normal implementation.
https://github.com/MTB90/CUDA_Blocked_Floyd-Warshall (blocked-fw-cuda.cu)
But this is not the final version. This implementation is ~400 times faster than the sequential algorithm. Sorry for the mistakes in this text; I do not write well in English. :)
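The basic idea, as a simplified sketch (the code in the repository differs in the details): the matrix is processed in TILE x TILE blocks, and each round first relaxes the pivot block entirely in shared memory, then the pivot row/column blocks, then the rest.

#define TILE 16 // tile width (illustrative; assumes N is a multiple of TILE)

// Phase 1 of a round: relax the pivot tile for all TILE internal k steps
// in shared memory before writing back; this data reuse is where the
// speedup over the naive one-k-per-launch version comes from.
__global__ void fw_phase1(int *D, int N, int kb)
{
    __shared__ int tile[TILE][TILE];
    int i = kb * TILE + threadIdx.y;
    int j = kb * TILE + threadIdx.x;
    tile[threadIdx.y][threadIdx.x] = D[i * N + j];
    __syncthreads();
    for (int k = 0; k < TILE; k++) {
        // the race on row/column k is benign while the diagonal is zero
        int via = tile[threadIdx.y][k] + tile[k][threadIdx.x];
        if (via < tile[threadIdx.y][threadIdx.x])
            tile[threadIdx.y][threadIdx.x] = via;
        __syncthreads();
    }
    D[i * N + j] = tile[threadIdx.y][threadIdx.x];
}

// Launched once per round as: fw_phase1<<<1, dim3(TILE, TILE)>>>(D_gpu, N, kb);
// Phases 2 and 3 relax the pivot row/column tiles and the remaining tiles
// the same way, against the tiles computed in the earlier phases.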

Good to see someone posting code!

This version does not store the actual (step-by-step) path for each valid path, which is done in my implementation (in addition to the value of the shortest path, which both implementations calculate).

That makes a big difference in the running time, because this algorithm is memory bound.
In other words, my implementation is doing as many as 2x the number of global allocations, reads, and writes, because it is updating/storing the paths as well as the values.

Also, I have an improved version which is about 50% faster (for full path reconstruction). For 4000 dense vertices on a GTX 780 it now takes 3 seconds for the round trip.

If your version is 400 times faster than this one, then your running time for 4000 dense vertices (including paths) should be approx. 7.5 ms round trip.

Where is the timing portion of the code?

Nice to see that someone is working with this algorithm.

Yes, you are right, this implementation doesn’t store the actual (step-by-step) path for each valid path; if I have some time I will add this, and then add information about the times to GitHub.

And I don’t say that my code is 400 times faster than yours; it is 400 times faster than the sequential code that is also in the repository.

I use an Intel® Core™ i7-2600K CPU @ 3.40GHz with a GeForce GTX 480; for example, a dense graph with |V| = 10000 takes 11.63 s to calculate, while the sequential code takes 5228 s.

The times are not a very good measure; I simply use the time command in Linux (time ./app.out), so the times also include the time to read the data from the file. In a few weeks I will add a table with times and add full path reconstruction. This algorithm is based on the article http://arxiv.org/pdf/1001.4108.pdf

Can I use your timing code to measure time? That way of measuring time would be better.
I will also add some generated tests to the repository. This is the first version of the blocked Floyd-Warshall.

Hello CudaaduC!

I have downloaded your implementation but I can’t get it to work. I have read that you have improved it, but I don’t know if the version at GitHub is the good one or the first implementation.

It works fine on Windows 7/8, and there are just a few adjustments that need to be made for Linux.

The blocked version referred to in this thread is faster for large N.

Just copied and pasted the code into Visual Studio, compiled, and ran it on a 1.2 GHz GTX 980 vs a 4.5 GHz i7, and this was the output for a small N=700:

Floyd-Warshall on CPU underway:
CPU Timing: 221ms

Floyd-Warshall on GPU underway:
GPU Timing(including all device-host, host-device copies, device allocations and freeing of device memory): 45ms

Verifying results of final adjacency Matrix and Path Matrix.
Adjacency Matrices Equal!
Path reconstruction Matrices Equal!
Enter start vertex #:300
Enter dest vertex(enter negative number to exit) #:400

Here is the shortest cost path from 300 to 400, at a total cost of 39.
From 300 to 511 at a cost of 2
From 511 to 630 at a cost of 23
From 630 to 400 at a cost of 14

Enter start vertex #:400
Enter dest vertex(enter negative number to exit) #:300

Here is the shortest cost path from 400 to 300, at a total cost of 57.
From 400 to 277 at a cost of 10
From 277 to 443 at a cost of 3
From 443 to 604 at a cost of 6
From 604 to 300 at a cost of 38

Enter start vertex #: