Floyd algorithm problem: CUDA Floyd implementation getting wrong results.

Hi guys, I tried to implement the Floyd algorithm, but somehow I am not getting the right results. When I compare the matrix of path predecessors (d_predMatrix) computed by my CUDA Floyd algorithm against the one produced by the sequential Floyd algorithm, they are not always the same; sometimes they match and sometimes they don't. I thought it might be because kernel launches are executed asynchronously, so I tried calling cudaThreadSynchronize() before checking for errors after the kernel launch, but I still get wrong results, only with a bit more delay… Could anybody point out what I am missing here? I must be doing something wrong. I would really appreciate your help. My code is as follows:

CPU

[codebox] //N: number of vertices in the graph

    //d_weightMatrix and d_predMatrix are 1D matrices of size N*N

dim3 dimBlock, dimGrid;

dimBlock.x = min(16,N);

dimBlock.y = dimBlock.x;

dimGrid.x = (int)ceil((float)N/dimBlock.x);

dimGrid.y = dimGrid.x;

for(int u = 0; u < N ; u++)

{

	floyd<<< dimGrid, dimBlock >>>(d_weightMatrix, d_predMatrix, N,u);

	checkCUDAError("Error executing kernel\n");

}[/codebox]
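(checkCUDAError is just the usual wrapper around cudaGetLastError(); roughly something like this:)

[codebox]//rough sketch of the error checking helper (the exact version is not shown here)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void checkCUDAError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
	{
		fprintf(stderr, "CUDA error: %s: %s\n", msg, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}[/codebox]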

GPU

[codebox]//Kernel

__global__ void floyd(float* weight, int* pred, int N, int u)

{

unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;

unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;

if(v!=w && w!=u && v!=u && v < N && u < N && w < N)

{

	if(weight[v*N + w] > weight[v*N + u] + weight[u*N + w])

	{

		weight[v*N + w] = weight[v*N + u] + weight[u*N + w];

		pred[v*N + w] = u;

	}

}

}

[/codebox]

I only took a quick look, but I can think of a couple of reasons for different results:

  1. If the differences are small, then it is due to floating point error. Nothing much to do about it; it really shouldn't matter.

  2. You are writing to the same spot in global memory twice, overwriting one result with another. (Not sure of this; you should check it out.)

Compile your code with emu debug and start comparing the results from the CPU and GPU one at a time; then you will see where it falls apart.

Well, in debug emulation everything works out perfectly; no matter how big I make the matrix, the result is always correct. That's why I thought the problem could be in how the kernels are launched: each call in the loop depends on the results of the previous one, so if they were running concurrently I would get wrong results, and that would not happen in emulation mode.

I will run a test with a matrix of integers instead of floats to rule out floating point errors. Thanks a lot for the hints!
I will post the results back.

Debug (emulation) actually runs on the CPU, hence the float results are almost the same as with the original code (without CUDA).

Furthermore, the threads are more “ordered” and obviously aren't running in the thousands like on the GPU (in release), and therefore debug (emulation) will often hide different threads writing to the same place, race conditions, etc.

EDIT - Just to support what erdooom wrote :)

eyal

I don’t know anything about Floyd algorithms, and that indexing looks complicated… I really don’t have a clue what is going on here. It appears that you are both writing to and reading from the same array with varied indices. Is there ever a time when one thread writes to the array at a location that another thread reads from? That would cause a race condition on the device that would not be present in emulation mode, as emulation mode runs all threads sequentially.

Well the race condition could happened only if the kernel launches are asynchronously, which I though could be the problem since the results of the processing in one kernel depends on the comparisons made by all the kernels processed before, and in the same kernel launch it would never happened a write to the same address since I build the grid of the kernel of the size of the matrix or larger, letting the thread indexes to map which element to process, that’s why I compare if the variables v, u, w, are smaller than N, the size of the matrix, in case that I build it larger. Please if you still see something wrong please let me know, thank you.

david@quadro:~$ echo "Well the race condition could happened only if the kernel launches are asynchronously, which I though could be the problem since the results of the processing in one kernel depends on the comparisons made by all the kernels processed before, and in the same kernel launch it would never happened a write to the same address since I build the grid of the kernel of the size of the matrix or larger, letting the thread indexes to map which element to process, that's why I compare if the variables v, u, w, are smaller than N, the size of the matrix, in case that I build it larger" | wc -w

108

Congratulations. I have never seen an English sentence with more than 100 words before. On an unreadability scale of 1-10, I give that about 15. I tried reading it about half a dozen times and I couldn’t get to the end of it, let alone understand its contents.

While I am frowning at the harsh comment immediately above this post, I don’t have a clue what that sentence means either. My head is actually hurting after reading that several times.

Looking at it, though, I am convinced there is a race condition there.

I was referring to a race condition among the many threads within a single kernel launch.

Race conditions cannot happen for multiple kernel launches in a row: the driver executes each kernel call in order.

[quote name='avidday' post='538170' date='May 6 2009, 12:08 PM']

[codebox]unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;

unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;[/codebox]

[/quote]

Since I only write to weight[v*N + w] and pred[v*N + w], which are 1D matrices, every thread maps to a different element of the matrix, so the writes happen at different addresses for every thread in the entire grid. Please correct me if I am wrong.

The second matter is that if the kernel calls inside the loop execute asynchronously there will be, of course, a race condition, because the previous kernel would still be executing and its threads would race to write to the same addresses. That is exactly what I want to avoid, since all the comparisons done inside one kernel have to finish before the next one executes.

I hope this clears things up. Best regards!

Sorry, MisterAnderson42 had just cleared this up for me, and I quote:

So the second matter is out of the discussion.

In my opinion you should always be writing to a different array than the one you are reading from.
So you have a data set A that you read from and you write to a data set B.
Then in the next kernel invocation you read from data set B and write to data set A,
simply by swapping out the pointers you pass to the kernel.

That's called a ping-pong: it should avoid any races that may be hidden in your current
code version.
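A minimal sketch of the idea (hypothetical names, assuming two equally sized device weight buffers):

[codebox]//hypothetical sketch of the ping-pong: d_weightA and d_weightB are two
//equally sized device buffers, and floydPingPong is a kernel that takes
//separate read and write pointers. Note that with ping-pong the kernel has
//to write every element of the output buffer (copying unchanged entries),
//not only the ones that improve.
float *d_read  = d_weightA;
float *d_write = d_weightB;

for (int u = 0; u < N; u++)
{
	floydPingPong<<< dimGrid, dimBlock >>>(d_read, d_write, d_interm, N, u);

	//swap roles for the next pass
	float *tmp = d_read;
	d_read = d_write;
	d_write = tmp;
}[/codebox]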

I think no one's fully understood the indexing you're doing. If you posted a more complete example including a test data set, this would help a lot.

Christian

At first I thought this section had a race condition, but on closer inspection it appears it doesn’t, because u is constant, and v and w cannot equal u, so the entire row weight[u*N + w] and the entire column weight[v*N + u] are “read only”, and these are the only values being read. So none of the threads are writing to locations that are being loaded by other threads.

if (v!=w && w!=u && v!=u && v < N && u < N && w < N) {

  if (weight[v*N + w] > weight[v*N + u] + weight[u*N + w]) {

	weight[v*N + w] = weight[v*N + u] + weight[u*N + w];

	pred[v*N + w] = u;

  }

}

I suspect your problem is elsewhere. For example, perhaps weight is not being initialized properly, and maybe in emulation it just happens to be zero, while on the device it is full of junk. Post more of your code and the problem may be apparent.

Well Jamie K, thanks a lot for pointing out that there is no race condition where it seemed there was one. About copying the data into the weight matrix: I ran a test in debug mode printing all the values I had previously created (randomly, just for testing purposes), and they were all there. I am pasting some of my code here now, but I will also upload a test project, which I am currently translating from Spanish, so you can all try it.

This is what I am trying to do: the sequential Floyd algorithm stores the lengths of the paths of a directed graph in a matrix, where weightMatrix[v][w] contains the length of the path from vertex v to vertex w. The algorithm has a triple nested loop that tests, for all paths, which vertex is the best intermediary; in other words, whether it is better to go straight from v to w or to go from v to u and then from u to w. In the latter case it updates the path length and the intermediary vertex. The code looks like this:

[codebox]for (int u=0;u < vertexCount;u++)

for (int v=0;v < vertexCount;v++)

 for (int w=0;w < vertexCount;w++)

 {

if(weightMatrix[v][w]>(weightMatrix[v][u] + weightMatrix[u][w]))

{

 weightMatrix[v][w] = weightMatrix[v][u] + weightMatrix[u][w];

 intermediaryMatrix[v][w] = u;

}

 }[/codebox]

I just create the matrix in 1D and do all the calculations in parallel, changing the intermediary from one kernel call to the next, almost like Floyd, so:

Creating a random matrix for testing purposes, so I can check whether I am doing the correct calculations over several distinct cases:

[codebox]float initialValue = 1;

float **matrix = new float*[n];

for (int i=0; i < n; i++)

matrix[i] = new float[n];

for(int i=0; i<n; i++)

for(int j=0; j<n; j++)

{

if(i==j)

matrix[i][j]= 0;

else

matrix[i][j]= initialValue;

}

for(int i = 0; i < n; i++)

for(int j = 0; j < n; j++)

{

if(matrix[i][j]==initialValue && matrix[i][j] != 0)

{

float value = (float)rand()/n;

if(value > 0)

matrix[i][j] = value;

else

matrix[i][j] = (float)(i + j);

}

}

[/codebox]

Creating the 1D matrices, copying to the device, and executing the kernel:

[codebox]//N: number of vertices in the graph

float *h_weightMatrix1D = new float[N*N];

int *h_intermMatrix1D = new int[N*N];

for (int i=0; i < N; i++)

for (int j=0; j < N; j++)

{

h_weightMatrix1D[i*N + j] = matrix[i][j];

//since all paths are now direct

//they have no intermediary

h_intermMatrix1D[i*N + j] = 0;

}

//both sizes happen to be the same

//but I keep them separate anyway

unsigned int size_Mem_int = sizeof(int) * N * N;

unsigned int size_Mem_float = sizeof(float) * N * N;

int *d_intermediaries = NULL;

float *d_weight = NULL;

CUDA_SAFE_CALL(cudaMalloc((void**) &d_intermediaries, size_Mem_int));

CUDA_SAFE_CALL(cudaMalloc((void**) &d_weight, size_Mem_float));

CUDA_SAFE_CALL(cudaMemcpy(d_intermediaries, h_intermMatrix1D, size_Mem_int, cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(d_weight, h_weightMatrix1D, size_Mem_float, cudaMemcpyHostToDevice));

dim3 dimBlock, dimGrid;

dimBlock.x = min(16,N);

dimBlock.y = dimBlock.x;

dimGrid.x = (int)ceil((float)N/dimBlock.x);

dimGrid.y = dimGrid.x;

for(int u = 0; u < N ; u++)

{

floyd<<< dimGrid, dimBlock >>>(d_weight, d_intermediaries, N,u);

checkCUDAError("Error executing kernel\n");

}

unsigned int timer = 0;

CUT_SAFE_CALL(cutCreateTimer(&timer));

CUT_SAFE_CALL(cutStartTimer(timer));

//right now I only care about the best

//path, not its weight, so I just

//keep the intermediary matrix

CUDA_SAFE_CALL(cudaMemcpy(h_intermMatrix1D, d_intermediaries, size_Mem_int, cudaMemcpyDeviceToHost));

CUT_SAFE_CALL(cutStopTimer(timer));

printf("GPU processing time: %f (ms) \n", cutGetTimerValue(timer));

CUT_SAFE_CALL(cutDeleteTimer(timer));

//create 2D matrix for third party user

int** intermMatrix = new int*[N];

for (unsigned int i=0; i < N; i++)

{

intermMatrix[i] = new int[N];

for(unsigned int j = 0; j < N; j++)

{

intermMatrix[i][j] = h_intermMatrix1D[i*N + j];

}

}

CUDA_SAFE_CALL(cudaFree(d_intermediaries));

CUDA_SAFE_CALL(cudaFree(d_weight));

return intermMatrix;

[/codebox]

I will put this together in another test project for you to run.

Hope you can enlighten me! Thanks a lot, everybody!


That swapping technique is interesting, but it would also require double the data allocation, and in my case, when I have too many vertices, a few thousand (almost 9000), my 9600 GTX with 512 MB will come up a little short, hehe. Thanks a lot; I will keep it in mind for future projects to avoid any problem like that.

Look at my last post to Jamie K; I gave a more specific explanation there. Thanks a lot for the help!

That the driver executes each kernel call in order is perfect. Right now I don't remember reading that in the programming guide; is that where they point it out?
Well, look at my last post to Jamie K; I gave a more specific explanation of the code there, so things may be a little clearer. Thanks a lot for the help!

tmurray, take a look at my last post to Jamie K; it may clear things up. Thanks a lot for the help!

I did some testing, and I believe the difference between the host and the device is due to roundoff. The host will carry 80 bits of precision in its registers, whereas the device will truncate to 32 bits (24-bit mantissa), and in a few cases this can make a difference. For example, consider if x is 1.0 and y is 0.00000001, and you test whether x < x+y. With 32-bit floats, the test will be false, because x and x+y both evaluate to 1.0. But with 80 bits, the test is true: x is 1.0, while x+y is 1.00000001.
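A little host-only sketch of the effect (nothing CUDA-specific, just comparing the 32-bit result with a higher precision one):

[codebox]#include <cstdio>

int main()
{
	float x = 1.0f;
	float y = 0.00000001f;   //smaller than half a float ulp at 1.0 (~6e-8)

	float sum32 = x + y;                                   //rounds back to exactly 1.0f
	long double sum80 = (long double)x + (long double)y;   //keeps the tiny difference

	printf("32-bit float:     x < x+y ? %d\n", x < sum32);                //prints 0
	printf("higher precision: x < x+y ? %d\n", (long double)x < sum80);   //prints 1
	return 0;
}[/codebox]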

If your example is very slightly modified, where the random weights are initialized as

float value = (float)(rand() % 10000);

then the differences disappear. I believe the reason is that there are never small fractional differences where 80 bits vs 32 bits will make a difference.

Also, in my tests, even with initialization value=(float)rand()/n, the differences are all in the intermediates and the weights themselves are identical.

I believe your algorithm is working correctly, and the differences are due to the differences in roundoff behavior between the host and the device.

My test code (same in spirit as original):

[codebox]

__global__ void floyd(float *weight, int *interm, int N, int u) {

unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;

unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;

if (v!=w && w!=u && v!=u && v < N && u < N && w < N) {

	if(weight[v*N + w] > weight[v*N + u] + weight[u*N + w]) {

		weight[v*N + w] = weight[v*N + u] + weight[u*N + w];

		interm[v*N + w] = u;

	}

}

}

void floydTest() {

static const int N = 1000;

float *h_weight = new float[N*N];

int *h_interm = new int[N*N];

for (int i=0; i < N; i++) {

	for (int j=0; j < N; j++) {

		h_interm[i*N+j] = 0;

		if (i == j) {

			h_weight[i*N+j] = 0;

		}

		else {

			h_weight[i*N+j] = (float)(rand() % 10000);

		}

	}

}

dim3 dimBlock, dimGrid;

dimBlock.x = min(16,N);

dimBlock.y = dimBlock.x;

dimGrid.x = (int)ceil((float)N/dimBlock.x);

dimGrid.y = dimGrid.x;

float *d_weight;

cudaMalloc((void**)&d_weight, N*N*sizeof(float));

cudaMemcpy(d_weight, h_weight, N*N*sizeof(float), cudaMemcpyHostToDevice);

int *d_interm;

cudaMalloc((void**)&d_interm, N*N*sizeof(int));

cudaMemcpy(d_interm, h_interm, N*N*sizeof(int), cudaMemcpyHostToDevice);

for(int u = 0; u < N ; u++) {

	floyd<<< dimGrid, dimBlock >>>(d_weight, d_interm, N, u);

}

float *h_weight_result = new float[N*N];

int *h_interm_result = new int[N*N];

cudaMemcpy(h_weight_result, d_weight, N*N*sizeof(float), cudaMemcpyDeviceToHost);

cudaMemcpy(h_interm_result, d_interm, N*N*sizeof(int), cudaMemcpyDeviceToHost);

// calculate on host

for (int u=0;u < N; u++) {

	for (int v=0;v < N; v++) {

		for (int w=0;w < N; w++) {

			if (h_weight[v*N+w] > h_weight[v*N+u] + h_weight[u*N+w]) {

				h_weight[v*N+w] = h_weight[v*N+u] + h_weight[u*N+w];

				h_interm[v*N+w] = u;

			}

		}

	}

}

for (int i=0; i < N; i++) {

	for (int j=0; j < N; j++) {

		if (h_weight[i*N+j] != h_weight_result[i*N+j]) {

			printf("weight difference\n");

			break;

		}

		if (h_interm[i*N+j] != h_interm_result[i*N+j]) {

			 printf("intermediate difference\n");

			break;

		}

	}

}

printf("done\n");

}

[/codebox]

Your post is correct! That was the problem from the beginning, thanks a lot! I tested it myself with a weight matrix of type int to avoid that kind of trouble, and the results were correct.

I have one query now: isn't single precision floating point on the CPU 32 bits, with the mantissa occupying 24 bits? How is that different from single precision floating point on the GPU? The thing is, I would like to keep using floating point for the path lengths, but I don't need too much precision. What could I do in that case?

Also, I now have another issue:

Why is it so slow? Sometimes it is even slower than the CPU for large N. I know I am launching a very large number of kernels, but somehow it should be faster.

I will post my test results in a few moments.
Best regards to everybody! You have all helped me a lot, thank you!

Well, I have made some tests and I realize that some kernels are taking way longer to execute than others. I created a random weight matrix with 5000 elements and ran the code, but this time like this:

[codebox]unsigned int interval = 0;

for(int u = 0; u < N ; u++)

{

CUT_SAFE_CALL(cutCreateTimer(&interval));

CUT_SAFE_CALL(cutStartTimer(interval));

floyd<<< dimGrid, dimBlock >>>(d_weight, d_intermediaries, N,u);

checkCUDAError("Error executing kernel\n");



CUT_SAFE_CALL(cutStopTimer(interval));

printf("Processing intermediate %i: %f (ms) \n", u, cutGetTimerValue(interval));

CUT_SAFE_CALL(cutDeleteTimer(interval));

}[/codebox]

This is what a screenshot of the results looks like:

[codebox]Processing intermediate 269: 0.010490 (ms)

Processing intermediate 270: 1174.885620(ms)

Processing intermediate 271: 0.009435(ms)

Processing intermediate 272: 0.009347(ms)

Processing intermediate 273: 0.009282(ms)

Processing intermediate 274: 0.009763(ms)

Processing intermediate 275: 0.009295(ms)

Processing intermediate 276: 0.010834(ms)

Processing intermediate 277: 0.009549(ms)

Processing intermediate 278: 762.004874(ms)

Processing intermediate 279: 0.010436(ms)

Processing intermediate 280: 0.008085(ms)[/codebox]

Every 10 to 12 kernel launches, the next one takes a very long time to finish execution, even with the same launch configuration. Is that normal? Do you guys know any reason for that to happen?
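For reference, the same loop with an explicit cudaThreadSynchronize() before stopping the timer (just an untested sketch) would look like this:

[codebox]unsigned int interval = 0;

for(int u = 0; u < N ; u++)
{
	CUT_SAFE_CALL(cutCreateTimer(&interval));
	CUT_SAFE_CALL(cutStartTimer(interval));

	floyd<<< dimGrid, dimBlock >>>(d_weight, d_intermediaries, N, u);
	checkCUDAError("Error executing kernel\n");

	//kernel launches return control to the host immediately, so wait for the
	//kernel to actually finish before reading the timer
	cudaThreadSynchronize();

	CUT_SAFE_CALL(cutStopTimer(interval));
	printf("Processing intermediate %i: %f (ms) \n", u, cutGetTimerValue(interval));
	CUT_SAFE_CALL(cutDeleteTimer(interval));
}[/codebox]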

As always, I look forward to your opinions. Thanks.