Hi guys, I tried to implement the Floyd algorithm, but somehow I am not getting the right results. When I compare the predecessor matrix (d_predMatrix) produced by my CUDA Floyd algorithm with the one produced by the sequential Floyd algorithm, they are not always the same: sometimes they match, sometimes they don't. I thought it might be because kernel launches are executed asynchronously, so I tried calling cudaThreadSynchronize() before checking for errors after each kernel launch, but I still get wrong results, only with a bit more delay. Could anybody point out what I am missing here? I must be doing something wrong. I would really appreciate your help. My code is as follows:
CPU
[codebox] //N: number of vertices in the graph
//d_weightMatrix and d_predMatrix are 1D matrices of size N*N
dim3 dimBlock, dimGrid;
dimBlock.x = min(16,N);
dimBlock.y = dimBlock.x;
dimGrid.x = (int)ceil((float)N/dimBlock.x);
dimGrid.y = dimGrid.x;
for(int u = 0; u < N ; u++)
{
floyd<<< dimGrid, dimBlock >>>(d_weightMatrix, d_predMatrix, N,u);
checkCUDAError("Error executing kernel\n");
}[/codebox]
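checkCUDAError is a small helper along these lines (a sketch, not necessarily the exact code used above): it synchronizes and then inspects the last CUDA error.
[codebox]#include <cstdio>
#include <cstdlib>

//Sketch of a checkCUDAError helper: wait for the launched kernel to finish,
//then report any launch/execution error.
void checkCUDAError(const char* msg)
{
    cudaThreadSynchronize();               //kernel launches are asynchronous, so wait here
    cudaError_t err = cudaGetLastError();  //pick up any pending error
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}[/codebox]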
GPU
[codebox]//Kernel
__global__ void floyd(float* weight, int* pred, int N, int u)
{
unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;
unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;
if(v!=w && w!=u && v!=u && v < N && u < N && w < N)
{
if(weight[v*N + w] > weight[v*N + u] + weight[u*N + w])
{
weight[v*N + w] = weight[v*N + u] + weight[u*N + w];
pred[v*N + w] = u;
}
}
}[/codebox]
Well, in Debug Emulation everything works out perfectly; no matter how big I make the matrix, the result is always correct. That's why I thought the problem could be the kernel launches: each call in the loop passes a new intermediary u and depends on the results of the previous call, so if the launches were running concurrently I would get wrong results, and that would not happen in emulation mode.
I will run a test with a matrix of integers instead of floats to rule out floating point errors. Thanks a lot for the hints!
I will post the results back.
I don’t know anything about Floyd algorithms, and that indexing looks complicated… I really don’t have a clue what is going on here. It appears that you are both writing to and reading from the same array with varied indices. Is there ever a time when one thread writes to the array at a location that another thread reads from? That would cause a race condition on the device that would not be present in emulation mode, as emulation mode runs all threads sequentially.
Well, the race condition could only happen if the kernel launches are asynchronous, which I thought could be the problem, since the results processed in one kernel depend on the comparisons made by all the kernels launched before it. Within a single kernel launch there should never be a write to the same address, because I build the grid to be the size of the matrix (or larger) and let the thread indexes map each element to process; that's why I check whether v, u and w are smaller than N, the size of the matrix, in case the grid is built larger. Please, if you still see something wrong, let me know. Thank you.
david@quadro:~$ echo "Well the race condition could happened only if the kernel launches are asynchronously, which I though could be the problem since the results of the processing in one kernel depends on the comparisons made by all the kernels processed before, and in the same kernel launch it would never happened a write to the same address since I build the grid of the kernel of the size of the matrix or larger, letting the thread indexes to map which element to process, that's why I compare if the variables v, u, w, are smaller than N, the size of the matrix, in case that I build it larger" | wc -w
108
Congratulations. I have never seen an English sentence with more than 100 words before. On an unreadability scale of 1-10, I give that about 15. I tried reading it about half a dozen times and I couldn’t get to the end of it, let alone understand its contents.
While I am frowning at the harsh comment immediately above this post, I don’t have a clue what that sentence means either. My head is actually hurting after reading that several times.
Looking at it, though, I am convinced there is a race condition there.
[codebox]unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;
unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;[/codebox]
Since I only write to weight[v*N + w] and pred[v*N + w], which are 1D matrices, every thread maps to a completely different element of the matrix, so the writes happen at a different address for every thread in the entire grid. Please correct me if I am wrong.
The second point is that if all the kernel calls inside the loop happened asynchronously, there would of course be a race condition, because the previous kernel would still be executing and there would be threads racing to write to the same address. That is exactly what I want to avoid, and also because all the comparisons done inside one kernel call have to finish before the next one executes.
In my opinion you should always be writing to a different array than the one you are reading from. So you have a data set A that you read from and a data set B that you write to. Then in the next kernel invocation you read from data set B and write to data set A, simply by swapping the pointers you pass to the kernel. That's called ping-pong, and it should avoid any races that may be hidden in your current code version.
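A minimal sketch of that ping-pong pattern, assuming two hypothetical weight buffers d_weightA and d_weightB, and a kernel that writes every element of the output (copying the entries it does not improve):
[codebox]//Sketch only: the kernel must write *every* out[] element, otherwise the
//entries it does not improve would be stale in the output buffer.
__global__ void floydPingPong(const float* in, float* out, int* pred, int N, int u)
{
    unsigned int v = __mul24(blockDim.x, blockIdx.x) + threadIdx.x;
    unsigned int w = __mul24(blockDim.y, blockIdx.y) + threadIdx.y;
    if (v < N && w < N)
    {
        float direct  = in[v*N + w];
        float through = in[v*N + u] + in[u*N + w];
        if (v != w && v != u && w != u && through < direct)
        {
            out[v*N + w]  = through;
            pred[v*N + w] = u;
        }
        else
        {
            out[v*N + w] = direct;   //copy the unchanged value
        }
    }
}

//Host side: swap the pointers after every launch.
float* d_in  = d_weightA;
float* d_out = d_weightB;
for (int u = 0; u < N; u++)
{
    floydPingPong<<< dimGrid, dimBlock >>>(d_in, d_out, d_predMatrix, N, u);
    float* tmp = d_in;  d_in = d_out;  d_out = tmp;
}
//After the loop the final result is in d_in (it was the output of the last launch).
[/codebox]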
I think no one's fully understood the indexing you're doing. If you posted a more complete example including a test data set, this would help a lot.
At first I thought this section had a race condition, but on closer inspection it appears it doesn’t, because u is constant, and v and w cannot equal u, so the entire row weight[u*N + w] and the entire column weight[v*N + u] are “read only”, and these are the only values being read. So none of the threads are writing to locations that are being loaded by other threads.
[codebox]if (v!=w && w!=u && v!=u && v < N && u < N && w < N) {
    if (weight[v*N + w] > weight[v*N + u] + weight[u*N + w]) {
        weight[v*N + w] = weight[v*N + u] + weight[u*N + w];
        pred[v*N + w] = u;
    }
}[/codebox]
I suspect your problem is elsewhere. For example, perhaps weight is not being initialized properly, and maybe in emulation it just happens to be zero, while on the device it is full of junk. Post more of your code and the problem may be apparent.
Well Jamie K, thanks a lot for pointing out that there is no race condition where it seemed there was one. About copying the data into the weight matrix: I ran a test in debug mode printing all the values I had created beforehand (randomly, just for testing purposes), and they were all there. I am pasting some of my code here now, and I will also upload a test project that I am currently translating from Spanish so you can all test it.
This is what I am trying to do: the sequential Floyd algorithm stores the lengths of the paths of a directed graph in a matrix where weightMatrix[v][w] contains the length of the path from vertex v to vertex w. The algorithm has a triple nested loop that tests, for every path, which vertex is the best intermediary; in other words, whether it is better to go straight from v to w or to go from v to u and then from u to w. In the latter case it updates the path length and the intermediary vertex. The code looks like this:
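A minimal sketch of that sequential loop, assuming the same 1D N*N layout used in the rest of the thread:
[codebox]//Sequential Floyd: try every vertex u as intermediary for every pair (v,w).
for (int u = 0; u < N; u++)
    for (int v = 0; v < N; v++)
        for (int w = 0; w < N; w++)
            if (weightMatrix[v*N + w] > weightMatrix[v*N + u] + weightMatrix[u*N + w])
            {
                weightMatrix[v*N + w] = weightMatrix[v*N + u] + weightMatrix[u*N + w];
                predMatrix[v*N + w] = u;   //u is the new best intermediary for the path v -> w
            }[/codebox]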
On the GPU I only create the matrix in 1D and do all the calculations in parallel, changing the intermediary from one kernel call to the next, almost like Floyd does, so:
I create a random matrix for testing purposes, so I can check whether I am doing the correct calculations across several distinct cases:
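A sketch of that initialization (the host array names are placeholders), with zero on the diagonal and the rand()-based weights mentioned later in the thread:
[codebox]//Sketch only: 0 on the diagonal, random positive weights elsewhere.
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        h_predMatrix[i*N + j] = 0;
        h_weightMatrix[i*N + j] = (i == j) ? 0.0f : (float)rand()/N;
    }
}[/codebox]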
That ping-pong technique is interesting, but it would also require double the data allocation, and in my case, when I have many vertices, a few thousand (almost 9000), my 9600 GTX with 512 MB will come up a little short, hehe. Thanks a lot, I will keep it in mind for future projects to avoid that kind of problem.
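For a rough idea of the sizes involved (a back-of-the-envelope sketch assuming N = 9000 and 4-byte float/int elements):
[codebox]#include <stdio.h>

//Each N*N matrix of 4-byte elements at N = 9000 is 9000*9000*4 bytes (~309 MiB),
//so a second weight buffer for ping-pong costs roughly another 309 MiB
//on top of the weight and predecessor matrices.
int main()
{
    const long long N = 9000;
    const long long bytesPerMatrix = N * N * 4;
    printf("one N*N matrix        : %lld bytes (~%.0f MiB)\n",
           bytesPerMatrix, bytesPerMatrix / (1024.0 * 1024.0));
    printf("extra ping-pong buffer: another ~%.0f MiB\n",
           bytesPerMatrix / (1024.0 * 1024.0));
    return 0;
}[/codebox]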
Look at my last post to Jamie K, where I gave a more specific explanation. Thanks a lot for the help!
It is great that the driver executes each kernel call in order; right now I don't remember reading that in the programming guide. Is that where they point it out?
Well, look at my last post to Jamie K, where I explained the code more specifically, so things may be a little clearer. Thanks a lot for the help!
I did some testing, and I believe the difference between the host and the device is due to roundoff. The host will carry 80 bits of precision in its registers, whereas the device will truncate to 32 bits (24-bit mantissa), and in a few cases this can make a difference. For example, suppose x is 1.0 and y is 0.00000001, and you test whether x < x+y. With 32-bit floats the test is false, because x and x+y both evaluate to 1.0 (y is below the 32-bit rounding threshold). But with 80 bits the test is true: x is 1.0, while x+y is 1.00000001.
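A tiny host-side illustration of that effect (a sketch; on the host, intermediates may be kept at extended precision unless they are stored back into a float variable, depending on compiler settings):
[codebox]#include <stdio.h>

int main()
{
    float x = 1.0f;
    float y = 0.00000001f;           //below half of FLT_EPSILON (~1.19e-7), so 1.0f + y rounds back to 1.0f

    float  sum32 = x + y;            //stored as a 32-bit float: becomes exactly 1.0f
    double sum64 = (double)x + y;    //kept at higher precision: 1.00000001

    printf("32-bit: x < x + y ? %d\n", sum32 > x);   //typically prints 0
    printf("64-bit: x < x + y ? %d\n", sum64 > x);   //prints 1
    return 0;
}[/codebox]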
If your example is very slightly modified, where the random weights are initialized as
float value = (float)(rand() % 10000);
then the differences disappear. I believe the reason is that there are never small fractional differences where 80 bits vs 32 bits will make a difference.
Also, in my tests, even with initialization value=(float)rand()/n, the differences are all in the intermediates and the weights themselves are identical.
I believe your algorithm is working correctly, and the differences are due to the differences in roundoff behavior between the host and the device.
My test code (same in spirit as original):
[codebox]
__global__ void floyd(float *weight, int *interm, int N, int u) {
unsigned int v = __mul24(blockDim.x,blockIdx.x) + threadIdx.x;
unsigned int w = __mul24(blockDim.y,blockIdx.y) + threadIdx.y;
if (v!=w && w!=u && v!=u && v < N && u < N && w < N) {
if(weight[v*N + w] > weight[v*N + u] + weight[u*N + w]) {
weight[v*N + w] = weight[v*N + u] + weight[u*N + w];
interm[v*N + w] = u;
}
}
}
void floydTest() {
static const int N = 1000;
float *h_weight = new float[N*N];
int *h_interm = new int[N*N];
for (int i=0; i < N; i++) {
for (int j=0; j < N; j++) {
h_interm[i*N+j] = 0;
if (i == j) {
h_weight[i*N+j] = 0;
}
else {
h_weight[i*N+j] = (float)(rand() % 10000);
}
}
}
dim3 dimBlock, dimGrid;
dimBlock.x = min(16,N);
dimBlock.y = dimBlock.x;
dimGrid.x = (int)ceil((float)N/dimBlock.x);
dimGrid.y = dimGrid.x;
float *d_weight;
cudaMalloc((void**)&d_weight, N*N*sizeof(float));
cudaMemcpy(d_weight, h_weight, N*N*sizeof(float), cudaMemcpyHostToDevice);
int *d_interm;
cudaMalloc((void**)&d_interm, N*N*sizeof(int));
cudaMemcpy(d_interm, h_interm, N*N*sizeof(int), cudaMemcpyHostToDevice);
for(int u = 0; u < N ; u++) {
floyd<<< dimGrid, dimBlock >>>(d_weight, d_interm, N, u);
}
float *h_weight_result = new float[N*N];
int *h_interm_result = new int[N*N];
cudaMemcpy(h_weight_result, d_weight, N*N*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_interm_result, d_interm, N*N*sizeof(int), cudaMemcpyDeviceToHost);
// calculate on host
for (int u=0;u < N; u++) {
for (int v=0;v < N; v++) {
for (int w=0;w < N; w++) {
if (h_weight[v*N+w] > h_weight[v*N+u] + h_weight[u*N+w]) {
h_weight[v*N+w] = h_weight[v*N+u] + h_weight[u*N+w];
h_interm[v*N+w] = u;
}
}
}
}
for (int i=0; i < N; i++) {
for (int j=0; j < N; j++) {
if (h_weight[i*N+j] != h_weight_result[i*N+j]) {
printf("weight difference\n");
break;
}
if (h_interm[i*N+j] != h_interm_result[i*N+j]) {
printf("intermediate difference\n");
break;
}
}
}
printf("done\n");
Your post is correct! That was the problem from the beginning, thanks a lot! I tested it myself with a weight matrix of type int to avoid that kind of trouble and the results were correct.
I have one question now: isn't single precision floating point on the CPU also 32 bits, with the mantissa occupying 24 bits? How is that different from single precision floating point on the GPU? The thing is, I would like to keep using floating point for the path lengths, but I don't need too much precision. What could I do in that case?
Also, I have another issue now: why is it so slow??? Sometimes it is even slower than the CPU for a large N. I know I am calling a very large number of kernels, but somehow it should be faster.
I will post my test result in a few moments.
Best regards to everybody! You all helped me a lot, thank you!
Well, I have run some tests and I realized that some kernels take far longer to execute than others. I created a random weight matrix with 5000 elements and ran the code, but this time like this:
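Roughly like this (a sketch of the per-launch timing using CUDA events; the real test code may differ):
[codebox]//Time each kernel launch individually with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int u = 0; u < N; u++)
{
    cudaEventRecord(start, 0);
    floyd<<< dimGrid, dimBlock >>>(d_weightMatrix, d_predMatrix, N, u);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        //wait until this launch has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("u = %d : %f ms\n", u, ms);
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]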
Every 10 to 12 kernel launches, the next one takes a very long time to finish executing, even with the same launch configuration. Is that normal? Do you guys know any reason for that to happen?