Floyd algorithm problem: Floyd CUDA implementation getting wrong results.

Kernel launches are asynchronous, so you need cudaThreadSynchronize() in there to actually make sure the kernel is done before stopping the timer.

Also: cutil should not be used by anyone. Do timing with cudaEvents instead of the CPU-side timers used in cutil; they are far more accurate.
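A minimal sketch of what event-based timing looks like (the kernel name `floydKernel`, its launch configuration, and its arguments are placeholders, not the poster's actual code):

```cuda
// Sketch: timing a kernel with cudaEvents instead of cutil's CPU timers.
// floydKernel, grid, block, d_matrix, n and k are illustrative names.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // record on the zero stream
floydKernel<<<grid, block>>>(d_matrix, n, k);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // block until the kernel is done

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU itself, the measurement does not depend on when control returns to the host.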

Intel has been using 80-bit precision FPUs in IA32 processors since the i387 was first introduced. All floating point arithmetic is performed internally using 80 bits and then rounded to the required result precision. It seems that some x86_64 compilers now favour the vector units over the scalar FPU; those are only double precision, but the precision is still greater than single, even when using single precision types.

The GPU is pretty much textbook IEEE-754 single precision floating point.

Thanks avidday, very useful information. By the way, I ran another test creating the float weight matrix with no fractional values: I cast all the random numbers to integer, so there would be no floating point precision differences between the GPU and CPU results, and the results are correct, as Jamie K confirmed.

I fill the matrix like this:

[codebox]for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
        if(matrix[i][j] == initialValue && matrix[i][j] != 0)
        {
            int value = rand() / n;
            if(value > 0)
                matrix[i][j] = (float)value;
            else
                matrix[i][j] = (float)(i + j);
        }[/codebox]



Thanks a lot for the help! Greets!

I will include cudaEvents as soon as I can, excellent idea! It's just that the SDK projects used the cutil timer, so that's what I used in my code for timing.

About the kernel launches: if they actually ran concurrently with each other, I would not get the same results as the sequential CPU algorithm. But I found this in the programming guide, and it really confused me:

[codebox]Asynchronous Concurrent Execution

In order to facilitate concurrent execution between host and device, some runtime functions are asynchronous: Control is returned to the application before the device has completed the requested task. These are:

- Kernel launches through __global__ functions or cuLaunchGrid() and cuLaunchGridAsync();

Applications manage concurrency through streams. A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.

Any kernel launch, memory set, or memory copy function without a stream parameter or with a zero stream parameter begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done. Kernel launches for which no stream parameter is provided and memory copies without an Async suffix are assigned to the default zero stream.[/codebox]

This is how I read it: the application gets control back before my kernel (which is a __global__ function) has completed, but the next kernel launch, memory set, or memory copy without a stream parameter will stall until the previous kernel is done, because I do all my kernel launches with the zero stream parameter, right?

Now, does that include cudaEvents? And what about cutil timers?

Here I upload a project so you all can test the time results for different vertices amount.
I think the slowdown for high values of N is due to thread-scheduling overhead caused by the kernel configuration.

My tests on my current Zotac GeForce 9600 GTX (700 MHz core clock, 1700 MHz shader clock, 512 MB DDR3) were the following:

kernel best speed range: 65-350 vertices
speedup over the CPU:
65 vertices - 2x
160 vertices - 10x
180 vertices - 5x
200 vertices - 4x

It would be really helpful if anyone with a more powerful GPU would post their execution results, thank you.

A better kernel implementation for really large vertex counts would be to increase the number of elements each thread processes, keeping a fixed optimal grid size that depends on the multiprocessor count of the executing device.
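That idea, a fixed-size grid where each thread walks over several elements, is usually written as a grid-stride loop. A sketch of what one Floyd-Warshall step could look like in that style (kernel name and arguments are illustrative, not from the posted project):

```cuda
// Sketch of a grid-stride loop: the grid size stays fixed (e.g. a small
// multiple of the multiprocessor count) and each thread strides over the
// whole n*n distance matrix. One launch per intermediate vertex k.
__global__ void floydStepStrided(float *d, int n, int k)
{
    int stride = blockDim.x * gridDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n * n;
         idx += stride) {
        int i = idx / n;
        int j = idx % n;
        float via = d[i * n + k] + d[k * n + j];  // path i -> k -> j
        if (via < d[i * n + j])
            d[i * n + j] = via;
    }
}
```

The host loop over k stays unchanged; the grid size can be chosen once from the multiProcessorCount field returned by cudaGetDeviceProperties().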

What do you think!?
cudaFloydProject.rar (471 KB)

Is there a way to make your memory access patterns coalesced, or to use shared memory for some extra speed ups?

Yes there is, I am working on that right now, but it's taking me quite a while. I will post it as soon as it's ready, greetings!
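For reference, a 2-D kernel in which consecutive threads in a block read consecutive j addresses, so the loads of d[k][j] and d[i][j] coalesce, could look like this (a sketch with illustrative names, not the poster's code):

```cuda
// Sketch: one thread per (i, j) cell, one launch per k.
// threadIdx.x varies fastest, so consecutive threads touch consecutive
// j indices: the reads of d[k*n+j] and d[i*n+j] are coalesced.
__global__ void floydStep2D(float *d, int n, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;

    float via = d[i * n + k] + d[k * n + j];
    if (via < d[i * n + j])
        d[i * n + j] = via;
}
```

Since d[i*n+k] is the same for every thread in a row and d[k*n+j] is shared by every row of the block, both are candidates for staging in shared memory as a further step.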

Kernel launches do not block, but the launch buffer has a certain depth. So the 762 ms, together with the small values above, is the total time for those kernels.

To do proper timing you need to put a cudaThreadSynchronize() just after your kernel call.

That's certainly true. I thought cudaThreadSynchronize() was hurting performance with all the waiting, and that's why I removed it. Now the algorithm is really slow, even slower than the CPU; I can't believe it's mandatory to use shared memory and coalesced memory access patterns just to get at least some speedup.

Well, let's change the kernel implementation and see what happens. Best regards!