Floyd algorithm problem: Floyd CUDA implementation getting wrong results.

Kernel launches are asynchronous, so you need cudaThreadSynchronize() in there to actually make sure the kernel is done before stopping the timer.

Also: cutil should not be used by anyone. Do timing with cudaEvents instead of the CPU-side timers used in cutil; they are far more accurate.
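For example, a minimal cudaEvent timing sketch (the kernel name, launch configuration and arguments are placeholders for your own launch):

[codebox]// Minimal cudaEvent timing sketch; myKernel, grid, block and d_data
// are placeholders, not part of any real project.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // enqueue on the zero stream
myKernel<<<grid, block>>>(d_data, n);   // asynchronous launch
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);             // block until the kernel finishes
float elapsedMs;
cudaEventElapsedTime(&elapsedMs, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

Because the events are recorded on the GPU itself, no separate host-side synchronization is needed around the timed region.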

Intel has been using 80-bit precision FPUs in IA-32 processors since the i387 was first introduced. All floating point arithmetic is performed internally using 80 bits and then rounded to the required result precision. It seems that some x86_64 compilers now favour the vector units over the scalar FPU; those are only double precision, but that is still greater precision than single, even when using single precision types.
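A small host-side illustration of the effect (whether you see 1 or 0 depends entirely on the compiler and its code generation flags, so treat this as a sketch):

[codebox]#include <stdio.h>

int main(void)
{
    /* 1e8f + 1.0f rounds back to 1e8f in pure single precision, so the
       result is 0; with 80-bit x87 intermediates the 1.0 survives and
       the result is 1. Which one you get depends on the compiler. */
    volatile float a = 1e8f;
    volatile float b = 1.0f;
    float r = (a + b) - a;
    printf("(a + b) - a = %f\n", r);
    return 0;
}[/codebox]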

The GPU is pretty much textbook IEEE-754 single precision floating point.

Thanks avidday, very useful information. By the way, I ran another test creating the float weight matrix with no fractional values: I cast all the random numbers to integers so there would be no floating point precision differences between the GPU and CPU results, and the results are correct, as Jamie K confirmed.

I fill the matrix like this:

[codebox]for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
    {
        // Only overwrite entries still holding the initial value,
        // and leave the zero entries (the diagonal) untouched.
        if(matrix[i][j] == initialValue && matrix[i][j] != 0)
        {
            int value = rand() / n;     // integer-valued weight
            if(value > 0)
                matrix[i][j] = (float)value;
            else
                matrix[i][j] = (float)(i + j);
        }
    }[/codebox]

Thanks a lot for the help! Greets!

I will include cudaEvents as soon as I can, excellent idea! It's just that the SDK projects used the cutil timer, so I used it in my code for timing.

About the kernel launches: if they really ran asynchronously with respect to each other, I would not get the same results as the sequential algorithm on the CPU. But I found this in the programming guide, which I found really confusing:

[codebox]4.5.1.5 Asynchronous Concurrent Execution

In order to facilitate concurrent execution between host and device, some runtime functions are asynchronous: control is returned to the application before the device has completed the requested task. These are:

- Kernel launches through __global__ functions or cuLaunchGrid() and cuLaunchGridAsync();

Applications manage concurrency through streams. A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.

Any kernel launch, memory set, or memory copy function without a stream parameter or with a zero stream parameter begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done. Kernel launches for which no stream parameter is provided and memory copies without an Async suffix are assigned to the default zero stream.[/codebox]

This is what I get from it: the application receives control before my kernel (which is a __global__ function) has completed, but any subsequent kernel launch, memory set, or memory copy without a stream parameter will stall until the previous kernel is done, because I do all my kernel launches on the zero stream, right?
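If I read that right, it also explains why my results match the CPU even without explicit synchronization; a sketch of my launch loop (floydStep and the configuration are placeholder names):

[codebox]// Sketch: one kernel launch per value of k, all on the zero stream.
// Each launch begins only after the previous one has finished, so the
// k-loop dependency of Floyd's algorithm is respected without any
// explicit synchronization on the host.
for (int k = 0; k < n; k++)
    floydStep<<<grid, block>>>(d_dist, n, k);

// The host reaches this point long before the kernels finish, which is
// why a CPU-side timer stopped here only measures launch overhead.
cudaThreadSynchronize();   // needed before stopping a host timer[/codebox]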

Now, does that include cudaEvents? And what about the cutil timers?

Here I upload a project so you can all test the timing results for different vertex counts.
I think the slowdown for large values of N is due to thread-scheduling overhead caused by the kernel configuration.

My tests on my current Zotac GeForce 9600 GTX (700 MHz core clock, 1700 MHz shader clock, 512 MB DDR3) were the following:

Kernel best speed range: 65-350 vertices.

Speedups:
65 vertices: 2x
160 vertices: 10x
180 vertices: 5x
200 vertices: 4x

It would be really helpful if anyone with a more powerful GPU would post their execution results, thank you.

A better kernel implementation for really large vertex counts would be to increase the number of elements each thread processes, while keeping a fixed, optimal grid size based on the multiprocessor count of the executing device.
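Something like this grid-stride sketch is what I have in mind (the kernel name, block size and blocks-per-multiprocessor factor are assumptions, not tested code):

[codebox]// Sketch: each thread processes several elements, striding by the
// total number of threads, so the grid size can stay fixed no matter
// how large n gets.
__global__ void floydStepStrided(float *dist, int n, int k)
{
    int stride = gridDim.x * blockDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n * n;
         idx += stride)
    {
        int i = idx / n;
        int j = idx % n;
        float viaK = dist[i * n + k] + dist[k * n + j];
        if (viaK < dist[idx])
            dist[idx] = viaK;
    }
}

// Host side: size the grid from the multiprocessor count.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount * 4;   // 4 blocks/SM is an assumption
floydStepStrided<<<blocks, 256>>>(d_dist, n, k);[/codebox]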

What do you think!?
cudaFloydProject.rar (471 KB)

Is there a way to make your memory access patterns coalesced, or to use shared memory for some extra speed ups?

Yes there is; I am working on that right now, but it is taking me longer than expected. I will post it as soon as it is ready, greetings!
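In the meantime, this is the rough direction I am exploring (just a sketch, assuming a row-major n x n distance matrix; names are illustrative): with consecutive j values across the threads of a warp, the reads of rows i and k are coalesced, and dist[i][k], which is identical for the whole row, can be staged in shared memory once per block.

[codebox]// Sketch of a coalesced per-k step: each block handles a tile of one
// row i, so dist[i*n+k] is loaded into shared memory once and the
// accesses to rows i and k are coalesced across the warp.
__global__ void floydStepCoalesced(float *dist, int n, int k)
{
    __shared__ float dik;              // dist[i][k], shared by the row tile
    int i = blockIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0)
        dik = dist[i * n + k];
    __syncthreads();

    if (j < n)
    {
        float viaK = dik + dist[k * n + j];   // coalesced read of row k
        if (viaK < dist[i * n + j])           // coalesced read of row i
            dist[i * n + j] = viaK;           // coalesced write
    }
}
// Launch sketch: dim3 grid((n + 255) / 256, n);
// floydStepCoalesced<<<grid, 256>>>(d_dist, n, k);[/codebox]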

Kernel launches do not block, but the launch buffer has a certain depth. So the 762 ms, together with the small values above it, is the total time for those kernels.

To do proper timing you need to put a cudaThreadSynchronize() just after your kernel call.
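A sketch of what the queue depth looks like in practice (hostTimeMs() stands in for whatever CPU-side timer is used; names are illustrative):

[codebox]// Sketch: timing each launch with a host timer makes the queued
// launches look free until the launch buffer fills up.
for (int k = 0; k < n; k++)
{
    double t0 = hostTimeMs();                 // hypothetical CPU timer
    floydStep<<<grid, block>>>(d_dist, n, k);
    double t1 = hostTimeMs();
    printf("launch %d: %.3f ms\n", k, t1 - t0);
    // Most iterations print ~0 ms; one suddenly prints a large value
    // (the 762 ms above) when the buffer is full and the call blocks.
}
cudaThreadSynchronize();   // only here has all GPU work finished[/codebox]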

That's certainly true. I thought the cudaThreadSynchronize() was hurting performance with all the waiting, and that's why I removed it. Now the algorithm is really slow, even slower than the CPU; I can hardly believe that using shared memory with a coalesced memory access pattern is mandatory just to get any speedup at all.

Well, let's change the kernel implementation and see what happens. Best regards!