How do I properly measure kernel / program execution time?

Hi Mates,

First of all, I am using an 8600 GT, and …
I am thoroughly confused by the results I get from cutCreateTimer.
I wrote a program that performs a “dot multiplication”: each thread multiplies just one pair of elements taken from two different matrices. When I multiply, for instance,

[1] matrix A [16 x 16] by B [16 x 1]

there is hardly any difference in time compared with multiplying

[2] matrix A [16384 x 16] by B [16384 x 1]

In the first case the timer shows 0.293241 ms, in the second 0.300312 ms.

When I measured the overhead of the cutCreateTimer functions, it was about 0.12 ms. Since case [1] uses far fewer than 768 threads, I assumed that the time actually spent on the card was 0.293241 - 0.12 ms. I expected case [2], which uses more than 768 threads (as we know, only 768 threads can run at the same time), to take much longer, and I was surprised that there was hardly any difference at all.

All in all, could you please tell me:

  • what kind of timer do you use in your programs to measure execution time on the device?

  • what does cudaThreadSynchronize() do?

  • does anyone have an idea why using more than 768 threads had no influence on the execution time in my case [2]?

Thank you very much for your help,

Y.

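In CUDA 1.0, a call to a kernel function does not wait for completion (it was blocking in the beta release), so the call returns to the CPU while the kernel is still running.
To get accurate timing, you need to call cudaThreadSynchronize() before you stop the timer. Without it, the timer measures mostly the launch itself, which is why both of your cases show nearly the same time.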

If the profiler is enabled (export CUDA_PROFILE=1), the call is blocking, so you can achieve the same result without code modifications.
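To make the effect of the non-blocking launch visible, here is a minimal sketch that times the same launch twice with a host timer, once without and once with a synchronization before the timer is read. The kernel `mulKernel` and the helper `timeLaunchMs` are hypothetical names made up for this illustration, the host timer is `std::chrono` rather than the cutil timer, and the sketch assumes a recent toolkit where the synchronization call is named `cudaDeviceSynchronize()` (the thread's `cudaThreadSynchronize()` plays the same role).

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one multiplication per thread.
__global__ void mulKernel(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] * b[i];
}

// Host wall-clock time in ms around a single launch.
// waitForGpu == false: only the (asynchronous) launch call is measured.
// waitForGpu == true : the timer also covers the kernel execution.
double timeLaunchMs(int n, bool waitForGpu)
{
    float *a, *b, *c;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not included in the timing.
    mulKernel<<<blocks, threads>>>(c, a, b, n);
    cudaDeviceSynchronize();  // also makes sure all earlier GPU work is done

    auto t0 = std::chrono::steady_clock::now();
    mulKernel<<<blocks, threads>>>(c, a, b, n);
    if (waitForGpu) cudaDeviceSynchronize();  // block until the kernel finishes
    auto t1 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();  // let the kernel finish before freeing memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling `timeLaunchMs(n, false)` for a small and a large `n` should give similar values, because only the launch is timed, while `timeLaunchMs(n, true)` should grow with `n` once the problem is large enough to keep the card busy.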

Dear mfatica,

/* I wrote you a private message, but I am not sure whether you got it, so I decided to post it here as well */

So far I have used these instructions to measure the time:


timerMul =0;
CUT_SAFE_CALL(cutCreateTimer(&timerMul));
CUT_SAFE_CALL(cutStartTimer(timerMul));

// execute the kernel
cudaThreadSynchronize();
matrixDOTMul<<< grid, threads >>>(d_C, d_A, d_X, WA, WX, HA,0, norm);

// stop and destroy timerMul
CUT_SAFE_CALL(cutStopTimer(timerMul));
printf("\nProcessing time of MULTIPLICATION: %f (ms)\n", cutGetTimerValue(timerMul));
CUT_SAFE_CALL(cutDeleteTimer(timerMul));


Where should I call cudaThreadSynchronize()?

Should I still use cutCreateTimer() etc., or is there another way to print the kernel execution time?

Maybe there is a function that I can use inside the kernel, instead of wasting time on the overhead of these functions?

Thank you,

Y

Add the cudaThreadSynchronize() call after you launch the kernel, that is, after this line:
matrixDOTMul<<< grid, threads >>>(d_C, d_A, d_X, WA, WX, HA,0, norm);

The cutCreateTimer function runs locally on the CPU and has no influence on the device, so you should still use it if you want to print the execution time to the screen.
You can also use the profiler to get timing results (which usually gives timings with less overhead).
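As another way to time a kernel without a CPU timer, the CUDA runtime offers events that are timestamped on the device. The sketch below is only an illustration under assumptions: it requires a toolkit version that ships the event API (nobody in this thread has confirmed it for CUDA 1.0), and `scaleKernel` and `timeKernelWithEventsMs` are hypothetical names invented for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread scales one element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the time in ms between two events recorded around the launch.
float timeKernelWithEventsMs(int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                        // marker before the kernel
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaEventRecord(stop, 0);                         // marker after the kernel
    cudaEventSynchronize(stop);                       // wait until 'stop' is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Because both markers are recorded in the same stream as the kernel, the measured interval does not include the overhead of a host timer, and `cudaEventSynchronize(stop)` takes over the job of waiting for the kernel to finish.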

OK, I saw it in alignedTypes.cu and implemented it in my program. Can someone who is confident about this tell me whether the following instructions are right?

unsigned int hTimer;
CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

// wait for any earlier GPU work, then start timing from zero
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );

// launch the kernel (the call returns immediately)
matrixDOTMul<<< grid, threads >>>(d_C, d_A, d_X, WA, WX, HA, 0, norm);
CUT_CHECK_ERROR("matrixDOTMul() execution failed\n");

// wait for the kernel to finish before stopping the timer
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer(hTimer) );

float gpuTime = cutGetTimerValue(hTimer);
printf("Time: %f ms \n", gpuTime);

TTFN,

Adam,