Regarding the “no more acc accelerator timing” problem:
I am still using the same sample, “f3.exe”,
and following “pgprof13ug.pdf”, pages 22-25.
With pgi/12.10: ALL OK
pgfortran --version
pgfortran 12.10-0 64-bit target on x86-64 Linux -tp nehalem
=> compile and run the sample f3.exe with the “ccff”, “-g”, etc. options
pwd
/home/escj/dir_PGF/PGI_HOME/linux86-64/13.1/etc/samples/accel
pgfortran -g -o f3.exe f3.f90 -ta=nvidia -Minfo=accel,ccf,all -fast
...
pgcollect -time -cudainit f3.exe 5000
0 errors found
450158 microseconds on GPU
305996 microseconds on host
target process has terminated, writing profile data
pgprof -exe f3.exe
The PGPROF window shows a view very similar to Fig. 2.12, p. 24 of pgprof13ug.pdf
=> 4 columns
less pg.txt
Profiled: ./f3.exe on Wed Feb 06 10:39:31 CET 2013
| Function | Seconds | Accelerator Region Time | Accelerator Kernel Time |
| __select_nocancel | 1.3908 = 46% | 0 = 0% | 0 = 0% |
| main | 0.8046 = 27% | 0 = 0% | 0 = 0% |
| smoothhost | 0.3448 = 11% | 0 = 0% | 0 = 0% |
| __GI_sched_yield | 0.3448 = 11% | 0 = 0% | 0 = 0% |
| sstk | 0.0460 = 2% | 0 = 0% | 0 = 0% |
| __c_mcopy4 | 0.0460 = 2% | 0 = 0% | 0 = 0% |
| __lll_lock_wait_private | 0.0115 = 0% | 0 = 0% | 0 = 0% |
| do_lookup_x | 0.0115 = 0% | 0 = 0% | 0 = 0% |
| smooth | 0 = 0% | 0.7663 = 100% | 0.3158 = 100% |
Here smooth takes the place of the mm1 function in the user guide.
Drilling down into smooth shows where in the subroutine the time is spent, per region and per kernel accelerated by the directives:
less smooth.txt
Profiled: ./f3.exe on Wed Feb 06 10:39:31 CET 2013
| Line | Source | Seconds | Accelerator Region Time | Accelerator Kernel Time |
| | subroutine smooth( a, b, w0, w1, w2, n, m, niters ) | 0 = 0% | 0 = 0% | 0 = 0% |
| | real, dimension(:,:) :: a,b | 0 = 0% | 0 = 0% | 0 = 0% |
| | real :: w0, w1, w2 | 0 = 0% | 0 = 0% | 0 = 0% |
| | integer :: n, m, niters | 0 = 0% | 0 = 0% | 0 = 0% |
| | integer :: i, j, iter | 0 = 0% | 0 = 0% | 0 = 0% |
| | !$acc data region copy(a(:,:)) copyin(b(:,:)) | 0 = 0% | 0.4501 = 59% | 0 = 0% |
| | do iter = 1,niters | 0 = 0% | 0 = 0% | 0 = 0% |
| | !$acc region | 0 = 0% | 0.3161 = 41% | 0 = 0% |
| | do i = 2,n-1 | 0 = 0% | 0 = 0% | 0 = 0% |
| | do j = 2,m-1 | 0 = 0% | 0 = 0% | 0.2077 = 66% |
| | a(i,j) = w0 * b(i,j) + & | 0 = 0% | 0 = 0% | 0 = 0% |
With pgi/13.1: PROBLEM, NO MORE DATA/KERNEL COLUMNS
pgfortran --version
pgfortran 13.1-1 64-bit target on x86-64 Linux -tp nehalem
pgfortran -g -o f3.exe f3.f90 -ta=nvidia -Minfo=accel,ccf,all -fast
...
pgcollect -time -cudainit f3.exe 5000
0 errors found
451390 microseconds on GPU
282197 microseconds on host
target process has terminated, writing profile data
pgprof -exe f3.exe
The sample spends 451390 microseconds (about 451 ms) on the GPU, but the PGPROF window now shows:
less pg131.txt
Profiled: ./f3.exe on Wed Feb 06 11:02:06 CET 2013
| Function | Seconds |
| __select_nocancel | 1.3678 = 48% |
| main | 0.8046 = 28% |
| __GI_sched_yield | 0.3678 = 13% |
| smoothhost | 0.2989 = 10% |
| sstk | 0.0230 = 1% |
| __lll_lock_wait_private | 0.0115 = 0% |
=> The smooth routine accelerated by the acc directives no longer appears; only the host version (smoothhost) is shown.
=> No more region/kernel timing
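To see at a glance which functions disappeared between the two exports, the function columns can be compared. This is only a sketch: it assumes both profiles were saved as text in the pipe-separated layout shown here, and the helper names `profile_functions` and `missing_functions` are made up for the illustration.

```shell
# Hypothetical helper: print the function names (first table column)
# of a pgprof text export, sorted. Lines without '|' and the header row
# are skipped.
profile_functions() {
  awk -F'|' 'NF > 2 && $2 !~ /Function/ {
    gsub(/^ +| +$/, "", $2)   # trim the padding around the name
    print $2
  }' "$1" | sort
}

# Hypothetical helper: names present in the first profile
# but missing from the second one.
missing_functions() {
  old=$(mktemp) && new=$(mktemp) || return 1
  profile_functions "$1" > "$old"
  profile_functions "$2" > "$new"
  comm -23 "$old" "$new"      # keep lines unique to the first file
  rm -f "$old" "$new"
}
```

Called as `missing_functions pg.txt pg131.txt`, this should list smooth among the functions that the 13.1 profile no longer reports.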
Remark:
enabling the “pgcollect -cuda” option gives some information on the GPU kernels generated by the compiler,
but the profile obtained this way is completely flattened, and the relation to the smooth source code is completely lost!
less pg131_cuda.txt
Profiled: ./f3.exe on Wed Feb 06 11:11:31 CET 2013
| Function | Seconds | CUDA GPU Secs | CUDA CPU Secs |
| __select_nocancel | 1.3596 = 48% | 0 = 0% | 0 = 0% |
| main | 0.7865 = 28% | 0 = 0% | 0 = 0% |
| __GI_sched_yield | 0.3596 = 13% | 0 = 0% | 0 = 0% |
| smoothhost | 0.3146 = 11% | 0 = 0% | 0 = 0% |
| sstk | 0.0225 = 1% | 0 = 0% | 0 = 0% |
| __lll_lock_wait_private | 0.0112 = 0% | 0 = 0% | 0 = 0% |
| smooth_28_gpu | 0 = 0% | 0.2128 = 57% | 0.0001 = 0% |
| smooth_35_gpu | 0 = 0% | 0.1051 = 28% | 0 = 0% |
| memcpyDtoHasync | 0 = 0% | 0.0194 = 5% | 0.0197 = 34% |
| memcpyHtoDasync | 0 = 0% | 0.0374 = 10% | 0.0374 = 65% |
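As a stopgap, the kernel names can be tied back to the source by hand. The sketch below assumes that the number in names such as smooth_28_gpu is the source line the kernel was generated from. The names suggest this, but I have not confirmed it. The helper name `kernel_line` is made up for the illustration.

```shell
# Hypothetical helper: extract the numeric part of a generated kernel name,
# e.g. smooth_28_gpu -> 28. Assumes the <routine>_<number>_gpu pattern;
# a name without an underscore (memcpyDtoHasync) is printed unchanged.
kernel_line() {
  name=${1%_gpu}                # drop the trailing "_gpu", if any
  printf '%s\n' "${name##*_}"   # keep what follows the last underscore
}
```

With that, `sed -n "$(kernel_line smooth_28_gpu)p" f3.f90` prints the corresponding line of the sample source, which is a poor substitute for the per-line region/kernel columns that pgi/12.10 displayed.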
See you,
Juan