I’m continuing to run silly test programs to get a feel for what flies and what does not. The results are still utterly confusing. I’m starting to doubt everything, including myself and the timer, but if I am making a mistake, I can’t find it.

The C and FORTRAN programs are doing the same computations and are indeed producing the same results, but the run times would not make a good commercial for CUDA FORTRAN. [Bad news for me is, of course, also a distinct possibility :-)]

Here are some results for relatively long runs:

CUDA C Mflops = 1.779e+05 n = 5.000e+06 nthread = 1024 nblock = 128 tot = 6.554e+11

FORTRAN Mflops = 1.087E+05 n = 5.000E+06 nthread = 1024 nblock = 128 tot = 6.554E+11

FORTRAN Mflops = 2.376E+03 n = 5.000E+05 nthread = 1024 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.764e+05 n = 5.000e+07 nthread = 1024 nblock = 128 tot = 6.554e+12

FORTRAN Mflops = 1.280E+05 n = 5.000E+07 nthread = 1024 nblock = 128 tot = 6.554E+12

FORTRAN Mflops = 2.407E+03 n = 5.000E+06 nthread = 1024 nblock = 128 tot = 6.554E+11

CUDA C Mflops = 1.630e+05 n = 1.000e+07 nthread = 512 nblock = 128 tot = 6.554e+11

FORTRAN Mflops = 1.002E+05 n = 1.000E+07 nthread = 512 nblock = 128 tot = 6.554E+11

FORTRAN Mflops = 2.405E+03 n = 1.000E+06 nthread = 512 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.565e+05 n = 1.000e+08 nthread = 512 nblock = 128 tot = 6.554e+12

FORTRAN Mflops = 1.144E+05 n = 1.000E+08 nthread = 512 nblock = 128 tot = 6.554E+12

FORTRAN Mflops = 2.392E+03 n = 1.000E+07 nthread = 512 nblock = 128 tot = 6.554E+11

CUDA C Mflops = 1.412e+05 n = 2.000e+07 nthread = 256 nblock = 128 tot = 6.554e+11

FORTRAN Mflops = 9.051E+04 n = 2.000E+07 nthread = 256 nblock = 128 tot = 6.554E+11

FORTRAN Mflops = 2.383E+03 n = 2.000E+06 nthread = 256 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.402e+05 n = 2.000e+08 nthread = 256 nblock = 128 tot = 6.554e+12

FORTRAN Mflops = 1.037E+05 n = 2.000E+08 nthread = 256 nblock = 128 tot = 6.554E+12

FORTRAN Mflops = 2.388E+03 n = 2.000E+07 nthread = 256 nblock = 128 tot = 6.554E+11

The main result is Mflops, the number of floating-point operations per second, in units of 1,000,000. Each run is done three times: the first is C on the GPU; the other two are FORTRAN, number 2 on the GPU and number 3 in emulation mode on a CPU. For these runs at least, the GPU delivers a real speedup of almost two orders of magnitude. FORTRAN is between 30% and 60% slower than C.

Here are similar results for shorter runs:

FORTRAN Mflops = 2.158E+03 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 8.683e+04 n = 1.600e+06 nthread = 32 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.112E+04 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.320E+03 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 1.347e+05 n = 8.000e+05 nthread = 64 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.227E+04 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.321E+03 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 1.393e+05 n = 4.000e+05 nthread = 128 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.216E+04 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.307E+03 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 1.591e+05 n = 2.000e+05 nthread = 256 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.235E+04 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.292E+03 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 1.612e+05 n = 1.000e+05 nthread = 512 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.230E+04 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.275E+03 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10

CUDA C Mflops = 1.812e+05 n = 5.000e+04 nthread = 1024 nblock = 256 tot = 1.311e+10

FORTRAN Mflops = 1.237E+04 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.245E+03 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10

CUDA C Mflops = 1.828e+05 n = 2.500e+04 nthread = 1024 nblock = 512 tot = 1.311e+10

FORTRAN Mflops = 1.219E+04 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10

FORTRAN Mflops = 2.190E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09

CUDA C Mflops = 1.777e+05 n = 1.250e+04 nthread = 1024 nblock = 512 tot = 6.554e+09

FORTRAN Mflops = 6.391E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09

On the GPU, C beats FORTRAN, sometimes by more than an order of magnitude. Same program, same compilation:

nvcc -O3 speed-c.cu -o speedc

pgfortran -ta=nvidia -O3 speed-transpose.CUF -o speedGPU

pgfortran -Mcuda=emu -O3 speed-transpose.CUF -o speedCPU

The only difference from the earlier runs: run times on the order of seconds rather than minutes.

Here is the C code:

```
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

#define N 10000

/* 64-bit linear congruential generator; returns the low bits as an int. */
__device__ int cong(long int *seed) {
    const long int a = 123456789;
    const long int b = 987654321;
    *seed = a * *seed + b;
    return (int) *seed;
}

__global__ void crank( long int *seed, int n ) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    long int seedl = seed[x];
#pragma unroll 4
    for (int i = 0; i < n; i++) {
        int r = cong( &seedl );
        //seedl = 123456789*seedl + 987654321;
    }
    seed[x] = seedl;
}

int main( int argc, char *argv[] ) {
    long int *seed;
    long int *d_seed;
    struct timeval tt1, tt2;
    int ms, n;
    float fms;
    int nblock, nthread;
    n = atoi(argv[1]);
    nthread = atoi(argv[2]);
    nblock = atoi(argv[3]);
    seed = (long int *) malloc(nthread * nblock * sizeof(long int));
    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&d_seed, nthread * nblock * sizeof(long int) ) );
    // fill the array 'seed' on the CPU
    for (int i = 0; i < nthread * nblock; i++) {
        seed[i] = i + 1;
    }
    gettimeofday( &tt1, NULL );
    // copy the array 'seed' to the GPU
    HANDLE_ERROR( cudaMemcpy( d_seed, seed, nthread * nblock * sizeof(long int),
                              cudaMemcpyHostToDevice ) );
    crank<<<nblock,nthread>>>( d_seed, n );
    // copy the array 'd_seed' back to the CPU
    HANDLE_ERROR( cudaMemcpy( seed, d_seed, nthread * nblock * sizeof(long int),
                              cudaMemcpyDeviceToHost ) );
    gettimeofday( &tt2, NULL );
    int k = 0;
    long int sum = 0;
    for (int i = 0; i < nblock; i++) {
        for (int j = 0; j < nthread; j++) {
            sum = sum + seed[k++];
            //if( k < 10 || k > nthread * nblock - 10) {
            //    printf("seed = %ld; sum = %ld\n", seed[k-1], sum);
            //}
        }
    }
    printf("sum = %ld\n", sum);
    ms = (tt2.tv_sec - tt1.tv_sec);
    ms = ms * 1000000 + (tt2.tv_usec - tt1.tv_usec);
    fms = ((float)ms) / 1000000.f;
    float tot = float(n) * float(nthread) * float(nblock);
    printf( "CUDA C time =%f seconds\n", fms );
    printf( "CUDA C Mflops =%10.3e n =%10.3e nthread =%6d nblock =%6d tot =%10.3e\n",
            tot/(1000000.f*fms), (float)n, nthread, nblock, tot );
    return 0;
}
```

and here is the FORTRAN code:

```
module d_sub
  use cudafor
  integer, parameter :: INTEGER8=8, INTEGER4=4
contains

  attributes(global) subroutine crank(seed,n,nthread,nblock)
    implicit none
    integer(INTEGER8) :: seed(nblock,nthread)
    integer(INTEGER8) :: seedl
    integer, value :: nthread,nblock
    integer(8), value :: n
    integer(INTEGER4) :: x
    integer(8) :: i,j
    seedl = seed(blockidx%x,threadidx%x)
    do i=1,n
    !do i=1,n/16
    !  do j=1,16
      x = cong(seedl)
    !  end do
    end do
    seed(blockidx%x,threadidx%x) = seedl
  end subroutine crank

  ! 64-bit linear congruential generator; returns the seed as a 32-bit integer
  attributes(device) function cong(seed)
    implicit none
    integer(INTEGER4) :: cong
    integer(INTEGER8) :: seed
    seed = 123456789_INTEGER8*seed + 987654321_INTEGER8
    ! cong = ishft(seed,-32)
    cong = seed
  end function cong

  subroutine callcrank(n,nthread,nblock)
    implicit none
    integer(INTEGER8), allocatable, device :: d_seed(:,:)
    integer(INTEGER8), allocatable :: seed(:,:)
    integer :: nthread,nblock,istat,i,j,k
    integer(8) :: n,s
    allocate(seed(nblock,nthread),d_seed(nblock,nthread),stat=istat)
    if (istat /= 0) stop 'allocation failed'
    k=0
    do i=1,nblock
      do j=1,nthread
        k=k+1
        seed(i,j)=k
      end do
    end do
    d_seed=seed
    call crank<<<nblock,nthread>>>(d_seed,n,nthread,nblock)
    seed=d_seed
    !s=0
    !k=0
    !do i=1,nblock
    !  do j=1,nthread
    !    s=s+seed(i,j)
    !    k=k+1
    !    if(k < 10 .or. k > nthread*nblock-10 ) write(*,*) s,seed(i,j)
    !  end do
    !end do
    write(*,*) 'sum = ',sum(seed)
  end subroutine callcrank

end module d_sub

program main
  use d_sub
  implicit none
  integer :: ticks,nthread,nblock,scale
  integer(8) :: n,t1,t2
  real :: time,tot
  character*20 :: input
  call getarg(1,input)
  read(input,*) n
  n=(n/16)*16
  call getarg(2,input)
  read(input,*) nthread
  call getarg(3,input)
  read(input,*) nblock
  call system_clock(t1)
  call callcrank(n,nthread,nblock)
  call system_clock(t2,ticks)
  time=real(t2-t1)/ticks
  write(*,'((a),f8.2,(a))') 'CUDA FORTRAN time',time,' sec'
  tot=real(n)*nthread*nblock
  write(*,1000) 'FORTRAN Mflops =',tot/(10**6*time),&
                ' n =',real(n),&
                ' nthread =',nthread,&
                ' nblock =',nblock,&
                ' tot =',tot
1000 format((a),es10.3,(a),es10.3,(a),i6,(a),i6,(a),es10.3)
end program main
```