erratic results, but C always beats FORTRAN

I’m continuing to run silly test programs to get a feel for what flies and what does not. The results are still utterly confusing. I’m starting to doubt everything, including myself and the timer, but if I am making a mistake, I can’t find it.

The C and FORTRAN programs perform the same computations and indeed produce the same results, but the run times would not make a good commercial for CUDA FORTRAN. [Bad news for me is, of course, also a distinct possibility :-)]

Here are some results for relatively long runs:

CUDA C Mflops = 1.779e+05 n = 5.000e+06 nthread = 1024 nblock = 128 tot = 6.554e+11
FORTRAN Mflops = 1.087E+05 n = 5.000E+06 nthread = 1024 nblock = 128 tot = 6.554E+11
FORTRAN Mflops = 2.376E+03 n = 5.000E+05 nthread = 1024 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.764e+05 n = 5.000e+07 nthread = 1024 nblock = 128 tot = 6.554e+12
FORTRAN Mflops = 1.280E+05 n = 5.000E+07 nthread = 1024 nblock = 128 tot = 6.554E+12
FORTRAN Mflops = 2.407E+03 n = 5.000E+06 nthread = 1024 nblock = 128 tot = 6.554E+11

CUDA C Mflops = 1.630e+05 n = 1.000e+07 nthread = 512 nblock = 128 tot = 6.554e+11
FORTRAN Mflops = 1.002E+05 n = 1.000E+07 nthread = 512 nblock = 128 tot = 6.554E+11
FORTRAN Mflops = 2.405E+03 n = 1.000E+06 nthread = 512 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.565e+05 n = 1.000e+08 nthread = 512 nblock = 128 tot = 6.554e+12
FORTRAN Mflops = 1.144E+05 n = 1.000E+08 nthread = 512 nblock = 128 tot = 6.554E+12
FORTRAN Mflops = 2.392E+03 n = 1.000E+07 nthread = 512 nblock = 128 tot = 6.554E+11

CUDA C Mflops = 1.412e+05 n = 2.000e+07 nthread = 256 nblock = 128 tot = 6.554e+11
FORTRAN Mflops = 9.051E+04 n = 2.000E+07 nthread = 256 nblock = 128 tot = 6.554E+11
FORTRAN Mflops = 2.383E+03 n = 2.000E+06 nthread = 256 nblock = 128 tot = 6.554E+10

CUDA C Mflops = 1.402e+05 n = 2.000e+08 nthread = 256 nblock = 128 tot = 6.554e+12
FORTRAN Mflops = 1.037E+05 n = 2.000E+08 nthread = 256 nblock = 128 tot = 6.554E+12
FORTRAN Mflops = 2.388E+03 n = 2.000E+07 nthread = 256 nblock = 128 tot = 6.554E+11

The main result is Mflops, the number (in units of 1,000,000) of floating-point operations per second. Each run is done three times: the first one is C on the GPU; the other two are FORTRAN: number 2 on the GPU and number 3 in emulation mode on a CPU. For these runs, at least, the GPU produces a real speed-up of almost two orders of magnitude. FORTRAN is between 30% and 60% slower than C.

Here are similar results for shorter runs:

FORTRAN Mflops = 2.158E+03 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 8.683e+04 n = 1.600e+06 nthread = 32 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.112E+04 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.320E+03 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.347e+05 n = 8.000e+05 nthread = 64 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.227E+04 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.321E+03 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.393e+05 n = 4.000e+05 nthread = 128 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.216E+04 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.307E+03 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.591e+05 n = 2.000e+05 nthread = 256 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.235E+04 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.292E+03 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.612e+05 n = 1.000e+05 nthread = 512 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.230E+04 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.275E+03 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.812e+05 n = 5.000e+04 nthread = 1024 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.237E+04 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.245E+03 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10
CUDA C Mflops = 1.828e+05 n = 2.500e+04 nthread = 1024 nblock = 512 tot = 1.311e+10
FORTRAN Mflops = 1.219E+04 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10

FORTRAN Mflops = 2.190E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09
CUDA C Mflops = 1.777e+05 n = 1.250e+04 nthread = 1024 nblock = 512 tot = 6.554e+09
FORTRAN Mflops = 6.391E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09

On the GPU, C sometimes beats FORTRAN by close to two orders of magnitude. Same program, same compilation:

nvcc -O3 speed-c.cu -o speedc
pgfortran -ta=nvidia -O3 speed-transpose.CUF -o speedGPU
pgfortran -Mcuda=emu -O3 speed-transpose.CUF -o speedCPU

Only difference: run times on the order of minutes rather than seconds.

Here is the C code

#include <stdio.h>
#include <sys/time.h>
static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
  if (err != cudaSuccess) {
    printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
            file, line );
    exit( EXIT_FAILURE );
  }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

#define N   10000
__device__ int cong(long int *seed) {
  const long int a = 123456789;
  const long int b = 987654321;
  *seed = a * *seed + b;
    return (int) *seed;  /* return the updated seed value, not the pointer */
}

__global__ void crank( long int *seed, int n) {
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  long int seedl = seed[x];
#pragma unroll 4
  for(int i=0; i<n; i++) {
    int r= cong( &seedl );
    //seedl=123456789*seedl+987654321;
  }
  seed[x] = seedl;
}

int main( int argc, char *argv[] ) {
  long int *seed;
  long int *d_seed;
  struct timeval tt1, tt2;
  int ms,n;
  float fms;
  int nblock,nthread;
  n = atoi(argv[1]);
  nthread = atoi(argv[2]);
  nblock = atoi(argv[3]);
  seed = (long int *) malloc(nthread*nblock * sizeof(long int));
  // allocate the memory on the GPU
  HANDLE_ERROR( cudaMalloc( (void**)&d_seed, nthread*nblock * sizeof(long int) ) );
  // fill the array 'seed' on the CPU
  for (int i=0; i< nthread * nblock; i++) {
    seed[i] = i+1;
  }
  gettimeofday( &tt1, NULL );
  // copy the array 'seed' to the GPU
  HANDLE_ERROR( cudaMemcpy( d_seed, seed, nthread * nblock * sizeof(long int),
                            cudaMemcpyHostToDevice ) );
  crank<<<nblock,nthread>>>( d_seed, n );

  // copy the array 'd_seed' to CPU
  HANDLE_ERROR( cudaMemcpy( seed, d_seed, nthread * nblock * sizeof(long int),
                            cudaMemcpyDeviceToHost ) );
  gettimeofday( &tt2, NULL );
  int k=0;
  long int sum=0;
  for (int i=0; i < nblock; i++) {
    for (int j=0; j < nthread; j++) {
      sum = sum + seed[k++];
      //if( k < 10 || k > nthread * nblock - 10) {
      //printf("seed = %ld; sum = %ld\n", seed[k-1], sum);
      //}
    }
  }
  printf("sum = %ld\n", sum);
  ms = (tt2.tv_sec - tt1.tv_sec);
  ms = ms * 1000000 + (tt2.tv_usec - tt1.tv_usec);
  fms = ((float)ms)/1000000.f;
  float tot = float(n) * float(nthread) * float(nblock);
  printf( "CUDA C time =%f seconds\n", fms );
  printf( "CUDA C  Mflops =%10.3e n =%10.3e nthread =%6d nblock =%6d tot =%10.3e\n",
          tot/(1000000.f*fms),(float)n,nthread,nblock,tot);
  return 0;
}

and here the FORTRAN code

module d_sub
  use cudafor
  integer, parameter :: INTEGER8=8,INTEGER4=4
contains
  attributes(global) subroutine crank(seed,n,nthread,nblock)
   implicit none
    integer(INTEGER8) :: seed(nblock,nthread)
    integer(INTEGER8) :: seedl
    integer, value :: nthread,nblock
    integer(8), value :: n
    integer(INTEGER4) :: x
    integer(8) :: i,j
    seedl=seed(blockidx%x,threadidx%x)
    do i=1,n
       !do i=1,n/16
       !  do j=1,16
       x=cong(seedl)
       !  end do
    end do
    seed(blockidx%x,threadidx%x)=seedl
  end subroutine crank
  attributes(device) function cong(seed)
    implicit none
    integer(INTEGER4) :: cong
    integer(INTEGER8) :: seed
    seed=123456789_INTEGER8*seed+987654321_INTEGER8
    ! cong=ishft(seed,-32)
    cong=seed
  end function cong

  subroutine callcrank(n,nthread,nblock)
    implicit none
    integer(INTEGER8), allocatable, device :: d_seed(:,:)
    integer(INTEGER8),allocatable :: seed(:,:)
    integer :: nthread,nblock,istat,i,j,k
    integer(8) :: n,s
    allocate(seed(nblock,nthread),d_seed(nblock,nthread),stat=istat)
    if(istat /= 0) stop 'allocation failed'
    k=0
    do i=1,nblock
       do j=1,nthread
          k=k+1
          seed(i,j)=k
       end do
    end do
    d_seed=seed
    call crank<<<nblock,nthread>>>(d_seed,n,nthread,nblock)
    seed=d_seed
    !s=0
    !k=0
    !do i=1,nblock
    !   do j=1,nthread
    !      s=s+seed(i,j)
    !      k=k+1
    !      if(k < 10 .or. k > nthread*nblock-10 ) write(*,*) s,seed(i,j)
    !   end do
    !end do
    write(*,*) 'sum = ',sum(seed)
  end subroutine callcrank
end module d_sub

program main
 use d_sub
 implicit none
 integer :: ticks,nthread,nblock,scale
 integer(8) :: n,t1,t2
 real :: time,tot
 character*20 :: input
 call getarg(1,input)
 read(input,*) n
 n=(n/16)*16
 call getarg(2,input)
 read(input,*) nthread
 call getarg(3,input)
 read(input,*) nblock
 call system_clock(t1)
 call callcrank(n,nthread,nblock)
 call system_clock(t2,ticks)
 time=real(t2-t1)/ticks
 write(*,'((a),f8.2,(a))') 'CUDA FORTRAN time',time,' sec'
 tot=real(n)*nthread*nblock
 write(*,1000) 'FORTRAN Mflops =',tot/(10**6*time),&
      ' n =',real(n),&
      ' nthread =',nthread,&
      ' nblock =',nblock,&
      ' tot =',tot
1000 format((a),es10.3,(a),es10.3,(a),i6,(a),i6,(a),es10.3)
end program main

Hi Peter,

While there are a few differences between the C and Fortran versions, the one that I’ve found so far that has an impact on performance is the declaration of ‘n’, ‘i’, and ‘j’. In the C version you have these declared as int, while the Fortran version uses integer*8. Changing these to integer*4 will help the Fortran time by ~15%.

I show there’s still about a 9% difference after this change, but I’m out of time for today, so this will need to wait until next week.

  • Mat

Mat,

Thanks for pointing out those differences. I changed those variables to integer*4 in the FORTRAN version and reran the short runs. Did you get a 15% increase in speed on the long ones? For me, nothing of significance has changed … The orders of magnitude difference in performance survive. Here are the new timings. They may not measure what I think they measure, but the good news is that they are reproducible, if nothing else.

FORTRAN Mflops = 2.173E+03 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 8.683e+04 n = 1.600e+06 nthread = 32 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.186E+04 n = 1.600E+06 nthread = 32 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.317E+03 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.338e+05 n = 8.000e+05 nthread = 64 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.227E+04 n = 8.000E+05 nthread = 64 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.322E+03 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.416e+05 n = 4.000e+05 nthread = 128 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.228E+04 n = 4.000E+05 nthread = 128 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.304E+03 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.559e+05 n = 2.000e+05 nthread = 256 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.258E+04 n = 2.000E+05 nthread = 256 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.292E+03 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.617e+05 n = 1.000e+05 nthread = 512 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.248E+04 n = 1.000E+05 nthread = 512 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.276E+03 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10
CUDA C Mflops = 1.809e+05 n = 5.000e+04 nthread = 1024 nblock = 256 tot = 1.311e+10
FORTRAN Mflops = 1.221E+04 n = 5.000E+04 nthread = 1024 nblock = 256 tot = 1.311E+10

FORTRAN Mflops = 2.216E+03 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10
CUDA C Mflops = 1.825e+05 n = 2.500e+04 nthread = 1024 nblock = 512 tot = 1.311e+10
FORTRAN Mflops = 1.233E+04 n = 2.499E+04 nthread = 1024 nblock = 512 tot = 1.310E+10

FORTRAN Mflops = 2.191E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09
CUDA C Mflops = 1.779e+05 n = 1.250e+04 nthread = 1024 nblock = 512 tot = 6.554e+09
FORTRAN Mflops = 6.391E+03 n = 1.250E+04 nthread = 1024 nblock = 512 tot = 6.552E+09

Hi Peter,

I think I understand what’s happening. On Linux, the OS powers down the GPU driver when it is not in use. Warming up takes about 1 second per device and occurs upon first use of the device. In both versions, this cost is incurred when “d_seed” is allocated. However, in the CUDA C version, the cudaMalloc isn’t timed, so the initialization cost isn’t being included.

I modified your CUDA C program so that the cudaMalloc and the sum are included in the timing (see below), and I get the following times. Note that I have 4 attached devices.

FORTRAN Mflops = 2.718E+03 n = 1.600E+06 nthread =    32 nblock =   256 tot = 1.311E+10
CUDA C  Mflops = 2.741e+03 n = 1.600e+06 nthread =    32 nblock =   256 tot = 1.311e+10

Next, I reran the codes with the ‘pgcudainit’ utility running in the background. ‘pgcudainit’ keeps the devices open so that the initialization costs aren’t incurred.

FORTRAN Mflops = 2.348E+04 n = 1.600E+06 nthread =    32 nblock =   256 tot = 1.311E+10
CUDA C  Mflops = 2.664e+04 n = 1.600e+06 nthread =    32 nblock =   256 tot = 1.311e+10

So CUDA C is still a bit faster but not to the same degree as before.

  • Mat
#include <stdio.h>
#include <sys/time.h>
static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
  if (err != cudaSuccess) {
    printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
            file, line );
    exit( EXIT_FAILURE );
  }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

#define N   10000
__device__ int cong(long int *seed) {
  const long int a = 123456789;
  const long int b = 987654321;
  *seed = (a*(*seed)) + b;
    return (int) *seed;  /* return the updated seed value, not the pointer */
}

__global__ void crank( long int *seed, int n) {
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  long int seedl = seed[x];
#pragma unroll 4
  for(int i=0; i<n; i++) {
    int r= cong( &seedl );
    //seedl=123456789*seedl+987654321;
  }
  seed[x] = seedl;
}

int main( int argc, char *argv[] ) {
  long int *seed;
  long int *d_seed;
  struct timeval tt1, tt2;
  int ms,n;
  float fms;
  int nblock,nthread;
  n = atoi(argv[1]);
  nthread = atoi(argv[2]);
  nblock = atoi(argv[3]);
  seed = (long int *) malloc(nthread*nblock * sizeof(long int));
  // allocate the memory on the GPU
  // fill the array 'seed' on the CPU
  for (int i=0; i< nthread * nblock; i++) {
    seed[i] = i+1;
  }
  gettimeofday( &tt1, NULL );
  HANDLE_ERROR( cudaMalloc( (void**)&d_seed, nthread*nblock * sizeof(long int) ) );
  // copy the array 'seed' to the GPU
  HANDLE_ERROR( cudaMemcpy( d_seed, seed, nthread * nblock * sizeof(long int),
                            cudaMemcpyHostToDevice ) );
  crank<<<nblock,nthread>>>( d_seed, n );

  // copy the array 'd_seed' to CPU
  HANDLE_ERROR( cudaMemcpy( seed, d_seed, nthread * nblock * sizeof(long int),
                            cudaMemcpyDeviceToHost ) );
  int k=0;
  long int sum=0;
  for (int i=0; i < nblock; i++) {
    for (int j=0; j < nthread; j++) {
      sum = sum + seed[k++];
      //if( k < 10 || k > nthread * nblock - 10) {
      //printf("seed = %ld; sum = %ld\n", seed[k-1], sum);
      //}
    }
  }
  printf("sum = %ld\n", sum);
  gettimeofday( &tt2, NULL );
  ms = (tt2.tv_sec - tt1.tv_sec);
  ms = ms * 1000000 + (tt2.tv_usec - tt1.tv_usec);
  fms = ((float)ms)/1000000.f;
  float tot = float(n) * float(nthread) * float(nblock);
  printf( "CUDA C time =%f seconds\n", fms );
  printf( "CUDA C  Mflops =%10.3e n =%10.3e nthread =%6d nblock =%6d tot =%10.3e\n",
          tot/(1000000.f*fms),(float)n,nthread,nblock,tot);
  return 0;
}

Mat,

Your latest version of the program indeed no longer produces these wild differences in behavior of C and FORTRAN. Whether I run pgcudainit or not makes no difference. Indeed, it should not, because root runs “nvidia-smi -l --interval=59 -f /var/log/nvidia-smi.log” as soon as the system starts up, which should have the same effect.

My results seem to confirm that cudaMalloc is the source of the problem, as you say. However, it seems to be misbehaving under all circumstances; it always causes big delays.

In addition, the GPU apparently needs to be doing a lot of work before it reaches its potential, as indicated by the difference in speed-up between the long and short runs. I find that surprising, because the short runs (on the order of a second) are not all that short, or so I had thought, apparently incorrectly.

Thanks,
Peter.

Mat,

There still is an outstanding issue: why does allocation of device memory produce such enormous delays? pgcudainit does not solve this problem; in fact, it has no effect whatsoever on our Red Hat Linux system, probably because root at all times runs “nvidia-smi -l --interval=59 -f /var/log/nvidia-smi.log.”

I also notice that if I do two short runs allocating the same amount of device memory, the first one systematically takes half a second longer. Maybe this is because I do not deallocate the memory at the end of the run. Why would I? I’m assuming, perhaps mistakenly, that one can have memory leaks within one run, but not between two. BTW, I do not know how long-lived this half-second effect is.

What’s going on here?

Peter.

Hi Peter,

> I also notice that if I do two short runs allocating the same amount of device memory, the first one systematically takes half a second longer.

Is this also the first time the kernel is called? There is an init cost the first time a kernel is launched, since the kernel itself needs to be copied over to the device. Though half a second seems a bit excessive; I usually see an init time of 0.1 to 0.3 seconds.

  • Mat

Mat,

I have not had time to test all possible variations and permutations. I just notice that if I run the same Fortran program twice in a row, the first run takes 0.5 seconds longer --yes, really that much!-- than the next one. I probably blamed data-memory allocation prematurely, because that was what we were talking about. Copying the code of the kernel to the device is, of course, some form of memory allocation that has to be taken into account too. Whatever it is, it ain’t pretty.

Actually, what “the first time a kernel is launched” really means is an interesting question too. In a big application that uses distributed resources, there may be many such “first” launches, simply because a previous kernel overwrote program and data memory on the device.

In my current example, I’m using NVIDIA’s CURAND in a program that mixes C and Fortran; it uses CURAND only for initialization, and I created my own version of the random number generator, because I would not know how one could inline C in a Fortran kernel. Therefore, in this example, there are two kernel launches. Maybe that explains the difference between your 0.3 seconds and my 0.5 seconds of delay?

Unfortunately, there are more issues than I have time to address right now.

Peter.

Hi Peter,

I’m in the same boat; too much going on. Another application engineer sent me a general parallel linear recurrence algorithm that I’m hoping I can port to the GPU to help speed up your code. Sorry, I just haven’t had time to get back to it.

  • Mat