A question on single and double precision performance calculation with CUDA cores

Hello,

I am working on developing a large GPU-enabled code using OpenACC in Fortran. Recently, we tested the performance of our code with single precision and double precision variables, and the results surprised us. We are using several NVIDIA RTX A5000 24 GB cards, and according to this post’s answer that links to the following GPU spec sheet, the ratio of single precision to double precision performance on CUDA cores is 64:1. We assumed this means that there are 64 times more single precision compute units than double precision units, and we expected a comparable difference in run time between a single precision problem and a double precision problem. However, what we found in practice is that the single precision version of the code is almost exactly 2 times faster than the double precision version. Our naive interpretation was that maybe the GPU is performing some sort of “double-single” arithmetic, as discussed in the last message of this post, and combining 2 single precision computing units into one double precision unit.

Below we provide a sample Fortran code that illustrates our observation:

module declare_arrays
   use openacc
   use cudafor
   implicit none   
   integer(8),parameter          :: prec = 8           
   real(prec),allocatable        :: test_array(:,:)
   
   contains

end module declare_arrays


module run_code   
   implicit none

   contains

   subroutine allocate_and_run_test_problem 
      use declare_arrays
      use openacc
      use cudafor
      implicit none
      integer(8) :: i, j      
      integer(8) :: N, M
      integer :: istat, devicenum
      integer(kind=cuda_count_kind) :: free_mem_gpu, total_mem_gpu
      real(8) :: t1, t2, t3, t4

      ! these numbers should produce around 8 Gb problem with single precision
      ! and around 16 Gb problem for double precision
      N = 10000000
      M = 210

      ! using GPU#0, which is NVIDIA RTX A5000 (24Gb)
      devicenum = 0

      call acc_set_device_num(devicenum,acc_device_nvidia)

      call cpu_time(t1)

      ! allocate the array
      allocate(test_array(N,M))

      ! initial value for all array elements
      test_array = 4_prec

      call cpu_time(t2)

      write(*,*) 'Allocated ',N*M*prec/(1024_prec*1024_prec), ' Mbytes of GPU memory'
      write(*,*) 'Time of allocation and init:', (t2 - t1), 'sec'
      write(*,*) '-----------------------------------------------------------'

      write(*,*) 'maxval of data before GPU run:',maxval(test_array)
      write(*,*) 'minval of data before GPU run:',minval(test_array)
      write(*,*) '-----------------------------------------------------------'
      

      ! monitoring occupied memory on GPU (for current device)
      write(*,*) '------ Measuring memory consumption on GPU (copyin) ------'
      write(*,'(A14,A14,A20,A16,A17)') '              ','free (before) ','  occupied (before) ',&
              ' free (after)   ','occupied (after)'

      ! this piece of code goes before the memory copy to GPU
      istat = cudaMemGetInfo(free_mem_gpu, total_mem_gpu)
      write(*,'(A11,I1,A2,F10.3,A7,F10.3,A7)',advance='no') '  - Device ',devicenum,': ', &
                  real(free_mem_gpu,prec)/1024_prec/1024_prec, ' Mb    ',&
                  real(total_mem_gpu-free_mem_gpu,prec)/1024_prec/1024_prec,' Mb    '

      !$acc enter data copyin(test_array)

      ! this piece goes after the memory copy to show how much memory is used on GPU
      istat = cudaMemGetInfo(free_mem_gpu, total_mem_gpu)
      write(*,'(F10.3,A7,F10.3,A3)',advance='yes') real(free_mem_gpu,prec)/1024_prec/1024_prec, ' Mb    ',&
                  real(total_mem_gpu-free_mem_gpu,prec)/1024_prec/1024_prec,' Mb'


      call cpu_time(t3)
      ! several GPU loops are given here to emulate the structure of the code
      ! in our large code

      !$acc parallel loop collapse(2) present(test_array)
      do i = 1, M
         do j = 1, N
            test_array(j,i) = (test_array(j,i) * 2_prec - 1_prec)**2_prec
         enddo
      enddo

      !$acc parallel loop collapse(2) present(test_array)
      do i = 1, M
         do j = 1, N
            test_array(j,i) = test_array(j,i) - 4_prec
         enddo
      enddo

      !$acc parallel loop collapse(2) present(test_array)
      do i = 1, M
         do j = 1, N
            test_array(j,i) = (test_array(j,i) * 3_prec) - (test_array(j,i)/2_prec)
         enddo
      enddo

      !$acc parallel loop collapse(2) present(test_array)
      do i = 1, M
         do j = 1, N
            test_array(j,i) = test_array(j,i) - 4_prec
         enddo
      enddo

      call cpu_time(t4)

      write(*,*) '-----------------------------------------------------------'
      write(*,*) 'Time of execution on GPU:', (t4 - t3), 'sec'

      !$acc exit data copyout(test_array)

      write(*,*) 'maxval of data after GPU run:',maxval(test_array)
      write(*,*) 'minval of data after GPU run:',minval(test_array)
      
      if(allocated(test_array)) deallocate(test_array)

      call cpu_time(t4)

      write(*,*) '==========================================================='
      write(*,*) 'Total execution time:', (t4 - t1), 'sec'      

   end subroutine allocate_and_run_test_problem
end module run_code


program test_pinned_mem
   use run_code
   implicit none

   call allocate_and_run_test_problem
   

end program

In the code, we change the prec parameter from 8 to 4 to switch from double precision to single precision. Our compiler is HPC SDK 24.1, and our OS is Rocky Linux 8.7. To compile and run, we use the following shell script:

RED='\033[0;34m' #
NC='\033[0m' #
rm small_program
echo -e "${RED}Start compilation using nvfortran.${NC}" #
nvfortran -fast -O3 -mp -cuda -acc -gpu=cc86,deepcopy,cuda12.3,lineinfo -Minfo=accel -cpp -Mlarge_arrays -Mbackslash  -o=small_program test_code.f90
echo -e "${RED}Compilation is finished. Now launching the file.${NC}" #
./small_program #
echo -e "${RED}Done running the file.${NC}"
rm small_program

The result of the execution is below:

Single precision run
user$ sh small_program_run.sh
rm: cannot remove 'small_program': No such file or directory
Start compilation using nvfortran.
allocate_and_run_test_problem:
     69, Generating enter data copyin(test_array(:,:))
     81, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         82, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         83,   ! blockidx%x threadidx%x collapsed
     88, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         89, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         90,   ! blockidx%x threadidx%x collapsed
     95, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         96, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         97,   ! blockidx%x threadidx%x collapsed
    102, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
        103, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        104,   ! blockidx%x threadidx%x collapsed
    114, Generating exit data copyout(test_array(:,:))
Compilation is finished. Now launching the file.
 Allocated                      8010  Mbytes of GPU memory
 Time of allocation and init:   0.9483220577239990      sec
 -----------------------------------------------------------
 maxval of data before GPU run:    4.000000
 minval of data before GPU run:    4.000000
 -----------------------------------------------------------
 ------ Measuring memory consumption on GPU (copyin) ------
              free (before)   occupied (before)  free (after)    occupied (after)
  - Device 0:  24005.188 Mb       235.563 Mb     15993.188 Mb      8247.563 Mb
 -----------------------------------------------------------
 Time of execution on GPU:   9.7426176071166992E-002 sec
 maxval of data after GPU run:    108.5000
 minval of data after GPU run:    108.5000
 ===========================================================
 Total execution time:    3.551086187362671      sec
Done running the file.

Double precision run
user$ sh small_program_run.sh
rm: cannot remove 'small_program': No such file or directory
Start compilation using nvfortran.
allocate_and_run_test_problem:
     69, Generating enter data copyin(test_array(:,:))
     81, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         82, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         83,   ! blockidx%x threadidx%x collapsed
     88, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         89, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         90,   ! blockidx%x threadidx%x collapsed
     95, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
         96, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         97,   ! blockidx%x threadidx%x collapsed
    102, Generating present(test_array(:,:))
         Generating NVIDIA GPU code
        103, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        104,   ! blockidx%x threadidx%x collapsed
    114, Generating exit data copyout(test_array(:,:))
Compilation is finished. Now launching the file.
 Allocated                     16021  Mbytes of GPU memory
 Time of allocation and init:    1.877160072326660      sec
 -----------------------------------------------------------
 maxval of data before GPU run:    4.000000000000000
 minval of data before GPU run:    4.000000000000000
 -----------------------------------------------------------
 ------ Measuring memory consumption on GPU (copyin) ------
              free (before)   occupied (before)  free (after)    occupied (after)
  - Device 0:  24005.188 Mb       235.563 Mb      7983.188 Mb     16257.563 Mb
 -----------------------------------------------------------
 Time of execution on GPU:   0.1975681781768799      sec
 maxval of data after GPU run:    108.5000000000000
 minval of data after GPU run:    108.5000000000000
 ===========================================================
 Total execution time:    7.411904096603394      sec
Done running the file.

If we divide the “Time of execution on GPU” for both presented cases, we will find that the single precision case is around 2 times faster. We also changed the array size by modifying the M variable, and found consistency at other array sizes as well.

The question is: why does the double precision code run only 2 times slower than the single precision code, and if our naive guess is correct, how can we invoke the dedicated double precision compute units instead of the single precision units?

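Your assumption that the 64:1 ratio means there are 64 times more single precision units than double precision units is correct.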

A comparable 64-fold difference in run time is unlikely, as even floating-point intensive code usually contains about 50% integer instructions, suggesting the performance difference will usually be smaller than the ratio of FP64 to FP32 execution units. Furthermore, the performance of the code may (in part or in whole) not be limited by computational throughput, but by memory throughput.

The factor of 2 you observe would suggest that code performance is entirely limited by memory throughput: cut the amount of data (measured in bytes) in half by using float, and performance doubles.

Double-single arithmetic will not happen unless you utilize a specialized library specifically for that purpose and adjust the application code accordingly. The CUDA compiler does not automagically swap out IEEE-754 standard compliant double computation for non-standard data representations and operations.

My advice would be to use the CUDA profiler to gain an understanding of the basic performance characteristics of the code in question. It would also be a good idea to perform a roofline analysis.


Dear njuffa,

Thank you for providing valuable insights! It indeed looks like a memory-related thing. Could you elaborate more on the memory throughput effect, though? My understanding is that since the data is copied to the device and is processed in a contiguous manner, it should yield the best possible memory performance. So what could be the mechanism of the memory throughput being a bottleneck in this case (ideally, in the code example provided)? If you are able to point to some references with more detail, it would be helpful as well. Thank you!

Your loops appear to do very little computation per data item moved between GPU memory and GPU execution resources, or expressed differently: the compute intensity of this code is very low.

The bandwidth of the RTX A5000 is listed as 768 GB/sec in the TechPowerUp database. If we were to add two double vectors to produce a third (result) vector, and assume perfect efficiency of memory access (realistically: 90%) we would be able to generate (768e9 bytes/sec) / (24 bytes moved / result element) = 32e9 result elements per second. This would require 32e9/sec double additions, while the RTX A5000 is capable of a throughput of 217e9/sec double additions.

If you perform the equivalent computations for your various test loops, none of which appear to comprise more than two FP64 instructions in the loop body, you will find that although the RTX A5000 can process only 217e9 FP64 instructions per second (which in the case of 100% DFMA instructions equates to 434e9 double-precision FLOPS, as FMA = 2 FLOPs), this exceeds the rate at which the required data can be transported to and from the GPU execution units. This means the code is completely memory bound.
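
As a rough illustration of that calculation, here is a minimal sketch in plain Fortran (it runs on the host and needs no GPU). It takes the array size from the test program, the 768 GB/sec figure quoted above, and the two "Time of execution on GPU" values from the logs, rounded. The assumption that each loop moves exactly one read and one write per array element is mine and not something the profiler has confirmed, and the program and variable names are made up for this example.

```fortran
program memory_bound_estimate
   implicit none
   integer, parameter  :: dp = kind(1.0d0)
   ! Numbers quoted in this thread
   real(dp), parameter :: peak_bw = 768.0e9_dp          ! listed bandwidth, bytes/sec
   real(dp), parameter :: n_elem  = 1.0e7_dp * 210.0_dp ! N*M array elements
   integer, parameter  :: n_loops = 4                   ! GPU loops in the test
   integer, parameter  :: elem_bytes(2) = [4, 8]        ! real(4), real(8)
   ! "Time of execution on GPU" from the two logs, rounded
   real(dp), parameter :: t_meas(2) = [0.0974_dp, 0.1976_dp]
   real(dp) :: bytes_moved, t_min
   integer  :: k

   do k = 1, 2
      ! ASSUMPTION: one read plus one write per element per loop
      bytes_moved = real(n_loops, dp) * 2.0_dp * n_elem * real(elem_bytes(k), dp)
      ! Shortest possible run time if memory traffic is the only limit
      t_min = bytes_moved / peak_bw
      write(*,'(A,I1,A,F7.4,A,F7.4,A,F5.1,A)') 'real(', elem_bytes(k), &
            '): bandwidth-limited time ', t_min, ' s, measured ', t_meas(k), &
            ' s, ratio ', 100.0_dp * t_min / t_meas(k), ' %'
   end do
end program memory_bound_estimate
```

Under these assumptions the bandwidth-limited times come out near 0.0875 sec for single precision and 0.175 sec for double precision, so both measured times sit at roughly 90% of the listed bandwidth. The factor of 2 between the runs is then simply the factor of 2 in bytes moved. A profiler run would be needed to confirm the actual memory traffic.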

I reiterate my previous advice to (1) use the CUDA profiler and (2) construct a roofline model of the performance.

Dear njuffa,

Thank you for providing detailed explanations. I hope you don’t mind answering one final question: let’s say, we use mixed precision data on the device for some computation, namely, single precision and double precision floating point. What type of compute unit is invoked in executing such mixed data? Single precision unit or double precision unit?

By the type-conversion rules of each programming language, regular floating-point operations can be identified as either a single-precision operation or a double-precision operation. Single-precision (FP32) operations are handled by FP32 units, double-precision (FP64) operations are handled by FP64 units. On modern GPUs these are physically separate units.

Specialized instructions added in recent years that can accumulate products of lower-precision data into higher-precision sums are handled by specialized units called tensor cores. It is likely that you would be using tensor cores either indirectly through NVIDIA-provided libraries or directly with the wmma class or PTX-level programming. I have no personal interest in and thus no experience with targeting tensor cores and thus cannot provide further advice, but NVIDIA provides ample documentation and examples which should allow you to get started.

Without Tensor Cores, the single precision data would be converted to double precision first and then double precision operations would be invoked.
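
To make the type-conversion rule concrete, here is a small host-side Fortran sketch (the program and variable names are made up for illustration). It only checks what the language rules say about the kind of a mixed expression. I am assuming, without having inspected the generated GPU code, that the same rules decide which instructions the compiler emits for device loops.

```fortran
program mixed_precision_kind
   implicit none
   integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
   real(sp) :: a
   real(dp) :: b

   a = 0.1_sp
   b = 0.1_dp

   ! Both operands single precision: the product is single precision
   write(*,*) 'kind(a*a) == sp: ', kind(a*a) == sp
   ! Mixed operands: a is converted to double precision first,
   ! so the multiply itself is a double precision operation
   write(*,*) 'kind(a*b) == dp: ', kind(a*b) == dp
   ! The conversion does not add accuracy: a still carries the
   ! single precision rounding error of 0.1
   write(*,*) 'real(a,dp) == b: ', real(a, dp) == b
end program mixed_precision_kind
```

This should print T, T and F. The practical point is that a single double precision operand or literal in an expression turns the whole operation into an FP64 one, so a code intended to run in single precision needs consistent kinds throughout.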
