Hello,
I am working on a large GPU-enabled code that uses OpenACC in Fortran. We recently compared the performance of our code in single precision and in double precision, and the results surprised us. We are using several NVIDIA RTX A5000 24 GB cards. According to this post’s answer, which links to the following GPU spec sheet, the ratio of single precision to double precision performance on the CUDA cores is 64:1. We took this to mean that there are 64 times as many single precision compute units as double precision units, so we expected a comparable difference in run time between a single precision problem and a double precision problem. In practice, however, the single precision version of the code is almost exactly 2 times faster than the double precision version. Our naive interpretation is that the GPU may be performing some sort of “double-single” arithmetic, as discussed in the last message of this post, and combining 2 single precision compute units into one double precision unit.
Below we provide a sample Fortran code that illustrates our observation:
module declare_arrays
  use openacc
  use cudafor
  implicit none
  integer(8),parameter :: prec = 8
  real(prec),allocatable :: test_array(:,:)
contains
end module declare_arrays
module run_code
  implicit none
contains
  subroutine allocate_and_run_test_problem
    use declare_arrays
    use openacc
    use cudafor
    implicit none
    integer(8) :: i, j
    integer(8) :: N, M
    integer :: istat, devicenum
    integer(kind=cuda_count_kind) :: free_mem_gpu, total_mem_gpu
    real(8) :: t1, t2, t3, t4
    ! these numbers should produce around 8 Gb problem with single precision
    ! and around 16 Gb problem for double precision
    N = 10000000
    M = 210
    ! using GPU#0, which is NVIDIA RTX A5000 (24Gb)
    devicenum = 0
    call acc_set_device_num(devicenum,acc_device_nvidia)
    call cpu_time(t1)
    ! allocate the array
    allocate(test_array(N,M))
    ! initial value for all array elements
    test_array = 4_prec
    call cpu_time(t2)
    write(*,*) 'Allocated ',N*M*prec/(1024_prec*1024_prec), ' Mbytes of GPU memory'
    write(*,*) 'Time of allocation and init:', (t2 - t1), 'sec'
    write(*,*) '-----------------------------------------------------------'
    write(*,*) 'maxval of data before GPU run:',maxval(test_array)
    write(*,*) 'minval of data before GPU run:',minval(test_array)
    write(*,*) '-----------------------------------------------------------'
    ! monitoring occupied memory on GPU (for current device)
    write(*,*) '------ Measuring memory consumption on GPU (copyin) ------'
    write(*,'(A14,A14,A20,A16,A17)') ' ','free (before) ',' occupied (before) ',&
                                     ' free (after) ','occupied (after)'
    ! this piece of code goes before the memory copy to GPU
    istat = cudaMemGetInfo(free_mem_gpu, total_mem_gpu)
    write(*,'(A11,I1,A2,F10.3,A7,F10.3,A7)',advance='no') ' - Device ',devicenum,': ', &
          real(free_mem_gpu,prec)/1024_prec/1024_prec, ' Mb ',&
          real(total_mem_gpu-free_mem_gpu,prec)/1024_prec/1024_prec,' Mb '
    !$acc enter data copyin(test_array)
    ! this piece goes after the memory copy to show how much memory is used on GPU
    istat = cudaMemGetInfo(free_mem_gpu, total_mem_gpu)
    write(*,'(F10.3,A7,F10.3,A3)',advance='yes') real(free_mem_gpu,prec)/1024_prec/1024_prec, ' Mb ',&
          real(total_mem_gpu-free_mem_gpu,prec)/1024_prec/1024_prec,' Mb'
    call cpu_time(t3)
    ! several GPU loops are given here to emulate the structure of the code
    ! in our large code
    !$acc parallel loop collapse(2) present(test_array)
    do i = 1, M
      do j = 1, N
        test_array(j,i) = (test_array(j,i) * 2_prec - 1_prec)**2_prec
      enddo
    enddo
    !$acc parallel loop collapse(2) present(test_array)
    do i = 1, M
      do j = 1, N
        test_array(j,i) = test_array(j,i) - 4_prec
      enddo
    enddo
    !$acc parallel loop collapse(2) present(test_array)
    do i = 1, M
      do j = 1, N
        test_array(j,i) = (test_array(j,i) * 3_prec) - (test_array(j,i)/2_prec)
      enddo
    enddo
    !$acc parallel loop collapse(2) present(test_array)
    do i = 1, M
      do j = 1, N
        test_array(j,i) = test_array(j,i) - 4_prec
      enddo
    enddo
    call cpu_time(t4)
    write(*,*) '-----------------------------------------------------------'
    write(*,*) 'Time of execution on GPU:', (t4 - t3), 'sec'
    !$acc exit data copyout(test_array)
    write(*,*) 'maxval of data after GPU run:',maxval(test_array)
    write(*,*) 'minval of data after GPU run:',minval(test_array)
    if(allocated(test_array)) deallocate(test_array)
    call cpu_time(t4)
    write(*,*) '==========================================================='
    write(*,*) 'Total execution time:', (t4 - t1), 'sec'
  end subroutine allocate_and_run_test_problem
end module run_code
program test_pinned_mem
  use run_code
  implicit none
  call allocate_and_run_test_problem
end program
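As a quick sanity check on the sizes quoted in the code comments, here is the arithmetic behind the "Allocated ... Mbytes" line. It assumes 1 Mbyte = 1024² bytes, as in the write statement, with prec = 4 bytes per element for single precision and prec = 8 for double:

```latex
\[
\begin{aligned}
\text{single precision:}\quad
  \frac{N \cdot M \cdot 4}{1024^{2}}
  &= \frac{10^{7} \times 210 \times 4}{1\,048\,576}
  \approx 8010.9\ \text{Mbytes} \\[4pt]
\text{double precision:}\quad
  \frac{N \cdot M \cdot 8}{1024^{2}}
  &= \frac{10^{7} \times 210 \times 8}{1\,048\,576}
  \approx 16021.7\ \text{Mbytes}
\end{aligned}
\]
```

Because the expression in the write statement uses integer division, the program prints the truncated values 8010 and 16021. These agree to within about 1 Mbyte with the change in occupied GPU memory that cudaMemGetInfo reports in the run logs (8012 and 16022 Mb).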
In the code, we change the prec parameter from 8 to 4 to switch from double precision to single precision. Our compiler is HPC SDK 24.1, and our OS is Rocky Linux 8.7. To compile and run, we use the following shell script:
RED='\033[0;34m' #
NC='\033[0m' #
rm small_program
echo -e "${RED}Start compilation using nvfortran.${NC}" #
nvfortran -fast -O3 -mp -cuda -acc -gpu=cc86,deepcopy,cuda12.3,lineinfo -Minfo=accel -cpp -Mlarge_arrays -Mbackslash -o=small_program test_code.f90
echo -e "${RED}Compilation is finished. Now launching the file.${NC}" #
./small_program #
echo -e "${RED}Done running the file.${NC}"
rm small_program
The result of the execution is below:
Single precision run
user$ sh small_program_run.sh
rm: cannot remove 'small_program': No such file or directory
Start compilation using nvfortran.
allocate_and_run_test_problem:
69, Generating enter data copyin(test_array(:,:))
81, Generating present(test_array(:,:))
Generating NVIDIA GPU code
82, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
83, ! blockidx%x threadidx%x collapsed
88, Generating present(test_array(:,:))
Generating NVIDIA GPU code
89, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
90, ! blockidx%x threadidx%x collapsed
95, Generating present(test_array(:,:))
Generating NVIDIA GPU code
96, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
97, ! blockidx%x threadidx%x collapsed
102, Generating present(test_array(:,:))
Generating NVIDIA GPU code
103, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
104, ! blockidx%x threadidx%x collapsed
114, Generating exit data copyout(test_array(:,:))
Compilation is finished. Now launching the file.
Allocated 8010 Mbytes of GPU memory
Time of allocation and init: 0.9483220577239990 sec
-----------------------------------------------------------
maxval of data before GPU run: 4.000000
minval of data before GPU run: 4.000000
-----------------------------------------------------------
------ Measuring memory consumption on GPU (copyin) ------
free (before) occupied (before) free (after) occupied (after)
- Device 0: 24005.188 Mb 235.563 Mb 15993.188 Mb 8247.563 Mb
-----------------------------------------------------------
Time of execution on GPU: 9.7426176071166992E-002 sec
maxval of data after GPU run: 108.5000
minval of data after GPU run: 108.5000
===========================================================
Total execution time: 3.551086187362671 sec
Done running the file.
Double precision run
user$ sh small_program_run.sh
rm: cannot remove 'small_program': No such file or directory
Start compilation using nvfortran.
allocate_and_run_test_problem:
69, Generating enter data copyin(test_array(:,:))
81, Generating present(test_array(:,:))
Generating NVIDIA GPU code
82, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
83, ! blockidx%x threadidx%x collapsed
88, Generating present(test_array(:,:))
Generating NVIDIA GPU code
89, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
90, ! blockidx%x threadidx%x collapsed
95, Generating present(test_array(:,:))
Generating NVIDIA GPU code
96, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
97, ! blockidx%x threadidx%x collapsed
102, Generating present(test_array(:,:))
Generating NVIDIA GPU code
103, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
104, ! blockidx%x threadidx%x collapsed
114, Generating exit data copyout(test_array(:,:))
Compilation is finished. Now launching the file.
Allocated 16021 Mbytes of GPU memory
Time of allocation and init: 1.877160072326660 sec
-----------------------------------------------------------
maxval of data before GPU run: 4.000000000000000
minval of data before GPU run: 4.000000000000000
-----------------------------------------------------------
------ Measuring memory consumption on GPU (copyin) ------
free (before) occupied (before) free (after) occupied (after)
- Device 0: 24005.188 Mb 235.563 Mb 7983.188 Mb 16257.563 Mb
-----------------------------------------------------------
Time of execution on GPU: 0.1975681781768799 sec
maxval of data after GPU run: 108.5000000000000
minval of data after GPU run: 108.5000000000000
===========================================================
Total execution time: 7.411904096603394 sec
Done running the file.
If we divide the “Time of execution on GPU” for both presented cases, we will find that the single precision case is around 2 times faster. We also changed the array size by modifying the M variable, and found consistency at other array sizes as well.
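For reference, these are the ratios implied by the two logs above. This is plain arithmetic on the printed timings, rounded to two decimals:

```latex
\[
\begin{aligned}
\frac{t_{\text{GPU}}^{\text{double}}}{t_{\text{GPU}}^{\text{single}}}
  &= \frac{0.19757}{0.097426} \approx 2.03 \\[4pt]
\frac{t_{\text{alloc+init}}^{\text{double}}}{t_{\text{alloc+init}}^{\text{single}}}
  &= \frac{1.87716}{0.948322} \approx 1.98 \\[4pt]
\frac{t_{\text{total}}^{\text{double}}}{t_{\text{total}}^{\text{single}}}
  &= \frac{7.41190}{3.55109} \approx 2.09
\end{aligned}
\]
```

All three ratios are close to 2, which is also the ratio of the array sizes in bytes, and none is anywhere near 64.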
The question is: why does the double precision code run only 2 times slower than the single precision code? And if our naive guess is correct, how can we invoke the dedicated double precision compute units instead of the single precision units?
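One more number that may help whoever answers: a rough estimate of the memory traffic per second in the timed region. It assumes that each of the four loops reads every array element once and writes it once, and that the cpu_time interval covers the complete execution of all four kernels. Neither assumption has been checked with a profiler. With B = N·M·prec bytes per array:

```latex
\[
\begin{aligned}
\text{single precision:}\quad
  \frac{4 \times 2 \times 8.4 \times 10^{9}\ \text{bytes}}{0.097426\ \text{s}}
  &\approx 6.9 \times 10^{11}\ \text{bytes/s} \approx 690\ \text{GB/s} \\[4pt]
\text{double precision:}\quad
  \frac{4 \times 2 \times 1.68 \times 10^{10}\ \text{bytes}}{0.19757\ \text{s}}
  &\approx 6.8 \times 10^{11}\ \text{bytes/s} \approx 680\ \text{GB/s}
\end{aligned}
\]
```

Under these assumptions, both runs move roughly the same number of bytes per second, so the measured times scale with the amount of data rather than with the arithmetic precision.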