Conjg a device array

Hello,

I understand that the conjg intrinsic can be used on an array or a matrix; for instance, I read in the documentation:

ZU = dot_product(CONJG(ZX),ZY) ! BLAS ZDOTU equivalent
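On the host this behaves as documented; here is a minimal sketch I used to convince myself (the vectors are just made-up data):

program dotcheck
  ! dot_product already conjugates its first complex argument, so
  ! wrapping that argument in CONJG cancels the conjugation and yields
  ! the unconjugated sum, i.e. the BLAS ZDOTU convention.
  implicit none
  complex(8) :: zx(3), zy(3)
  zx = [(1.0d0, 2.0d0), (3.0d0, -1.0d0), (0.0d0, 1.0d0)]
  zy = [(2.0d0, 0.0d0), (1.0d0, 1.0d0), (-1.0d0, 3.0d0)]
  print *, dot_product(conjg(zx), zy)   ! ZDOTU-style result
  print *, sum(zx * zy)                 ! same value
end program dotcheck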

I understand that conjg can also be used on data stored on the device.

However, if I call conjg on a device-side array/matrix/tensor, it segfaults. I am adding a small reproducer below.
Am I calling conjg incorrectly?

Thanks!

program conjtest

  implicit none

  complex(8), device, allocatable, dimension(:,:,:) :: d_A
  complex(8), allocatable, dimension(:,:,:) :: A
  integer(4) :: size, i, j, k
  real(8) :: r1, r2

  size = 16

  allocate( d_A( 1:size, 1:size, 1:size ))
  allocate( A( 1:size, 1:size, 1:size ))

  do i = 1,size
     do j = 1,size
        do k = 1,size
           call random_number( r1 )
           call random_number( r2 )
           d_A( k,i,j ) = cmplx( r1, r2, kind=8 )
        end do
     end do
  end do

  d_A = A

  ! This works: an explicit CUF kernel loop that conjugates element by element
  !$cuf kernel do(2) <<<*,*>>>
  do i = 1,size
     do j = 1,size
        do k = 1,size
           d_A( k,j,i ) = conjg( d_A( k,j,i ) )
        end do
     end do
  end do

  ! This segfaults at runtime: whole-array conjg on a device array
  d_A = conjg( d_A )
  
  A = d_A

  deallocate( A, d_A )
end program conjtest

Take a look at the Fortran CUDA library documentation: NVIDIA Fortran CUDA Library Interfaces Version 24.3 for ARM, OpenPower, x86.
We recognize certain patterns and call into the cuTENSOR library; for these types of array operations we call the ElementwiseBinary function in cuTENSOR. The simple form you have here doesn’t map to that function (we don’t currently support this form). You can rewrite it so we can recognize it, like this:
alpha = cmplx(1.0d0, 0.0d0)
beta = cmplx(0.0d0, 0.0d0)
d_A = alpha * conjg( d_A ) + beta * d_A
You need to add “use cutensorex” at the top of the program and declare alpha and beta as complex(8). Let me know if that works for you.

Thank you for your reply! I tried the program below, but the line:

  d_A = alpha * conjg( d_A ) + beta * d_A

gives me the following link errors:

$ nvfortran -cpp -cuda -O3 -o conjtest conjtest.f90 
/bin/ld: warning: /tmp/pgcudafatl7D_cHSJypf7r.o: missing .note.GNU-stack section implies executable stack
/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
/bin/ld: /tmp/nvfortranSVD_ce_tSoavM.o: in function `MAIN_':
/tmp/conjtest.f90:32: undefined reference to `__nvf_defer_conjg_ndarray_'
/bin/ld: /tmp/conjtest.f90:32: undefined reference to `__nvf_defer_scalec8_unnd_'
/bin/ld: /tmp/conjtest.f90:32: undefined reference to `__nvf_defer_scalec8_rnd_'
/bin/ld: /tmp/conjtest.f90:32: undefined reference to `__nvf_defer_plus_dd_nd_'
/bin/ld: /tmp/conjtest.f90:32: undefined reference to `__nvf_defer_assign_bc8_nd_'
pgacclnk: child process exit status 1: /bin/ld

I cannot find anything about these missing routines; do I need to link against a specific library?

program conjtest

  use cutensorex
  
  implicit none

  complex(8), device, allocatable, dimension(:,:,:) :: d_A
  complex(8), allocatable, dimension(:,:,:) :: A
  integer(4) :: size, i, j, k
  real(8) :: r1, r2
  complex(8):: alpha, beta

  size = 16
 
  allocate( d_A( 1:size, 1:size, 1:size ))
  allocate( A( 1:size, 1:size, 1:size ))

  do i = 1,size
     do j = 1,size
        do k = 1,size
           call random_number( r1 )
           call random_number( r2 )
           d_A( k,i,j ) = cmplx( r1, r2, kind=8 )
        end do
     end do
  end do

  d_A = A

  alpha = cmplx(1.0d0, 0.0d0)
  beta = cmplx(0.0d0, 0.0d0)
  d_A = alpha * conjg( d_A ) + beta * d_A
  
  A = d_A

  deallocate( A, d_A )
end program conjtest

Add the flag “-cudalib=cutensor” to your link line to link against the cuTENSOR library.
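For example, building on the command line you used above:

$ nvfortran -cpp -cuda -O3 -cudalib=cutensor -o conjtest conjtest.f90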

Awesome, thank you! I also fixed my mistake in the initialization (A = ... instead of d_A = ...), and I checked the numerical results: they all look good :)
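For anyone who finds this later, my check was along these lines (a sketch; A_ref is an extra host array added just for the comparison, and the loop now fills A before the upload):

  complex(8), allocatable, dimension(:,:,:) :: A_ref   ! hypothetical helper array

  allocate( A_ref( 1:size, 1:size, 1:size ))
  A_ref = conjg( A )                   ! reference result via the host intrinsic
  d_A = A                              ! upload the host data
  d_A = alpha * conjg( d_A ) + beta * d_A
  A = d_A                              ! copy the device result back
  print *, maxval(abs( A - A_ref ))    ! expected to print 0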
