Example about using texture

Hello.

I’m reading the 2016 CUDA Fortran Reference Guide, in particular the section about texture use.

And I’ve run into a tricky example code there.

module memtests
 real(8), texture, pointer :: t(:) ! declare the texture
 contains
 attributes(device) integer function bitrev8(i)
 integer ix1, ix2, ix
 ix = i
 ix1 = ishft(iand(ix,z'0aa'),-1)
 ix2 = ishft(iand(ix,z'055'), 1)
 ix = ior(ix1,ix2)
 ix1 = ishft(iand(ix,z'0cc'),-2)
 ix2 = ishft(iand(ix,z'033'), 2)
 ix = ior(ix1,ix2)
 ix1 = ishft(ix,-4)
 ix2 = ishft(ix, 4)
 bitrev8 = iand(ior(ix1,ix2),z'0ff')
 end function bitrev8
 
 attributes(global) subroutine without( a, b )
 real(8), device :: a(*), b(*)
 i = blockDim%x*(blockIdx%x-1) + threadIdx%x
 j = bitrev8(threadIdx%x-1) + 1
 b(i) = a(j)
 return
 end subroutine
 attributes(global) subroutine withtex( a, b )
 real(8), device :: a(*), b(*)
 i = blockDim%x*(blockIdx%x-1) + threadIdx%x
 j = bitrev8(threadIdx%x-1) + 1
 b(i) = t(j) ! This subroutine accesses a through the texture
 return
 end subroutine
end module memtests

program t
use cudafor
use memtests
real(8), device, target, allocatable :: da(:)
real(8), device, allocatable :: db(:)
integer, parameter :: n = 1024*1024
integer, parameter :: nthreads = 256
integer, parameter :: ntimes = 1000
type(cudaEvent) :: start, stop
real(8) b(n)
allocate(da(nthreads))
allocate(db(n))
istat = cudaEventCreate(start)
istat = cudaEventCreate(stop)
db = 100.0d0
da = (/ (dble(i),i=1,nthreads) /)
call without<<<n/nthreads, nthreads>>> (da, db)
istat = cudaEventRecord(start,0)
do j = 1, ntimes
 call without<<<n/nthreads, nthreads>>> (da, db)
end do
istat = cudaEventRecord(stop,0)
istat = cudaDeviceSynchronize()
istat = cudaEventElapsedTime(time1, start, stop)
time1 = time1 / (ntimes*1.0e3)
b = db
print *,sum(b)==(n*(nthreads+1)/2)
db = 100.0d0
t => da ! assign the texture to da using f90 pointer assignment
call withtex<<<n/nthreads, nthreads>>> (da, db)
istat = cudaEventRecord(start,0)
do j = 1, ntimes
 call withtex<<<n/nthreads, nthreads>>> (da, db)
end do
istat = cudaEventRecord(stop,0)
istat = cudaDeviceSynchronize()
istat = cudaEventElapsedTime(time2, start, stop)
time2 = time2 / (ntimes*1.0e3)
b = db
print *,sum(b)==(n*(nthreads+1)/2)
print *,"Time with textures",time2
print *,"Time without textures",time1
print *,"Speedup with textures",time1 / time2
deallocate(da)
deallocate(db)
end

This is the example on pages 17-18.

What is bitrev8 doing?

Do I have to do that kind of mapping when I use textures?

Hi,

It’s just doing bit-reversed addressing, which is common in FFTs; it’s only there as an example workload. The point is that reads through the texture cache are implemented in CUDA Fortran as reads via an F90 pointer, which should be declared at the module level. Here’s a simpler example.

module tex
real(8), texture, pointer :: t_d_a(:)
contains
attributes(global) subroutine tex_kernel(b)
real(8), device :: b(*)
i = blockDim%x*(blockIdx%x-1) + threadIdx%x
b(i) = t_d_a(i) + 1.0d0
return
end subroutine
end module tex

program main
use cudafor
use tex
real(8) h_b(10)
real(8), target, device :: d_a(10), d_b(10)
d_a = 1.0d0
t_d_a => d_a
call tex_kernel<<<1,10>>>(d_b)
h_b = d_b
print *,all(h_b.eq.2.0d0)
end program main
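
In case it helps with the bitrev8 question itself: here is a small host-only sketch (my own, not from the guide) that produces the same index mapping with a plain loop over the bits; bitrev8 computes exactly this mapping, just faster, by swapping adjacent bits, then 2-bit pairs, then nibbles with those masked shifts.

program bitrev_demo
 implicit none
 integer :: k, b, r
 do k = 0, 3
  r = 0
  ! reverse the 8 bits of k: bit b moves to position 7-b
  do b = 0, 7
   if (btest(k, b)) r = ibset(r, 7 - b)
  end do
  print '(a,i3,a,i4)', ' index', k, ' -> bit-reversed', r
 end do
end program bitrev_demo

So index 0 maps to 0, 1 maps to 128, 2 to 64, 3 to 192, and so on; in the guide’s kernels, j = bitrev8(threadIdx%x-1) + 1 therefore scatters the reads of a across the array instead of walking through it contiguously.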

Oh, thank you very much for the clear answer.

Then you are saying that texture variables should be declared in modules, and that the kernel subroutines should be contained in the same module as well, right?

What if I want to use multiple GPUs?

If I USE the module on the host and launch kernels on multiple GPUs using OpenMP, does it still work?

I want to learn the proper way of using modules that contain device variables in multi-GPU implementations.

Generally, module data is global, and there’s just one copy. The same applies to module data that resides on the GPU. For performance, you’ll likely want a separate copy of the data for each GPU, so you’ll have to name the copies differently or access the data indirectly somehow. I’ll try to put together an example.

Here’s an example that I’ve run on up to 8 threads and 8 GPUs, written by a colleague of mine.

% pgf90 -mp textureMGPU.cuf
% ./a.out
Device 0: Tesla K80
Device 1: Tesla K80
Device 2: Tesla K80
Device 3: Tesla K80
Device 4: Tesla K80
Device 5: Tesla K80
Device 6: Tesla K80
Device 7: Tesla K80

# OMP threads: 8

Device 0 max error: 0.000000
First 3: 2.000000 2.000000 2.000000
Device 1 max error: 0.000000
First 3: 4.000000 4.000000 4.000000
Device 2 max error: 0.000000
First 3: 6.000000 6.000000 6.000000
Device 3 max error: 0.000000
First 3: 8.000000 8.000000 8.000000
Device 4 max error: 0.000000
First 3: 10.00000 10.00000 10.00000
Device 5 max error: 0.000000
First 3: 12.00000 12.00000 12.00000
Device 6 max error: 0.000000
First 3: 14.00000 14.00000 14.00000
Device 7 max error: 0.000000
First 3: 16.00000 16.00000 16.00000


module kernels_m
real, texture, pointer :: aTex(:)
contains
attributes(global) subroutine k(b, s)
real :: b(*)
integer, value :: s
integer :: i
i = blockDim%x*(blockIdx%x-1)+threadIdx%x
b(i) = aTex(i)+s
end subroutine k
end module kernels_m

program main
use cudafor
use omp_lib
use kernels_m

implicit none

integer, parameter :: n = 256*1024
type distributedArray
real, device, pointer :: v(:)
end type distributedArray

type(distributedArray) :: da(0:7), db(0:7)
real :: res(n)

type(cudaDeviceProp) :: prop
integer :: i, istat, ip1, numdevices

istat = cudaGetDeviceCount(numdevices)

do i = 0, numdevices-1
istat = cudaGetDeviceProperties(prop, i)
write(*,'("Device ",i4,": ",a)') i, trim(prop%name)
end do

call omp_set_num_threads(numdevices)

!$OMP PARALLEL PRIVATE(i, ip1, istat)
i = omp_get_thread_num()
if (i==0) write(*,*) '# OMP threads:', omp_get_num_threads()
istat = cudaSetDevice(i)
allocate(da(i)%v(n), db(i)%v(n))
ip1 = i+1
da(i)%v = ip1
aTex => da(i)%v
call k<<<1024,256>>>(db(i)%v, ip1)
!$OMP END PARALLEL

do i = 0, numdevices-1
istat = cudaSetDevice(i)
res = db(i)%v
write(*,*) 'Device ',i,' max error: ', maxval(abs(res-2.*i-2.))
write(*,*) 'First 3:', res(1:3)
end do
end program main

Also, I should mention that dummy arguments marked as intent(in) will be read through the texture cache in CUDA Fortran. The texture mechanism you are using here is more cumbersome, but it was put in place before a method to read through the texture cache without using textures was available. Now, intent(in) dummy arguments are read using the CUDA ldg() functionality.
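
As a rough sketch of my own (not from the guide; the module, kernel, and argument names are just placeholders), a kernel only has to declare its read-only array argument with intent(in) to get those cached reads, with no module-level texture pointer or pointer assignment:

module ldg_m
 contains
 ! hypothetical kernel: marking a as intent(in) lets the compiler
 ! read it with ldg()-style loads through the texture/read-only path
 attributes(global) subroutine addone( a, b, n )
 real(8), device, intent(in) :: a(*)
 real(8), device :: b(*)
 integer, value :: n
 integer :: i
 i = blockDim%x*(blockIdx%x-1) + threadIdx%x
 if (i <= n) b(i) = a(i) + 1.0d0
 end subroutine addone
end module ldg_m

From the host it launches like any other kernel, for example call addone<<<(n+255)/256, 256>>>(d_a, d_b, n), where d_a and d_b are device arrays you have already allocated.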

Many thanks to you and your colleague.

Does using the INTENT(IN) attribute work in device subprograms as well?

By a device subprogram I mean a subroutine with ATTRIBUTES(DEVICE) that is called from a kernel.