I am currently trying to port a Fortran code to the GPU. I am using CUDA Fortran with PGI compiler version 18.10-1.
The program I am trying to execute on the GPU looks like this:
```fortran
program xyz
  do i = 1, N  ! desired loop to push to the GPU
    call subroutine1(r(i), rad(i))
  end do
  ...
end program xyz

module test
contains
  subroutine subroutine1(x, y)
    call subroutine2(x)
    ! additional operations
    call subroutine3(y)
  end subroutine subroutine1

  subroutine subroutine2(x)
    ..
  end subroutine subroutine2
end module test
```
Now I want to deploy this loop on the GPU. After reading the manual and guidelines, I was able to come up with something like this:
```fortran
program xyz
  call subroutine1_parallel<<<1, N>>>(r, Rad)
  istat = cudaDeviceSynchronize()
  ...
end program xyz

module test
contains
  attributes(global) subroutine subroutine1_parallel(r, Rad)
    i = threadIdx%x
    if (i < n) then
      call subroutine1(r(i), Rad(i))
    end if
  end subroutine subroutine1_parallel

  attributes(device) subroutine subroutine1(x, y)
    call subroutine2(x)
    call subroutine3(y)
    ..
  end subroutine subroutine1
end module test
```
I have two questions regarding this:
Is there a more efficient way to call a subroutine in a loop when its input arguments depend on the loop index?
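For context, the indexing pattern I found in the CUDA Fortran examples for launches that span more than one block looks roughly like this (my own untested sketch; the `tBlock`/`grid` names and the extra `n` argument are placeholders I introduced):

```fortran
! Sketch of a multi-block launch, assuming on the host side:
!   tBlock = dim3(256, 1, 1)
!   grid   = dim3(ceiling(real(N) / tBlock%x), 1, 1)
!   call subroutine1_parallel<<<grid, tBlock>>>(r, Rad, N)
attributes(global) subroutine subroutine1_parallel(r, Rad, n)
  real :: r(:), Rad(:)
  integer, value :: n
  integer :: i
  ! global index built from block and thread indices (1-based in Fortran)
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) call subroutine1(r(i), Rad(i))
end subroutine subroutine1_parallel
```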
The cudaDeviceSynchronize() call doesn't seem to wait for subroutine2 and subroutine3 to finish executing; it only seems to wait for subroutine1 to finish. Am I doing something wrong? The program seems to terminate for no reason.
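For reference, this is how I am currently checking the launch status (again a sketch based on the cudafor module; I may be misusing the API):

```fortran
call subroutine1_parallel<<<1, N>>>(r, Rad)
istat = cudaGetLastError()            ! reports launch/configuration errors
if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
istat = cudaDeviceSynchronize()       ! waits for the whole kernel, including
                                      ! any device subroutines it calls
if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
```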
I am new to parallel programming and to CUDA Fortran, so any guidance is much appreciated.