Question on executing subroutines in a loop on the GPU

Hello,

I am currently trying to move a Fortran code to run on the GPU. I am using CUDA Fortran with PGI compiler version 18.10-1.

The program I am trying to execute on the GPU looks like this:

F95 code:

program xyz
  do i = 1, N              ! desired loop to push to GPU
    call subroutine1(r(i), rad(i))
  end do
  ...
end program xyz

module test
contains
  subroutine subroutine1(x, y)
    call subroutine2(x)  ! additional operations
    call subroutine3(y)
  end subroutine subroutine1

  subroutine subroutine2(x)
    ..
  end subroutine subroutine2
end module test

Now I want to deploy this loop on the GPU. After reading the manual and guidelines, I was able to come up with something like this:

program xyz
  call subroutine1_parallel<<<1,N>>>(r, Rad)
  istat = cudaDeviceSynchronize()
  ...
  ..
end program xyz

module test
contains
  attributes(global) subroutine subroutine1_parallel(r, Rad, n)
    i = threadIdx%x          ! threadIdx%x is 1-based in CUDA Fortran
    if (i <= n) then
      call subroutine1(r(i), Rad(i))
    end if
  end subroutine subroutine1_parallel

  attributes(device) subroutine subroutine1(x, y)
    call subroutine2(x)
    call subroutine3(y)
    ..
    ..
  end subroutine subroutine1
end module test
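For comparison, here is a minimal self-contained sketch of the same pattern. The module name, the placeholder arithmetic inside subroutine1, and the 256-thread block size are my own illustrative assumptions, not your actual code:

```fortran
module gpu_mod
  use cudafor
  implicit none
contains
  attributes(device) subroutine subroutine1(x, y)
    real, intent(inout) :: x, y
    x = x * 2.0          ! placeholder for the real work done by subroutine2
    y = y + 1.0          ! placeholder for the real work done by subroutine3
  end subroutine subroutine1

  attributes(global) subroutine subroutine1_parallel(r, rad, n)
    real, device :: r(*), rad(*)
    integer, value :: n
    integer :: i
    ! global 1-based element index across all blocks
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) call subroutine1(r(i), rad(i))
  end subroutine subroutine1_parallel
end module gpu_mod

program xyz
  use cudafor
  use gpu_mod
  implicit none
  integer, parameter :: n = 1000
  real, device :: r_d(n), rad_d(n)
  integer :: istat
  r_d = 1.0
  rad_d = 2.0
  ! launch enough 256-thread blocks to cover all n elements
  call subroutine1_parallel<<<(n + 255) / 256, 256>>>(r_d, rad_d, n)
  istat = cudaDeviceSynchronize()
end program xyz
```

Launching with multiple blocks rather than <<<1,N>>> also avoids the 1024-threads-per-block hardware limit once N grows large.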

I have two questions regarding this

  1. Is there a more efficient way to call a subroutine in a loop which takes input arguments as a function of the loop index?

  2. The cudaDeviceSynchronize() statement doesn't seem to wait for subroutine2 and subroutine3 to finish execution; it only seems to wait for subroutine1 to finish. Am I doing something wrong? The program seems to terminate for no reason.

I am new to parallel programming and CUDA Fortran; any guidance is much appreciated.

Thank you!

Hi Teja Konduri,

  1. Is there a more efficient way to call a subroutine in a loop which takes input arguments as a function of the loop index?

This should be fine. The only thing you want to avoid is passing non-contiguous sub-arrays as arguments, since the compiler will need to create a temporary array so the sub-array is passed as a contiguous array.
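To illustrate, here is a plain-Fortran sketch with made-up names; work() stands in for the device subroutine, and the copy-in/copy-out applies because it takes an explicit-shape dummy argument:

```fortran
program contiguity
  implicit none
  integer, parameter :: n = 8
  real :: r(n)
  r = 1.0
  ! r(1:n:2) is non-contiguous (stride 2): the compiler must copy the
  ! section into a contiguous temporary before the call, and copy it
  ! back out afterwards - expensive on the GPU
  call work(r(1:n:2), n/2)
  ! r(1:n/2) is already contiguous: passed directly, no temporary
  call work(r(1:n/2), n/2)
contains
  subroutine work(x, m)
    integer :: m
    real :: x(m)
    x = x * 2.0
  end subroutine work
end program contiguity
```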

  2. The cudaDeviceSynchronize() statement doesn't seem to wait for subroutine2 and subroutine3 to finish execution; it only seems to wait for subroutine1 to finish. Am I doing something wrong? The program seems to terminate for no reason.

cudaDeviceSynchronize can only be called from the host. Within a kernel, you'd call "syncthreads" (the CUDA Fortran equivalent of CUDA C's "__syncthreads").

There's not sufficient information to help determine why the code is failing. Can you post a reproducing example?
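As an illustration of where each call belongs (the block-reversal kernel and all names here are invented for the example):

```fortran
module sync_demo
  use cudafor
  implicit none
contains
  attributes(global) subroutine reverse_block(a)
    real, device :: a(*)
    real, shared :: tmp(64)        ! per-block scratch in shared memory
    integer :: i
    i = threadIdx%x
    tmp(i) = a(i)
    call syncthreads()             ! device side: all threads of THIS block
                                   ! wait here until tmp is fully populated
    a(i) = tmp(64 - i + 1)         ! now safe to read another thread's slot
  end subroutine reverse_block
end module sync_demo

program demo
  use cudafor
  use sync_demo
  implicit none
  real, device :: a_d(64)
  real :: a_h(64)
  integer :: istat, i
  a_h = [(real(i), i = 1, 64)]
  a_d = a_h
  call reverse_block<<<1,64>>>(a_d)
  istat = cudaDeviceSynchronize()  ! host side: wait for the whole kernel
  a_h = a_d
end program demo
```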

-Mat

Hi Mat,

Thank you for your reply.

The only thing you want avoid is to pass in non-contiguous sub-arrays

I learned this the hard way. I created temporary arrays to pass into the subroutine so that reshaping and memory allocation are not a problem.

cudaDeviceSynchronize can only be called from the host. Within the kernel, you’ll want to call “__syncthreads”.

I am calling cudaDeviceSynchronize() from the host (program). My subroutines 1, 2, and 3 are executing on the device. Do you mean that I need to add syncthreads statements in subroutines 1, 2, and 3?

On a separate note, in order to debug the problem (right now the code just exits without any error), is there a way to see print* statements executed on the GPU in the output stream? I mean a print statement encountered in a subroutine that is running on the device.

Thank you so much for your help

Teja

On a separate note, in order to debug the problem (right now the code just exits without any error), is there a way to see print* statements executed on the GPU in the output stream? I mean a print statement encountered in a subroutine that is running on the device.

It turns out I was not using cudaDeviceSynchronize() properly. Once that was fixed, print statements from the device are directed to the output stream.
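For anyone who hits the same issue, a minimal sketch (module and kernel names are made up) of making device-side print* output appear by synchronizing before the program exits:

```fortran
module print_demo
  use cudafor
  implicit none
contains
  attributes(global) subroutine hello()
    ! print* is supported in device code with CUDA Fortran
    print *, 'hello from thread', threadIdx%x
  end subroutine hello
end module print_demo

program show_device_print
  use cudafor
  use print_demo
  implicit none
  integer :: istat
  call hello<<<1,4>>>()
  ! without this synchronize, the program may exit before the
  ! device-side output buffer is flushed to stdout
  istat = cudaDeviceSynchronize()
end program show_device_print
```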