Hello,
I am currently trying to move a Fortran code to run on the GPU. I am using CUDA Fortran and the PGI compiler, version 18.10-1.
The program I am trying to execute on the GPU looks like this:
F95 code:
program xyz
  use test
  do i = 1, N ! desired loop to push to GPU
    call subroutine1(r(i), rad(i))
  end do
  ...
end program xyz

module test
contains
  subroutine subroutine1(x, y)
    call subroutine2(x) ! additional operations
    call subroutine3(y)
  end subroutine subroutine1

  subroutine subroutine2(x)
    ..
  end subroutine subroutine2
end module test
Now I want to deploy this loop on the GPU. After reading the manual and guidelines, I was able to come up with something like this:
program xyz_gpu
  use cudafor
  use test
  call subroutine1_parallel<<<1, N>>>(r_d, rad_d, N)
  istat = cudaDeviceSynchronize()
  ...
  ..
end program xyz_gpu

module test
contains
  attributes(global) subroutine subroutine1_parallel(r, rad, n)
    i = threadIdx%x
    if (i <= n) then   ! threadIdx%x is 1-based
      call subroutine1(r(i), rad(i))
    end if
  end subroutine subroutine1_parallel

  attributes(device) subroutine subroutine1(x, y)
    call subroutine2(x)
    call subroutine3(y)
    ..
    ..
  end subroutine subroutine1
end module test
I have two questions regarding this:
- Is there a more efficient way to call a subroutine in a loop whose input arguments are a function of the loop index?
- The cudaDeviceSynchronize() statement doesn't seem to wait for subroutine2 and subroutine3 to finish execution; it only seems to wait for subroutine1 to finish. Am I doing something wrong? The program seems to terminate for no reason.
I am new to parallel programming and to CUDA Fortran, so any guidance is much appreciated.
Thank you!
Hi Teja Konduri,
- Is there a more efficient way to call a subroutine in a loop which takes input arguments as a function of the loop index?
This should be fine. The only thing you want to avoid is passing non-contiguous sub-arrays as arguments, since the compiler will need to create a temporary array so that the sub-array is passed as a contiguous array.
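To illustrate (a hypothetical sketch; the array and kernel names are made up, not from the original code): a strided slice such as a_d(1:2*n:2) is non-contiguous, so passing it directly makes the compiler build a hidden temporary on every call. Packing it into your own contiguous device buffer makes that cost explicit and pays it once:

```fortran
! Hypothetical sketch (names invented): avoiding an implicit
! temporary for a non-contiguous sub-array argument.
real, device :: a_d(2*n)   ! full device array
real, device :: tmp_d(n)   ! contiguous scratch buffer

! a_d(1:2*n:2) is strided; passing it straight to a kernel would
! force the compiler to create and copy a hidden temporary.
tmp_d = a_d(1:2*n:2)       ! explicit, one-time packing copy

call mykernel<<<grid, block>>>(tmp_d, n)
```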
- The cudaDeviceSynchronize() statement doesn't seem to wait for subroutine2 and subroutine3 to finish execution, it just seems to wait for subroutine1 to finish; am I doing something wrong, because the program seems to terminate for no reason.

cudaDeviceSynchronize can only be called from the host. Within the kernel, you'll want to call "syncthreads".

There's not sufficient information to help determine why the code is failing. Can you post a reproducing example?
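To make the distinction concrete, here is a minimal sketch (the kernel and variable names are invented for illustration): the host-side call waits for the whole kernel, while syncthreads is a barrier among the threads of one block inside the kernel.

```fortran
! Sketch: cudaDeviceSynchronize on the host vs. syncthreads on the device.
module kernels
contains
  attributes(global) subroutine work(x, n)
    real :: x(*)
    integer, value :: n
    integer :: i
    i = threadIdx%x
    if (i <= n) x(i) = x(i) + 1.0
    call syncthreads()   ! barrier across the threads of this block only
  end subroutine work
end module kernels

program main
  use cudafor
  use kernels
  real, device :: x_d(64)
  integer :: istat
  x_d = 0.0
  call work<<<1, 64>>>(x_d, 64)
  istat = cudaDeviceSynchronize()  ! host blocks until the kernel,
                                   ! including any device subroutines
                                   ! it calls, has finished
end program main
```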
-Mat
Hi Mat,
Thank you for your reply.
The only thing you want avoid is to pass in non-contiguous sub-arrays
I have learned this the hard way. I created temporary arrays to pass into the subroutine so that reshaping and memory allocation are not a problem.
cudaDeviceSynchronize can only be called from the host. Within the kernel, you'll want to call "syncthreads".

I am calling cudaDeviceSynchronize() from the host (the main program). My subroutines 1, 2, and 3 are executing on the device. Do you mean that I need to add syncthreads statements in subroutines 1, 2, and 3?
On a separate note, to be able to debug the problem (right now the code just exits without any error), is there a way to see print* statements executed on the GPU in the output stream? I mean a print statement encountered in a subroutine that is running on the device.
Thank you so much for your help
Teja
On a separate note, to be able to debug the problem (right now the code just exits without any error), is there a way to see print* statements executed on the GPU in the output stream? I mean a print statement encountered in a subroutine that is running on the device.
It turns out I was not using cudaDeviceSynchronize() properly. Once the host synchronizes correctly, print statements from the device are directed to the output stream.
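For anyone finding this later, a minimal sketch of the pattern that worked for me (module and kernel names are invented): device-side print* output is buffered and only shows up once the host synchronizes.

```fortran
! Sketch: device-side print* appears after the host synchronizes.
module dbg
contains
  attributes(global) subroutine hello()
    print *, 'hello from thread', threadIdx%x   ! buffered on the device
  end subroutine hello
end module dbg

program main
  use cudafor
  use dbg
  integer :: istat
  call hello<<<1, 4>>>()
  istat = cudaDeviceSynchronize()   ! flushes the device print* buffer
end program main
```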