Suppose I have Fortran code of the form

do t=1,tn ! main loop over time
  call a(...)
  call b(...)
  ...
end do
Many of the results and arrays passed into one routine are also needed by the following routine: for example, “b” might use the results of “a”, and/or some of the same input parameters that did not change in “a”.
Now, I have a CUDA kernel written for each routine. Currently there is a lot of unnecessary data transfer: the results of “a” are copied back to the host, then immediately copied back to the GPU for “b”. How do I keep the data that “b” needs resident on the GPU after “a” finishes, so it can be used directly?
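In other words, I think the pattern I'm after looks something like the sketch below. To be clear, kernel_a, kernel_b, n, tn, and the arithmetic inside the kernels are all placeholder stand-ins for my real routines, not my actual code; the point is only the data movement: allocate device buffers once, copy inputs in before the time loop, and let the intermediate array stay on the device between the two kernels.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Stand-in for routine "a" (the real computation is different).
__global__ void kernel_a(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Stand-in for routine "b", which consumes "a"'s result.
__global__ void kernel_b(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main(void) {
    const int n  = 1 << 20;
    const int tn = 100;  // number of time steps (placeholder value)

    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // Device buffers are allocated once and reused across the time loop.
    float *d_in, *d_mid, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_mid, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Copy the inputs to the device once, before the time loop.
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    for (int t = 0; t < tn; ++t) {
        // d_mid never leaves the device: kernel_b reads "a"'s result
        // directly, with no cudaMemcpy back to the host in between.
        kernel_a<<<blocks, threads>>>(d_in, d_mid, n);
        kernel_b<<<blocks, threads>>>(d_mid, d_out, n);
    }

    // Copy the final results back once, after the loop.
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_mid); cudaFree(d_out);
    free(h);
    return 0;
}
```

My understanding is that since both launches go to the default stream, kernel_b won't start until kernel_a has finished, so no explicit synchronization or host round-trip is needed between them — but I'm not sure whether this is actually the idiomatic way to structure it.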
One way to do it would be to combine all the subroutines into one massive routine and launch it as one massive kernel. That seems the most straightforward way, but I'm wondering what the more elegant/clean solution is. I know this must be basic CUDA 101, but I'm having trouble finding documentation that covers exactly this.