I’m aware that with any of the CUDA compilers (I’m working with PGI Fortran specifically), if no host/device/shared attribute is explicitly declared, the default is host. However, the manual also notes that a function/subroutine can be declared with both the host and device attributes. The only way I can see this working is if the default is host within a host subroutine and device within a device subroutine. I’m not sure how else you could benefit from that dual declaration.
Does anyone know for sure if this works the way it is implied?
In CUDA C, if you mark a function as both host and device, it basically gets compiled twice, once for use in host code and once for use in device code. It’s a nice convenience to avoid having to repeat yourself for utility functions that need to work both places.
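As a minimal sketch of that idea (the names `HOST_DEVICE`, `clamp01`, `clamp_kernel` and `clamp_on_host` are made up for illustration, not taken from any library), the same utility function can be called from a kernel and from ordinary host code. The macro guard is only there so the file also builds with a plain C++ compiler; under nvcc it expands to the two qualifiers.

```cpp
// Under nvcc the function below is compiled for both host and device;
// under a plain C++ compiler the qualifiers simply vanish.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical utility: clamp a value to the range [0, 1].
HOST_DEVICE inline float clamp01(float x) {
    // Plain local variables carry no memory-space qualifier. The compiler
    // places them itself (registers or local memory on the device,
    // registers or the stack on the host).
    float lo = 0.0f;
    float hi = 1.0f;
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

#ifdef __CUDACC__
// Device-side caller: one thread per element.
__global__ void clamp_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp01(data[i]);
}
#endif

// Host-side caller: an ordinary serial loop over the same function.
inline void clamp_on_host(float* data, int n) {
    for (int i = 0; i < n; ++i) data[i] = clamp01(data[i]);
}
```

On the host you would call `clamp_on_host(ptr, n)` directly; on the device you would launch something like `clamp_kernel<<<blocks, threads>>>(dev_ptr, n)` on memory allocated with `cudaMalloc`. The source of `clamp01` is written only once either way.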
But when you declare a function as such, and it contains a variable declaration with no explicitly stated location (host, device), does it get compiled for main memory in the host version and global device memory in the device version?
It may help to explain the scenario. I have a workhorse of a loop in a subroutine/method high up the hierarchy. Within that loop, 11 other subroutines/methods are called. Assuming (and this is a big assumption, but for argument’s sake) that none of these has any data coupling with other modules/classes, can I simply add a host,device tag to each and call them from my new kernel subroutine/method as normal?
It doesn’t compile to anything in the device version. In CUDA C, all device functions are inline expanded in kernel code during compilation. So any local scope variables in the device function will resolve into a register or a local memory location inside any kernels that “call” the device function. All of the mechanics of this are completely handled by the compiler and transparent to the programmer. You don’t need to worry about how it works.
If you are talking about local variables in the function, they have to go to registers or local memory in the device case and registers or system memory in the host case. The scope of these variables is confined to the function, so it doesn’t really matter exactly where the compiler chooses to store the values.
In general, yes. I have no experience with the PGI Fortran compiler, but in the CUDA C case, this fails only if the functions access global (in the scoping sense) variables, or make use of C++ templates that nvcc cannot compile.
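To illustrate the global-variable caveat in CUDA C terms, here is a rough sketch, assuming the shared state can be gathered into a small struct and passed in as an argument. `ScaleParams`, `scale_value`, `scale_kernel` and `scale_on_host` are hypothetical names, and the problematic version is left as a comment because the device pass is not expected to accept it.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Problematic pattern (kept as a comment): a file-scope variable lives in
// host memory, so the device version of the function cannot read it.
//
//   double g_gain = 2.0;
//   HOST_DEVICE double scale_value(double x) { return x * g_gain; }

// Workaround sketched here: pass the former globals in explicitly.
struct ScaleParams {
    double gain;
    double offset;
};

HOST_DEVICE inline double scale_value(double x, ScaleParams p) {
    return x * p.gain + p.offset;
}

#ifdef __CUDACC__
// Kernel arguments are copied to the device at launch, so the struct
// can be passed by value.
__global__ void scale_kernel(double* data, int n, ScaleParams p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale_value(data[i], p);
}
#endif

// Host-side loop using the same function.
inline void scale_on_host(double* data, int n, ScaleParams p) {
    for (int i = 0; i < n; ++i) data[i] = scale_value(data[i], p);
}
```

Passing parameters by value keeps the function free of hidden state, which is what lets the same source compile for both sides. Whether the equivalent restructuring is needed for Fortran module variables under PGI is something I can't confirm.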
Again, in the CUDA C case, the main requirement is that the source for all of the device functions you use has to be visible to the CUDA compiler at the same time, because they will all be inlined. (There is no device linker that could be used to combine object files, as is commonly done with normal CPU code.) This often results in having to #include other source files. I have no idea how CUDA Fortran works in this respect.
Both of your responses are extremely detailed and helpful. Unfortunately this program was written very much with serial processing in mind (it’s been live software since the 90s) and has the structure of storing many globally necessary variables in separate modules and just importing them as needed. Since Fortran is pass by reference (no need for return statements), it can also become very confusing which variables are being altered and which are merely being referenced.
However, it does sound like I will be able then to adapt the current subroutines to run on the device with the technique mentioned above and possibly some quick edits to how information is passed to those subroutines.