I have an old Fortran 77 code that I’m trying to migrate to a GPU. I basically want to execute this old code N times, hence the GPU is a perfect fit. I wrote what is essentially a driver program to accept arrays of variables (1xN) and send them to the GPU subroutine to execute my old Fortran 77 code with one set of values at a time. In a nutshell, the GPU subroutine is going to have maybe a few dozen lines and then the rest I’d like to paste in the Fortran 77 code. Is there any way I can compile my GPU subroutine using pgfortran and include compatibility for the legacy versions of Fortran?
Thanks in advance!
Hi jeremyw,
CUDA Fortran does require a few F90 features, notably interfaces when calling global device routines. F90 Modules and Allocatable arrays make things easier, but are not required. So you should be fine provided you can partition your program accordingly.
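For example, here is a bare-bones sketch of the F90 scaffolding involved; the module name, kernel name, and kernel body are just hypothetical stand-ins for your code:

module gpu_driver
  use cudafor
contains
  ! global kernel: each thread handles one set of input values
  attributes(global) subroutine run_case(x, n)
    integer, value :: n
    double precision :: x(n)          ! dummy arguments of a global routine reside on the device
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) x(i) = 2.0d0 * x(i)   ! stand-in for one pass of the F77 work
  end subroutine run_case
end module gpu_driver

! host side: "use gpu_driver" supplies the explicit interface the launch needs
subroutine launch_all(x_host, n)
  use gpu_driver
  integer :: n
  double precision :: x_host(n)
  double precision, device, allocatable :: x_dev(:)
  allocate(x_dev(n))
  x_dev = x_host                                  ! copy inputs to the device
  call run_case<<<(n + 63) / 64, 64>>>(x_dev, n)  ! 64 threads per block
  x_host = x_dev                                  ! copy results back
  deallocate(x_dev)
end subroutine launch_all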
One thing you may consider is to use the PGI Accelerator/OpenACC directive based approach. The only time F90 is required is when passing device data between routines. If your GPU code is contained within a single subroutine, directives will make it much easier to port as well as be more portable.
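As a rough sketch (again, the names and loop body are placeholders), the directive version can be as simple as:

subroutine run_all(vals, n)
  integer :: n, i
  double precision :: vals(n)
!$acc kernels loop copy(vals)
  do i = 1, n
     vals(i) = 2.0d0 * vals(i)   ! stand-in for the per-case F77 work
  end do
end subroutine run_all

compiled with the -acc (or -ta=nvidia) flag.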
Hope this helps,
Mat
Hi Mat,
Thanks for your prompt reply. Unfortunately the F77 code I have is one primary subroutine that does the bulk of the work, but it also calls maybe 15-20 small subroutines that live in the same file.
I’ve attempted to compile the F77 code using PGFORTRAN and it spits out all kinds of errors. It seems like there are enough changes in the structure of Fortran between versions that it would require a major rewrite to get it to work. For example, here is a logical variable declaration from the F77 file…
LOGICAL PLAST,DEBUG,ROUGH,UNFLAG,TOTAL,ITS,ELASTIC,
$ DEATH,FLAG1D,FLAG2D,RELOAD
To get this to work with PGFORTRAN I’d have to do something like,
LOGICAL PLAST,DEBUG,ROUGH,UNFLAG,TOTAL,ITS,ELASTIC, &
DEATH,FLAG1D,FLAG2D,RELOAD
I have tried compiling the old code with pgf77 and it works fine. It seems like there’s got to be an easier way of doing this that doesn’t involve rewriting the old F77 code… Is there any way to use a compiled object file and call it from the GPU subroutine?
Thanks again for the help.
Hi jeremyw,
What are the errors? pgfortran should be able to compile F77 code without problems. The problem you note with the continuation line looks more like a Fixed versus Free format issue rather than pgf77 vs pgfortran.
What flags do you use to compile? What is the extension of the file?
By default, files with a “.f” or “.F” suffix are compiled as Fixed form, while “.f90” or “.F90” files are compiled as Free form. So if you had renamed your Fixed form files to use an f90 extension, that would cause errors like these. Errors would also occur if you used the “-Mfree” flag to force Free format.
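For example (file names here are just placeholders):

pgfortran -c legacy_solver.f            (a “.f” file is treated as Fixed form by default)
pgfortran -Mfixed -c legacy_solver.f90  (forces Fixed form even though the extension implies Free form)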
Is there anyway to use a compiled object file and call it from the GPU subroutine?
If you are writing CUDA Fortran code, then you can call routines with the device attribute. However, the global kernel and the device routines need to be contained within the same module.
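For example (hypothetical names and trivial bodies), the layout would look like:

module gpu_code
  use cudafor
contains
  ! device routine: stand-in for one of your small F77 subroutines
  attributes(device) double precision function helper(x)
    double precision :: x
    helper = x * x
  end function helper

  ! global kernel in the same module, calling the device routine
  attributes(global) subroutine main_kernel(a, n)
    integer, value :: n
    double precision :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = helper(a(i))
  end subroutine main_kernel
end module gpu_code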
For the directives, these routines would need to be inlined, either manually, or automatically by the compiler via the “-Minline” or “-Mipa=inline” flags.
Hi Mat,
Thanks for all of the suggestions. I’ve come a long way since my last post. My code is up and running successfully on the GPU, but now I’m tasked with the much more difficult job of optimization. It turns out the translation from F77 to F90 wasn’t as bad as I initially thought; it primarily just involved changing C’s and *'s to !'s, changing $'s to &'s, and adding declarations to subroutines. All the errors I was seeing were indirectly related to one of these three things, so they all went away very quickly once I made the translation.
Anyways, now that I’m up and running, I’m a little bit surprised at how poor the bandwidth between device and host memory is in my application.
Running bandwidthTest on my Tesla M2090, I’m seeing speeds of right around 4GB/sec for pageable memory, which seems reasonable. In my application, I’m basically just sending some really large arrays over to the device memory. One of these arrays in particular is by far the largest and dominates the transfer time. It’s a 3D array of size 6x8xM of double precision variables, where M is some arbitrary length. For the sake of testing, I defined M equal to 100,000. Therefore, the size of the 6x8xM array is 38,400,000 bytes, or 0.0384 gigabytes. At a speed of 4GB/sec this array should take somewhere in the ballpark of 10ms to send to the device. When I run the application I’m seeing a total time of about 40ms, so the transfer is happening at about 1GB/sec. Do you have any ideas why I’m seeing such poor performance? This transfer time is absolutely killing my runtime: right now it takes 40ms to transfer to the device while only 2.5ms to run the kernel M times.
For reference, I’m allocating this array by:
allocate(arr_dp_dev(6,8,M))
and transferring to device by:
arr_dp_dev(1:6,1:8,1:M) = arr_dp_host(1:6,1:8,1:M)
On a side note, I’ve been compiling with the flag -Mcuda=4.1 for the toolkit I have. I just noticed this morning I tried -Mcuda, -Mcuda=cc2.x and -Mcuda=keepbin and they all three hang the compiler. For some reason the 4.1 flag is the only way it will compile and run. Any thoughts as to why this is?
Thanks again for the assistance!
Hi jeremyw,
One thing that would help is to use the pinned attribute on “arr_dp_host”. Assuming the OS honours the request (it may ignore it), the array’s memory will be pinned in physical memory thus removing the extra host to host memory copy (virtual to physical) that’s required for DMA transfers.
Secondly, try changing “arr_dp_dev(1:6,1:8,1:M) = arr_dp_host(1:6,1:8,1:M)” to “arr_dp_dev = arr_dp_host”. This makes it easier for the compiler to optimise data movement. Though, since you’re using a 12.x compiler, this may not be necessary given improvements in our latest compiler versions.
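Putting the two together, the host side might look something like this (array names taken from your post, everything else is just a sketch):

double precision, pinned, allocatable :: arr_dp_host(:,:,:)
double precision, device, allocatable :: arr_dp_dev(:,:,:)
logical :: isPinned

allocate(arr_dp_host(6,8,M), pinned=isPinned)   ! isPinned reports whether the OS honoured the request
allocate(arr_dp_dev(6,8,M))
arr_dp_dev = arr_dp_host                        ! whole-array assignment, one contiguous transfer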
How are you profiling your code to determine your transfer speed?
On a side note, I’ve been compiling with the flag -Mcuda=4.1 for the toolkit I have. I just noticed this morning I tried -Mcuda, -Mcuda=cc2.x and -Mcuda=keepbin and they all three hang the compiler. For some reason the 4.1 flag is the only way it will compile and run. Any thoughts as to why this is?
Are your kernels large? CUDA 4.0, which is the default, can have issues with large kernels and gives the appearance that the compilation has hung. To see if it’s hanging in the CUDA back-end tools, add the “-v” flag. If the hang occurs during the “pgnvd” step, then it’s the NVIDIA tools.
Now we’re getting somewhere!
try changing “arr_dp_dev(1:6,1:8,1:M) = arr_dp_host(1:6,1:8,1:M)” to “arr_dp_dev = arr_dp_host”. This makes it easier for the compiler to optimise data movement. Though, since you’re using a 12.x compiler, this may not be necessary given improvements in our latest compiler versions.
I tried this first, and amazingly it took the transfer rate from 1GB/sec to 4GB/sec. This may be something you would like to pass along to development.
One thing that would help is to use the pinned attribute on “arr_dp_host”. Assuming the OS honours the request (it may ignore it), the array’s memory will be pinned in physical memory thus removing the extra host to host memory copy (virtual to physical) that’s required for DMA transfers.
I tried this next and it took me from 4GB/sec to 6GB/sec! Nice!
How are you profiling your code to determine your transfer speed?
For the most part I am using calls to CPU_TIME, except for timing the kernel. I know I should be using the CUDA event records but I haven’t taken the time to set that up. After the kernel is done running there’s nothing left in the host code to execute, so to get the kernel time I take the full time to run the host code and subtract the time it takes to go from the beginning up to the kernel call. I figured this should be a brute-force way of doing it.
Are your kernels large? CUDA 4.0, which is the default, can have issues with large kernels and gives the appearance that the compilation has hung. To see if it’s hanging in the CUDA back-end tools, add the “-v” flag. If the hang occurs during the “pgnvd” step, then it’s the NVIDIA tools.
You were spot on with this. Yes, the kernel is quite large with many subroutines. Adding the -v flag sure enough indicated the hang you mentioned. I tried -Mcuda=4.1 -Mcuda=keepbin and it worked this time. Guess the older toolkits are no good for my code but I can live with that.
Now, I thought everything was splendid until I started looking closer at the difference between pinned and paged memory. Check out these run-time comparisons:
Paged memory for my host arrays:
Total GPU run-time: 220ms
Time to transfer arr_dp_host (192MB): 38.5ms, or close to 5GB/sec
Kernel run-time: 11ms
Pinned memory for my host arrays:
Total GPU run-time: 314ms
Time to transfer arr_dp_host (192MB): 31ms, or about 6.2GB/sec
Kernel run-time: 112ms
Any idea what happened with the pinned memory that caused my kernel to take 10x longer?
So to get the kernel time I take the full time to run the host code and subtract the time it takes to go from the beginning up to the kernel call.
Be careful here since kernels are launched asynchronously to the host code. So unless you have some synchronisation point, such as a device to host copy or a call to cudaThreadSynchronize, before calling CPU_TIME, you’re not really timing the kernel.
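A minimal sketch of what I mean, with the kernel name and launch configuration as placeholders:

real :: t0, t1
integer :: istat

call CPU_TIME(t0)
call my_kernel<<<numBlocks, numThreads>>>(arr_dp_dev, M)
istat = cudaThreadSynchronize()   ! block until the kernel has actually finished
call CPU_TIME(t1)
print *, 'kernel time (s): ', t1 - t0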
Any idea what happened with the pinned memory that caused my kernel to take 10x longer?
It doesn’t make much sense that it would, so my best guess is that it’s your timing method that’s to blame for the discrepancy.
Try setting the environment variable “CUDA_PROFILE=1” and rerun. After your run, the CUDA driver will create a “cuda_profile_0.log” file with profiling information about your kernel. You can also use pgcollect/PGPROF if you want a mixed host and device profile.
Thanks again Mat.
Everything appears to be working well now except the return of data from device memory to host memory. For some reason, when running in emulation mode this assignment actually performs the transfer, but when fully compiled and executed the return transfer doesn’t seem to happen correctly. Any thoughts?
call subroutine<<< >>> ()
i = cudaThreadSynchronize()
PSTRS_host = PSTRS_dev
PRINT*, PSTRS_host(1:6,M)
The print command returns all zeros and not the correct values when executed, but if I run in emulation mode it returns the values correctly.
Check the return status from your kernel call. It’s probably getting some resource limit error.
ierr = cudaGetLastError()
if (ierr .ne. 0) then
   print *, 'Kernel error: ', cudaGetErrorString(ierr)
endif
Wow you’re good!
Kernel error: too many resources requested for launch
When I set M to 10,000 it worked. When set to 50,000 it gave the error. How do I get around this?
What’s your formula for your grid and block sizes? Also, what is the output from “-Mcuda=4.1,ptxinfo”?
You are limited in the number of threads per block as well as the number of blocks. Also, there is a limit to the amount of resources (registers, shared memory) per block. To check the limits of your card, see the output from pgaccelinfo. To see the amount of shared memory and registers used by your kernel, see the output from “-Mcuda=ptxinfo”.
Note that often programmers will write kernels to handle a single element in an array. Hence, their block and grid size is proportional to the input size. However, if the input size grows too big, the program will run out of resources. The solution is to create a fixed-size schedule (grid and block size) and write the kernels to handle multiple elements in an array. This is what I needed to do in my article on Tuning a Monte Carlo Algorithm on GPUs.
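In outline (the names and loop body are placeholders), such a kernel looks something like:

attributes(global) subroutine process_all(a, n)
  integer, value :: n
  double precision :: a(6,8,n)
  integer :: i, stride
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  stride = gridDim%x * blockDim%x
  do while (i <= n)
     a(1,1,i) = 2.0d0 * a(1,1,i)   ! stand-in for the real per-element work
     i = i + stride
  end do
end subroutine process_all

The launch then uses a fixed grid and block size no matter how large the input gets.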
I tuned the number of threads based on what gave me the best performance. I basically just tried a few values and settled on 64. My definitions look like this:
numThreads = 64
numBlocks = (M + numThreads - 1) / numThreads
ptxinfo provides this info:
ptxas info : Compiling entry function ‘anaconc_model_gpu’ for ‘sm_20’
ptxas info : Function properties for anaconc_model_gpu
1840 bytes stack frame, 480 bytes spill stores, 508 bytes spill loads
ptxas info : Used 63 registers, 180+0 bytes smem, 196 bytes cmem[0], 144 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]
I have a bunch of variables defined in my kernel that I didn’t explicitly declare as shared. I assumed these for the most part went to thread-local memory and are therefore slow to access. My plan was to eventually change this to use all shared memory, and then derive the number of threads such that I don’t exceed the shared memory.
I also noticed that if I make my # of threads something like 512 or 1024 the performance gets significantly worse, which I’m guessing means I’m using up all the registers with that many threads.
Also, I did code my kernel such that it can handle multiple blocks, and the total number of values in my arrays doesn’t have to be perfectly divisible by the # of threads per block.
I guess I still don’t understand why I’m unable to transfer these values back from the device. Each thread computes the updated values for these arrays, and the last thing each thread does is send its current value back to the global array in device memory. Do you have any idea what part of that process is exceeding its limitations?
Hi jeremyw,
Do you have any idea what part of that process is exceeding its limitations?
Nope, sorry. Can you send a reproducing example to PGI Customer Support (trs@pgroup.com) and ask them to forward it to me?
Thanks,
Mat