How to optimize parallel computation using CUDA Fortran


I have a Fortran code (it does a lot of fluid mechanics calculations) that I think I can speed up using CUDA Fortran. I was able to write a program that multiplies two two-dimensional matrices (400,400) in almost 1/80 of the time the CPU needs for the same multiplication.
But, unfortunately, copying the result from device memory to host memory (which you need to do if you want to write it to a file or do further processing on the CPU) takes such a ridiculously long time that it is no longer worth performing the task on the GPU! I hope this is not the case.
Does anyone know what mistake I made?
Can you please help? I found that there is little (or almost no) written guidance on how to program in CUDA Fortran, unlike CUDA C.

Please advise.


Hi Dolf,

How long is “ridiculously long” and “almost 1/80 of the time needed”? How are you measuring your performance?

How are you copying your arrays? Copying whole arrays is often faster than copying sub-arrays. A whole array can be copied in one contiguous block, while sub-arrays need to be copied in many small chunks. Given the high overhead of data movement, I find that minimising the number of copies is more important than the amount of data being copied. For sub-arrays, it’s better to gather the data into a single block before copying. I touch upon this topic in my article Multi-GPU Programming Using CUDA Fortran, MPI, and GPUDirect
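As a rough sketch of the gather-then-copy idea (the program name, array sizes, and the particular strided section are made up for illustration), the host packs a non-contiguous section into a contiguous buffer and then moves it with a single whole-array assignment:

```fortran
program gather_copy
  use cudafor
  implicit none
  integer, parameter :: n = 500, m = 400        ! hypothetical sizes
  real :: a(n,m)
  real, allocatable :: buf(:,:), check(:,:)     ! contiguous host buffers
  real, device, allocatable :: buf_dev(:,:)     ! device copy of the buffer

  call random_number(a)

  ! Rows 101:200 of every second column are not contiguous in memory,
  ! so gather them into one contiguous block on the host first.
  allocate(buf(100, m/2), check(100, m/2))
  buf = a(101:200, 1:m:2)

  ! Whole-array assignment: one contiguous host-to-device transfer.
  allocate(buf_dev(100, m/2))
  buf_dev = buf

  ! Copy back and compare to confirm the round trip.
  check = buf_dev
  if (all(check == buf)) then
    print *, 'PASS'
  else
    print *, 'FAIL'
  end if

  deallocate(buf_dev)
  deallocate(buf, check)
end program gather_copy
```

Whether the packing pays off depends on how fragmented the section is; for a section that is already contiguous, the extra host copy only adds work.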

If you are copying the whole array using array syntax (i.e. Arr_host = Arr_device ), add the “pinned” attribute to your host array. This will request, but not guarantee, that the OS put the array in physical non-swapping memory, i.e. “pinned”. In order to perform a DMA transfer (i.e. copy data to the device), the memory must be pinned. So without the “pinned” attribute, the memory must first be copied from virtual to physical memory and then transferred to the device. The “pinned” attribute eliminates the need for this extra copy. The caveat of using “pinned” is that this data is managed by the CUDA device driver. Hence, if you destroy your CUDA context, this memory will be destroyed as well.
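Here is a minimal sketch of the “pinned” attribute in use; the program name and sizes are placeholders. Since the request is not guaranteed, the PINNED= specifier on the allocate statement is used to find out whether page-locked memory was actually obtained:

```fortran
program pinned_copy
  use cudafor
  implicit none
  integer, parameter :: n = 500, l = 500      ! hypothetical sizes
  real, pinned, allocatable :: c(:,:)         ! host array, page-locked if possible
  real, device, allocatable :: c_dev(:,:)     ! device array
  logical :: is_pinned
  integer :: istat

  ! PINNED= returns .true. only if the memory really was page-locked.
  allocate(c(n,l), stat=istat, pinned=is_pinned)
  if (istat /= 0) stop 'host allocation failed'
  if (.not. is_pinned) print *, 'warning: fell back to pageable memory'

  allocate(c_dev(n,l))
  c_dev = 1.0          ! fill the device array with a known value

  c = c_dev            ! whole-array copy, device to host

  print *, 'pinned:', is_pinned, ' max value copied back:', maxval(c)

  deallocate(c_dev)
  deallocate(c)
end program pinned_copy
```

The pinned attribute only applies to allocatable host arrays, so statically sized host arrays would need to be changed to allocatable to try this.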

If neither of these helps, then you’re stuck. Your only other options are to reduce the frequency of the copies or to increase the amount of computation done on the device.

  • Mat

Another thing to check is that your device is performing as expected. I’ve seen instances where the card was installed in the wrong PCI slot, with a single channel as opposed to a quad channel, giving a quarter of the memory performance. You can check this by running the “pgaccelinfo” utility and looking at the bandwidth test. For comparison, here’s the output from my C2070 system.

% pgaccelinfo
CUDA Driver Version: 4010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 285.05.33 Thu Jan 19 14:07:02 PST 2012

Device Number: 0
Device Name: Tesla C2070
Device Revision Number: 2.0
Global Memory Size: 6441598976
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 1494 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 31185 microseconds
Current free memory: 5934874624
Upload time (4MB): 958 microseconds ( 711 ms pinned)
Download time: 1040 microseconds ( 672 ms pinned)
Upload bandwidth: 4378 MB/sec (5899 MB/sec pinned)
Download bandwidth: 4032 MB/sec (6241 MB/sec pinned)

Thanks Mat for your prompt reply. I have two questions to ask you:

  1. How can I run the accelinfo? Do I run a special command in cmd?
  2. How can I check that the memory bandwidth is at its optimal value? I am new to this technology (NVIDIA and CUDA).
    I don’t have the Tesla. I wish our budget could afford one; instead I thought I should start with something simple like an NVIDIA GeForce 460 graphics card, installed in a Dell XPS 8300 (8 i7 cores), and run a test to see what my GPU is capable of.
    Also, regarding your question about the time needed to copy: I have developed the code below. Please look at it, as it might make it easier to understand what I am trying to do. I wrote it myself with my limited knowledge of CUDA Fortran, since there are no books available other than some on CUDA C.
    I used CALL DATE_AND_TIME to measure the run time in milliseconds. The multiplication in the GPU kernel took 1 msec; the same task took the CPU 80 msec, so that’s a big achievement for me. But when I copy from GPU to CPU memory, I get a big shock: the copy took 227 msec! That is almost 3 times longer than the CPU needs for the multiplication. I did not use the pinned attribute, though; that will be my next task.

Here is the code:

module variables

  implicit none
  ! Matrix dimensions; declared as parameters so they can size the host arrays
  integer, parameter :: N = 500
  integer, parameter :: M = 400
  integer, parameter :: L = 500
  real :: sum = 0.0
end module variables

module mmul_mod
  use cudafor
  use variables
  implicit none

contains

  ! Each block computes one element C(blkidx,blkidy)
  attributes (global) subroutine mmul_kernel (A,B,C, N,M,L)
    integer, value :: N,M,L
    real :: A(N,M) , B(M,L) , C(N,L)
    real :: sum
    integer :: k, blkidx, blkidy

    blkidx = blockidx%x
    blkidy = blockidx%y

    sum = 0.0

    do k = 1, M
      sum = sum + (A(blkidx,k) * B(k,blkidy))
    end do
    C(blkidx,blkidy) = sum

  end subroutine mmul_kernel

end module mmul_mod

program mat_mult
  use variables
  use mmul_mod
  implicit none

  integer i,j,k
  real, dimension (N,M) :: A
  real, dimension (M,L) :: B
  real, dimension (N,L) :: C
  real, device, allocatable, dimension (:,:) :: Adev, Bdev, Cdev
  integer :: start_time(8), end_time(8)
  type(dim3) :: blocks

  ! Fill the inputs with test data
  call random_number(A)
  call random_number(B)

  allocate (Adev(N,M), Bdev(M,L), Cdev(N,L))

  ! Copy the inputs from host to device
  Adev = A(1:N,1:M)
  Bdev (:,:) = B(1:M,1:L)
  blocks = dim3(N, L, 1)

  ! One block per element of C, one thread per block
  call mmul_kernel <<<blocks, 1>>> (Adev, Bdev, Cdev, N,M,L)

  ! Copy the result from device to host
  C(1:N,1:L) = Cdev
  deallocate (Adev, Bdev, Cdev)

end program mat_mult

how can I run the accelinfo?? do I run a special command in cmd??

You can run “pgaccelinfo” from a command line shell. Are you using PVF? If so, then open a PGI DOS command shell from the “Start” menu.

  1. how can I check that memory bandwidth are the optimal value??

Actually, I’m not really sure. The product spec page lists the memory bandwidth but this is the on-chip bandwidth, not the host to device. I suspect that it will depend upon a lot of factors besides the card itself.

here is the code:

First off, your timing code is incorrect. Kernels are launched asynchronously from the host code. Hence, after the call the host continues until it reaches a synchronisation point, which in this case is the data transfer of the C array. So here you’re timing the data transfer to the device plus a little overhead of calling the kernel, but not the kernel itself. To fix, either add a call to “cudaThreadSynchronize” just after your kernel call, or better yet, use CUDA events to time the code. In my article Tuning a Monte Carlo Algorithm on GPUs, I show an example of using cuda events.
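To illustrate the event-based approach, here is a sketch that times a kernel and the copy back to the host separately. The kernel scale_kernel and all sizes are hypothetical stand-ins, not taken from the article mentioned above. Note also that newer CUDA releases deprecate cudaThreadSynchronize in favour of cudaDeviceSynchronize, so the latter is probably the one to use if you choose explicit synchronisation instead of events.

```fortran
module scale_mod
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in kernel: doubles every element of x
  attributes(global) subroutine scale_kernel(x, n)
    integer, value :: n
    real :: x(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) x(i) = 2.0 * x(i)
  end subroutine scale_kernel
end module scale_mod

program time_kernel
  use cudafor
  use scale_mod
  implicit none
  integer, parameter :: n = 1024*1024          ! hypothetical size
  real, allocatable :: x(:)
  real, device, allocatable :: x_dev(:)
  type(cudaEvent) :: t_start, t_stop
  real :: ms_kernel, ms_copy                   ! elapsed times in milliseconds
  integer :: istat, nblocks

  allocate(x(n), x_dev(n))
  x = 1.0
  x_dev = x

  istat = cudaEventCreate(t_start)
  istat = cudaEventCreate(t_stop)

  ! Time the kernel alone. The launch is asynchronous, so wait on the
  ! stop event before reading the elapsed time.
  nblocks = (n + 255) / 256
  istat = cudaEventRecord(t_start, 0)
  call scale_kernel<<<nblocks, 256>>>(x_dev, n)
  istat = cudaEventRecord(t_stop, 0)
  istat = cudaEventSynchronize(t_stop)
  istat = cudaEventElapsedTime(ms_kernel, t_start, t_stop)

  ! Time the device-to-host copy alone.
  istat = cudaEventRecord(t_start, 0)
  x = x_dev
  istat = cudaEventRecord(t_stop, 0)
  istat = cudaEventSynchronize(t_stop)
  istat = cudaEventElapsedTime(ms_copy, t_start, t_stop)

  print *, 'kernel time (ms):', ms_kernel
  print *, 'copy time (ms):  ', ms_copy
  print *, 'all elements doubled:', all(x == 2.0)

  istat = cudaEventDestroy(t_start)
  istat = cudaEventDestroy(t_stop)
  deallocate(x, x_dev)
end program time_kernel
```

Because the kernel has already finished when the second start event is recorded, the copy time here no longer hides any kernel execution time.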

Did you mean to only use a block? This will give you very poor performance. So I suspect that the problem here is not the data copy, but rather how you are scheduling your kernel. Note that PVF ships with an example CUDA Fortran Matmul project and the Workstation products ship with an optimised example, sgemm.cuf.
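For comparison, here is a sketch of a launch that gives every block a 16x16 tile of threads, with each thread computing one element of C. This is not the shipped Matmul example or sgemm.cuf; the names and the 16x16 block size are assumptions, and it does not use shared memory, so treat it as a starting point only.

```fortran
module mmul_threads_mod
  use cudafor
  implicit none
contains
  ! One thread per element of C; threads are grouped into 2D blocks
  attributes(global) subroutine mmul_threads(a, b, c, n, m, l)
    integer, value :: n, m, l
    real :: a(n,m), b(m,l), c(n,l)
    integer :: i, j, k
    real :: acc
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
    ! Guard against threads that fall outside the matrix
    if (i <= n .and. j <= l) then
      acc = 0.0
      do k = 1, m
        acc = acc + a(i,k) * b(k,j)
      end do
      c(i,j) = acc
    end if
  end subroutine mmul_threads
end module mmul_threads_mod

program mat_mult_threads
  use cudafor
  use mmul_threads_mod
  implicit none
  integer, parameter :: n = 500, m = 400, l = 500
  real :: a(n,m), b(m,l), c(n,l)
  real, device, allocatable :: a_dev(:,:), b_dev(:,:), c_dev(:,:)
  type(dim3) :: grid, tblock

  call random_number(a)
  call random_number(b)

  allocate(a_dev(n,m), b_dev(m,l), c_dev(n,l))
  a_dev = a
  b_dev = b

  ! 256 threads per block; round the grid up so it covers all of C
  tblock = dim3(16, 16, 1)
  grid   = dim3((n + 15) / 16, (l + 15) / 16, 1)
  call mmul_threads<<<grid, tblock>>>(a_dev, b_dev, c_dev, n, m, l)

  c = c_dev    ! whole-array copy; also waits for the kernel to finish

  ! Compare against the host intrinsic; expect only rounding differences
  print *, 'max abs difference vs matmul:', maxval(abs(c - matmul(a, b)))

  deallocate(a_dev, b_dev, c_dev)
end program mat_mult_threads
```

With this layout consecutive threads in a block read consecutive rows of a column, which suits Fortran's column-major storage.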

  • Mat