Help me with my first CUDA Fortran program.

I downloaded PGI Workstation Complete 10.3 for Linux x86_64, read the PDF files, and wrote my first CUDA Fortran program, but it fails to compile.

Thanks a lot.

module cacmu
use cudafor
attributes(global) subroutine cac(n,x)
implicit none
integer :: n
real :: x
integer :: i
do i=1,N
end subroutine cac
end module cacmu

program main
use cudafor
use cacmu
implicit none
integer :: n_=1000000*64
real :: x_
call cac<<<n_/64,64>>>(n_,x_)
print *,x_
end program main

Then I compiled it as follows:
$ pgf95 1.cuf
PGF90-S-0188-Argument number 1 to cac: type mismatch (1.cuf: 22)
PGF90-S-0188-Argument number 2 to cac: type mismatch (1.cuf: 22)
0 inform, 0 warnings, 2 severes, 0 fatal for main

My GPU card is a Gigabyte GT 240 1GB DDR5, which should support CUDA compute capability 1.2.

[root]# pgaccelinfo
CUDA Driver Version 3000

Device Number: 0
Device Name: GeForce GT 240
Device Revision Number: 1.2
Global Memory Size: 1073020928
Number of Multiprocessors: 12
Number of Cores: 96
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment 256B
Clock Rate: 1462 MHz
Initialization time: 14333 microseconds
Current free memory 1020030976
Upload time (4MB) 6511 microseconds (1371 ms pinned)
Download time 3893 microseconds ( 961 ms pinned)
Upload bandwidth 644 MB/sec (3059 MB/sec pinned)
Download bandwidth 1077 MB/sec (4364 MB/sec pinned)

You have several problems with your code and compilation. Two of them will prevent you from compiling. First, the scalar arguments in your kernel need the value attribute: integer, value :: n and real, value :: x. Second, you must compile with -Mcuda. A further problem is that I don’t think you understand how GPU programming works: your kernel has no references to threads. Perhaps you are thinking of the PGI Accelerator model instead (!$acc directives in Fortran, #pragma acc in C).
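For illustration, here is a minimal sketch of what a version with those fixes plus per-thread indexing might look like. It is not the original poster's intended computation: the kernel body (each thread storing its own index), the device array x_d, and the smaller problem size are assumptions made for the example. Note also that a value argument cannot carry a result back to the host, so the sketch returns data through a device array instead, and that the original launch of n_/64 = 1000000 blocks exceeds the 65535 maximum grid dimension reported by pgaccelinfo.

```fortran
module cacmu
  use cudafor
contains
  attributes(global) subroutine cac(n, x)
    implicit none
    integer, value :: n        ! scalar passed by value from the host
    real, device :: x(n)       ! array resident in device memory
    integer :: i
    ! Global 1-based index of this thread across the whole grid
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Guard against threads that fall past the end of the array
    if (i <= n) x(i) = real(i)
  end subroutine cac
end module cacmu

program main
  use cudafor
  use cacmu
  implicit none
  integer, parameter :: n = 64 * 1024   ! assumed size: 1024 blocks of 64 threads
  real, allocatable :: x(:)             ! host copy
  real, device, allocatable :: x_d(:)   ! device copy (hypothetical name)
  allocate(x(n), x_d(n))
  call cac<<<n/64, 64>>>(n, x_d)
  x = x_d                               ! device-to-host copy
  print *, x(1), x(n)
  deallocate(x, x_d)
end program main
```

Assuming the file is still called 1.cuf, this would be built with pgf95 -Mcuda 1.cuf.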

Hi l4linux,

In CUDA, every thread executes the same kernel. In your example, every thread sequentially sums a value on its own. As BeachHut suggests, you could get this code running, but I doubt it would be very fast or do what you intended.

While you can perform a sum reduction in parallel (I touch upon it in my last PGI Insider Article), it’s rather difficult. Instead, you should consider starting with a simple Matmul program (See:

Hopefully, this will get you started. If not, please let us know and we’ll try to help further.

  • Mat

Oh, yes, I’m so stupid.
Thank you all, for your messages.