Originally published at: https://developer.nvidia.com/blog/easy-introduction-cuda-fortran/
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. This post is the first in a series on CUDA Fortran, which is the Fortran interface to the CUDA parallel computing platform. If you are familiar with CUDA C, then you are already well on your…
Hi all,
I am a beginner to CUDA (CUDA Fortran). I have installed PGI 13.9, but when I try to debug even a very simple CUDA Fortran code I get plenty of errors such as the ones below:
Error 1 unresolved external symbol cudaSetupArgument referenced in function mathops_saxpy_ saxp.obj
Error 2 unresolved external symbol cudaLaunch referenced in function mathops_saxpy_ saxp.obj
Error 3 unresolved external symbol __cudaRegisterFatBinary referenced in function mathops_saxpy_ saxp.obj
Error 4 unresolved external symbol __cudaRegisterFunction referenced in function mathops_saxpy_ saxp.obj
Error 5 unresolved external symbol __cudaUnregisterFatBinary referenced in function mathops_saxpy_ saxp.obj
Error 6 unresolved external symbol pgf90_dev_auto_alloc04 referenced in function MAIN_ saxp.obj
Error 7 unresolved external symbol pgf90_dev_copyin referenced in function MAIN_ saxp.obj
...
Error 12 unresolved external symbol CUDAFOR saxp.obj
Furthermore, when I check the project properties I see that CUDA Fortran is not even enabled; when I enable it and debug again, I get the same errors. I would appreciate it if someone could help me with this problem.
Thanks,
Reza
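For reference, those unresolved symbols (cudaLaunch, __cudaRegisterFatBinary, and so on) are CUDA runtime entry points, so linker errors like these usually mean the file was compiled or linked without CUDA Fortran support enabled. A minimal sketch of a command-line build that does pull in the CUDA runtime, assuming the single-file saxpy.cuf example from the post:

pgf90 -Mcuda -o saxpy saxpy.cuf

In PGI Visual Fortran the rough equivalent is turning on the CUDA Fortran option in the project's Fortran language properties (the exact property name may differ between PVF versions).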
Hi,
I tried to run this script and it returned 'Max error: 2.0000'.
Where does this error come from?
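For context, in the post's saxpy example the host arrays are initialized to x = 1.0 and y = 2.0 with a = 2.0, so every element of y should come back as 4.0, and the program prints maxval(abs(y-4.0)). A max error of 2.0 therefore means y came back unchanged, i.e. the kernel never actually ran. A minimal sketch of a check after the launch (istat is assumed to be declared as an integer; the error-query functions come from the cudafor module):

call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
istat = cudaGetLastError()                      ! did the launch itself fail?
if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
y = y_d                                         ! copy the result back to the host
write(*,*) 'Max error: ', maxval(abs(y-4.0))    ! 2.0 here means y is still 2.0

A frequent cause is building for a compute capability that does not match the GPU, as the follow-up below about -Mcuda=cc60 suggests.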
From cudaDeviceProp in CUDA Fortran I got:
Device Number: 0
Device name: GeForce GTX 1060 3GB
Memory Clock Rate (KHz): 4004000
Memory Bus Width (bits): 192
Peak Memory Bandwidth (GB/s): 192.19
This works when I use pgf90 -Mcuda=cc60 -o saxpy saxpy.cuf to compile.
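For reference, output like this can be produced with a device-query program along the lines of the one in the post; a minimal sketch (field names taken from the cudaDeviceProp type in the cudafor module):

program deviceQuery
  use cudafor
  implicit none
  type (cudaDeviceProp) :: prop
  integer :: istat
  istat = cudaGetDeviceProperties(prop, 0)
  write(*,"('  Device name: ',a)") trim(prop%name)
  write(*,"('  Memory Clock Rate (KHz): ',i0)") prop%memoryClockRate
  write(*,"('  Memory Bus Width (bits): ',i0)") prop%memoryBusWidth
  write(*,"('  Peak Memory Bandwidth (GB/s): ',f7.2)") &
       2.0 * prop%memoryClockRate * (prop%memoryBusWidth/8) / 10.0**6
end program deviceQuery

The peak-bandwidth line is just 2 x memory clock x bus width in bytes, which for the numbers above works out to the 192.19 GB/s shown.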
Thanks! I had the exact same problem, and your solution fixed it.
What we really need to discuss is why the "results" differ depending on the compute capability (or compiler version).
Hi, I am experiencing the same problem you did. It seems to me that the kernel subroutine never returns any values to the host. I have tried your suggestion, but it is still not working.
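One way to check whether the kernel is actually executing is to synchronize after the launch and inspect the error status; a minimal sketch, assuming the saxpy example from the post and an integer istat:

call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
istat = cudaDeviceSynchronize()    ! wait for the kernel; returns launch/execution errors
if (istat /= cudaSuccess) print *, 'Kernel error: ', trim(cudaGetErrorString(istat))

If this reports something like an invalid device function or missing kernel image, the code was most likely built for the wrong compute capability (see the -Mcuda=ccXY suggestion above).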
I have a question about the p2pBandwidth code on page 128. Lines 52-53 and lines 55-56 are the same. Is that right?
50 do i = 0, nDevices-1
51    if (i == j) cycle
52    istat = cudaMemcpyPeer(distArray(j)%a_d, j, &
53         distArray(i)%a_d, i, N)
54    istat = cudaEventRecord(startEvent, 0)
55    istat = cudaMemcpyPeer(distArray(j)%a_d, j, &
56         distArray(i)%a_d, i, N)
57    istat = cudaEventRecord(stopEvent, 0)
58    istat = cudaEventSynchronize(stopEvent)
59    istat = cudaEventElapsedTime(time, &
60         startEvent, stopEvent)
I think lines 52-53 should be removed. Maybe I am wrong; if not, could you give me an explanation?
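For reference, a common pattern in bandwidth tests, and possibly what the book intends here (that is an assumption, not a confirmation), is to issue one untimed warm-up transfer before the timed one so that any one-time peer-access setup cost is excluded from the measurement. A sketch of that reading, using the same names as the snippet above:

! untimed warm-up copy: absorbs any one-time setup cost of the peer path
istat = cudaMemcpyPeer(distArray(j)%a_d, j, distArray(i)%a_d, i, N)
! timed copy, bracketed by the events
istat = cudaEventRecord(startEvent, 0)
istat = cudaMemcpyPeer(distArray(j)%a_d, j, distArray(i)%a_d, i, N)
istat = cudaEventRecord(stopEvent, 0)
istat = cudaEventSynchronize(stopEvent)
istat = cudaEventElapsedTime(time, startEvent, stopEvent)

On that reading, lines 52-53 are not redundant even though they repeat lines 55-56.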