Greetings,
I am very new to CUDA Fortran and have been struggling to get my own (F77/F90) code working with CUDA. The program actually runs, and even gives correct results, but it is very slow. I used nvprof to see how the program was doing on the GPU, and the results were very encouraging:
Time(%)      Time    Calls       Avg       Min       Max  Name
 63.54      4.60s    23581  195.01us  194.66us    1.13ms  wavefundercuda
 32.97      2.39s    23582  101.20us   76.94us    1.21ms  primdercuda
  2.58   186.70ms   188685     989ns     800ns   11.04us  [CUDA memcpy HtoD]
  0.91    65.85ms    47163    1.40us    1.31us   22.08us  [CUDA memcpy DtoH]
  0.00    17.37us        1   17.37us   17.37us   17.37us  denmatcuda
It seems the program spent just under 7 seconds computing on the GPU (these subroutines account for over 90% of the total runtime; I have profiled my serial version with both pgprof and gprof), yet the overall elapsed time was over 3 minutes(!). For comparison, the elapsed time for the serial version is 43 seconds. I could not figure out what was going on until I looked at the program with 'top':
  PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+  COMMAND
 6459 avolkov   20   0 76.1g   51m   20m R  100  0.1  0:05.84  denprop
- a whopping 76 GB of virtual memory; no wonder it runs slowly. Of course, the serial version does not use anything close to even a gigabyte of RAM:
  PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+  COMMAND
 7439 avolkov   20   0 14856  2300  1156 R  100  0.0  0:03.56  denprop
Huge difference!
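For reference, top's VIRT column corresponds to VmSize in /proc/<pid>/status (reserved virtual address space), while RES corresponds to VmRSS (pages actually resident in RAM). A quick way to compare the two for any process (here reading the shell's own status file as a stand-in for the real PID) is:

```shell
# VIRT in top = VmSize (reserved virtual address space);
# RES  in top = VmRSS (physical pages actually in use).
# /proc/self is used here as a stand-in; substitute the real PID.
grep -E '^Vm(Size|RSS)' /proc/self/status
```

Substituting the PID of denprop for 'self' would show whether the 76 GB is merely reserved address space or actually resident memory.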
The serial version was compiled with the following flags:
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff
while for CUDA I used:
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff -DUSE_CUDA -lstdc++ -Mcuda
I thought there were issues with my CUDA code (and probably there are many!), but then I compiled and ran stream_cudafor.cuf (renamed to stream_cudafor.f, with ntimes changed to 100):
pgfortran stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda
and got the same virtual memory issue:
  PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+  COMMAND
 7834 avolkov   20   0 76.3g  325m   18m R  100  0.5  0:01.77  stream_cudafor
Compiling the original version with
pgf90 -Mfixed -O2 stream_cudafor.cuf -o stream_cudafor -lstdc++
gives the same memory issue.
Clearly, I am doing something wrong when compiling CUDA programs, or there is something wrong with my installation of the PGI and GCC compilers.
I use OpenSUSE 12.1 x86_64, kernel 3.1.10-1.16-desktop
pgf90 --version
pgf90 12.5-0 64-bit target on x86-64 Linux -tp bulldozer
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.
By default, OpenSUSE comes with the GCC 4.6.2 compiler, which does not seem to be compatible with CUDA Fortran:
pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforar_baPUMKUle.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted
To work around this, I had to download, compile, and install GCC 4.4.7.
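In case it matters, this is roughly how I point the toolchain at the locally built compiler (/usr/local/gcc-4.4.7 is my install prefix, matching the library path in the linker messages below; adjust to taste):

```shell
# /usr/local/gcc-4.4.7 is the local install prefix for the rebuilt compiler;
# putting its bin directory first on PATH lets pgf90's device-code pass
# pick up gcc 4.4.7 instead of the system gcc 4.6.2.
GCC44=/usr/local/gcc-4.4.7
export PATH="$GCC44/bin:$PATH"
export LD_LIBRARY_PATH="$GCC44/lib64:${LD_LIBRARY_PATH:-}"
# after this, 'which gcc' should point at the 4.4.7 install
```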
The reason I have to add the -lstdc++ switch is that, without it, I get:
pgf90 stream_cudafor.f -o stream_cudafor -Mcuda
/usr/bin/ld: /usr/local/pgi/linux86-64/12.5/lib/libcudafor4.a(pgi_memset.o): undefined reference to symbol '__gxx_personality_v0@@CXXABI_1.3'
/usr/bin/ld: note: '__gxx_personality_v0@@CXXABI_1.3' is defined in DSO /usr/local/gcc-4.4.7/lib64/libstdc++.so.6 so try adding it to the linker command line
/usr/local/gcc-4.4.7/lib64/libstdc++.so.6: could not read symbols: Invalid operation
I have a GTX 460 graphics card:
avolkov@wizard:/usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release> ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 460"
CUDA Driver Version / Runtime Version 5.0 / 4.2
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 7) Multiprocessors x ( 48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock rate: 1430 MHz (1.43 GHz)
Memory Clock rate: 1800 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce GTX 460
What am I doing wrong? Any help is greatly appreciated.
Thank you,
Anatoliy