CUDA Fortran questions

Greetings,

I am very new to CUDA Fortran and have been struggling to get my own (F77/F90) code to work with CUDA. The program actually does run, and even gives correct results, but it is very slow. I used nvprof to see how the program is doing on the GPU, and the results were very encouraging:

Time(%) Time Calls Avg Min Max Name
63.54 4.60s 23581 195.01us 194.66us 1.13ms wavefundercuda
32.97 2.39s 23582 101.20us 76.94us 1.21ms primdercuda
2.58 186.70ms 188685 989ns 800ns 11.04us [CUDA memcpy HtoD]
0.91 65.85ms 47163 1.40us 1.31us 22.08us [CUDA memcpy DtoH]
0.00 17.37us 1 17.37us 17.37us 17.37us denmatcuda

It seems it spent just under 7 seconds calculating things on the GPU (these subroutines account for over 90% of the total runtime; I have profiled my serial version with both pgprof and gprof), but the overall elapsed time was over 3 minutes(!). For comparison, the elapsed time for the serial version is 43 seconds. I could not figure out what was going on until I looked at the program with ‘top’:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6459 avolkov 20 0 76.1g 51m 20m R 100 0.1 0:05.84 denprop

A whopping 76 GB of virtual memory; no wonder it runs slowly. Of course, the serial version does not use anything close to even a gig of RAM:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7439 avolkov 20 0 14856 2300 1156 R 100 0.0 0:03.56 denprop

Huge difference!

The serial version was compiled using the following flags
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff
while for cuda I had:
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff -DUSE_CUDA -lstdc++ -Mcuda

I thought there were issues with my CUDA code (and probably there are many!), but then I compiled and ran stream_cudafor.cuf (renamed to stream_cudafor.f, with ntimes changed to 100)
pgfortran stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda
and got the same virtual memory issue:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7834 avolkov 20 0 76.3g 325m 18m R 100 0.5 0:01.77 stream_cudafor

Compiling the original version with
pgf90 -Mfixed -O2 stream_cudafor.cuf -o stream_cudafor -lstdc++
gives the same memory issue.

Clearly, I am doing something wrong when compiling CUDA programs, or there is something wrong with my installation of the PGI and GCC compilers…

I use OpenSUSE 12.1 x86_64, kernel 3.1.10-1.16-desktop

pgf90 --version

pgf90 12.5-0 64-bit target on x86-64 Linux -tp bulldozer
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.

By default, OpenSUSE comes with the GCC 4.6.2 compiler, which does not seem to be compatible with CUDA Fortran:

pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforar_baPUMKUle.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error – unsupported GNU version! gcc 4.5 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

So I had to download, compile, and install GCC 4.4.7.

The reason I have to add the -lstdc++ switch is that without it, I get:

pgf90 stream_cudafor.f -o stream_cudafor -Mcuda
/usr/bin/ld: /usr/local/pgi/linux86-64/12.5/lib/libcudafor4.a(pgi_memset.o): undefined reference to symbol ‘__gxx_personality_v0@@CXXABI_1.3
/usr/bin/ld: note: ‘__gxx_personality_v0@@CXXABI_1.3’ is defined in DSO /usr/local/gcc-4.4.7/lib64/libstdc++.so.6 so try adding it to the linker command line
/usr/local/gcc-4.4.7/lib64/libstdc++.so.6: could not read symbols: Invalid operation

I have a GTX 460 graphics card:

avolkov@wizard:/usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release> ./deviceQuery
[deviceQuery] starting…

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: “GeForce GTX 460”
CUDA Driver Version / Runtime Version 5.0 / 4.2
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 7) Multiprocessors x ( 48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock rate: 1430 MHz (1.43 GHz)
Memory Clock rate: 1800 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce GTX 460

What am I doing wrong? Any help is greatly appreciated.

Thank you,
Anatoliy

Hi Anatoliy,

Let’s start with the easy one:

pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforar_baPUMKUle.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error – unsupported GNU version! gcc 4.5 and up are not supported!

By default, release 12.5 uses CUDA 4.0, which doesn’t support gcc 4.5 or newer. The workaround is to use CUDA 4.1 by adding the “4.1” or “cuda4.1” sub-option to -Mcuda.

pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20,cuda4.1

As for the memory problem, I’m not able to recreate the issue on my GTX460 system. My openSUSE 12.1 system has a GTX690 card, but again I don’t see the issue there. My best guess is that it has to do with the downgrade of the GCC version and the use of the stdc++ library (I didn’t try recreating this).

If you have access to another system, I’d like to know whether you see the same problem there. Also, can you revert to using GCC 4.6 and add the CUDA 4.1 flag? Finally, the just-released PGI 12.6 defaults to CUDA 4.2 and is another thing to try.

Hope this helps,
Mat

Hello Mat,

Many thanks for your prompt reply.

When using the -Mcuda=cuda4.0 option, I get:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cuda4.0
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforn7VfNlhJr3fJ.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error – unsupported GNU version! gcc 4.5 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

When using the -Mcuda=cuda4.1 option, I get:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cuda4.1
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.1/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforZoXfzC0dVaTr.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.1/include/host_config.h:82:2: error: #error – unsupported GNU version! gcc 4.6 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

I guess the only option left is to try the newest release, 12.6.

Hello Mat,

I have upgraded my installation of PGI compiler suite to version 12.6.

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 --version

pgf90 12.6-0 64-bit target on x86-64 Linux -tp bulldozer
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.

I can now compile and link stream_cudafor without any issues:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cc20,cuda4.2

avolkov@wizard:/data1/avolkov/src/cuda> ldd stream_cudafor
linux-vdso.so.1 => (0x00007fff225ff000)
libcudart.so.4 => /usr/local/cuda4.2/cuda/lib64/libcudart.so.4 (0x00002b1b9f4e9000)
libnuma.so => /usr/local/pgi/linux86-64/12.6/lib/libnuma.so (0x00002b1b9f745000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b1b9f86c000)
librt.so.1 => /lib64/librt.so.1 (0x00002b1b9fa89000)
libm.so.6 => /lib64/libm.so.6 (0x00002b1b9fc92000)
libc.so.6 => /lib64/libc.so.6 (0x00002b1b9fee9000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b1ba0279000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002b1ba047e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b1ba0788000)
/lib64/ld-linux-x86-64.so.2 (0x00002b1b9f2c6000)

However, when running stream_cudafor, I can still see that it uses 76.4 GB of virtual memory, and even the resident size is 325 MB:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6658 avolkov 20 0 76.4g 325m 18m R 101 0.5 0:01.82 stream_cudafor
I guess the only reason I can actually run it is that my workstation has 64 GB of RAM (plus 64 GB of swap space).

I now think there is something wrong with my OpenSUSE system configuration, but I have no idea where to even start.

I think I will get a trial version of the PGI compiler and test it on another system.

I have just tried to run the stream_cudafor executable created as shown above on another machine that has both the cuda4.2 and PGI installations NFS-mounted from the first one. This second workstation also runs OpenSUSE 12.1 x86-64 but has different hardware (different mainboard, CPU, and memory), and instead of a GTX 460 it has a Quadro 5000:

avolkov@viz01:/data1/avolkov/src/cuda> /usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
[deviceQuery] starting…

/usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: “Quadro 5000”
CUDA Driver Version / Runtime Version 5.0 / 4.2
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2560 MBytes (2683895808 bytes)
(11) Multiprocessors x ( 32) CUDA Cores/MP: 352 CUDA Cores
GPU Clock rate: 1026 MHz (1.03 GHz)
Memory Clock rate: 1500 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 4.2, NumDevs = 1, Device = Quadro 5000
[deviceQuery] test results…
PASSED

exiting in 3 seconds: 3…2…1…done!


So, when on the second machine (viz01), I can see the stream_cudafor executable and can verify that all the appropriate libraries are available:

avolkov@viz01:/data1/avolkov/src/cuda> ldd ./stream_cudafor
linux-vdso.so.1 => (0x00007fff1b7ff000)
libcudart.so.4 => /usr/local/cuda4.2/cuda/lib64/libcudart.so.4 (0x00002b2b759f6000)
libnuma.so => /usr/local/pgi/linux86-64/12.6/lib/libnuma.so (0x00002b2b75c52000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b2b75d73000)
librt.so.1 => /lib64/librt.so.1 (0x00002b2b75f90000)
libm.so.6 => /lib64/libm.so.6 (0x00002b2b76199000)
libc.so.6 => /lib64/libc.so.6 (0x00002b2b763f0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b2b7677f000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002b2b76984000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b2b76c8e000)
/lib64/ld-linux-x86-64.so.2 (0x00002b2b757d3000)

However, when running it, I get the following:
avolkov@viz01:/data1/avolkov/src/cuda> ./stream_cudafor
Error: illegal instruction, illegal opcode
rax 0000000000000001, rbx 000000000000001e, rcx 0000000000000000
rdx 00002ae6efd94450, rsp 00007fff06b31c20, rbp 00007fff06b31c50
rsi 0000000000000009, rdi 00000000ffffffff, r8 000000000000ffff
r9 0000000000000001, r10 00002ae6efa87a20, r11 000000000000000b
r12 0000000000000001, r13 00007fff06b31e20, r14 0000000000000000
r15 0000000000000000
— traceback not available
Abort

The way I see it, both the GTX 460 and the Quadro 5000 should support compute capability 2.0 (which is what my executable was compiled for), but for some reason it does not work…

Does this information help in any way?

I tried to recompile the code on the Quadro 5000 machine, but even after I installed the trial keys, I get:
avolkov@viz01:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cc20,cuda4.2
pgi-f95-lin64: LICENSE MANAGER PROBLEM: Failed to checkout license

I have downloaded and compiled the NPB2.3 FT benchmark (C + CUDA) from http://hpcgpu.codeplex.com/releases/view/34770. When compiled using nvcc, it shows exactly the same behavior as my Fortran test codes:

nvcc c_randdp.cu ft.cu wtime.cu c_timers.cu c_print_results.cu -o cuda_ft.exe -I /usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/common/inc -O3 -arch sm_13

ldd cuda_ft.exe
linux-vdso.so.1 => (0x00007fffc9bff000)
libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00002ba475bb0000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002ba475e39000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba476143000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ba47639a000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba4765b1000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba476941000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba476b45000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba476d63000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba47598d000)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19528 avolkov 20 0 76.2g 150m 12m R 100 0.2 0:03.98 cuda_ft.exe

By the way, this is on a different OpenSUSE 12.1 x86_64 machine that also has a GTX 460 card:

/usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: “GeForce GTX 460”
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 7) Multiprocessors x ( 48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock rate: 1430 MHz (1.43 GHz)
Memory Clock rate: 1800 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce GTX 460
[deviceQuery] test results…
PASSED

exiting in 3 seconds: 3…2…1…done!

I took the cuda_ft.exe executable created on the GTX 460 machine and ran it on the Quadro 5000 machine. (Both run OpenSUSE 12.1 x86_64 but have different motherboards and CPUs. GTX 460 machine: ASUS KGPE-D16 server board, 2 x AMD Opteron 6234 processors, 64 GB DDR3 ECC RAM. Quadro 5000 machine: ASUS SABERTOOTH 990FX mainboard, AMD Phenom II X6 1090T processor, 16 GB DDR3.)
Now the executable asks for 28.8 GB of virtual memory:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24579 avolkov 20 0 28.2g 156m 21m R 101 1.0 0:01.85 cuda_ft.exe

UPDATE 1: All my Fortran test codes do the same: they ask for 76 GB of virtual memory on the Opteron 6234 machine and 28.8 GB on the Phenom II machine.
The reason I could not run the stream_cudafor compiled on the Opteron 6234 machine on the Phenom II computer is that, by default, pgf90 uses the ‘-tp bulldozer’ flag. The Opteron 6234 is indeed based on the Bulldozer architecture, but the Phenom II is a K10 CPU. Using the ‘-tp x64’ flag fixed this compatibility problem.
Unfortunately, my own code still runs very slowly in the GPU version (no matter whether it runs on the GTX 460 or the Quadro 5000), but I now think this is related to my programming rather than to any other issues.

Hi Anatoliy,

Just to be clear, you believe the issue is with your particular system running the GTX460? You’re the third or fourth person who has had a non-reproducible error with these cards. The problems are all different, with the only commonality being the GTX460. For the last one, I had our IT department build me a system with a GTX460, but I still couldn’t reproduce the problem.

With the GTX line, only the GPU chip is made by NVIDIA; the rest is assembled by various third parties. The Quadro and Tesla brands are made by NVIDIA, so they typically have a higher quality standard, which is why NVIDIA recommends using GTX cards only for graphics.

I have no idea whether the root cause is a flaky card (doubtful, since it’s virtual memory), a poor interaction with the CUDA driver (more likely), or something else entirely (most likely).

Unfortunately, my own code still runs very slowly in the GPU version (no matter whether it runs on the GTX 460 or the Quadro 5000), but I now think this is related to my programming rather than to any other issues.

Have you profiled your code to see where the time is coming from?

Three things to look for:

  • Excessive data movement.
  • Not enough parallelism.
  • Inefficient data access patterns on the device.
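
On the first point, a common pattern is copying the input arrays to the device on every single call, which your nvprof output (~190,000 HtoD copies for ~24,000 kernel launches) hints at. Below is a minimal CUDA Fortran sketch of keeping data resident on the device across launches; the module, kernel, and array names are hypothetical, not taken from your code:

```
! Hypothetical sketch: device-resident arrays, copied once instead of per call.
module dev_data
  use cudafor
  implicit none
  real, device, allocatable :: coef_d(:), wf_d(:)   ! live on the GPU
contains
  attributes(global) subroutine wavefun_kernel(coef, wf, n)
    real :: coef(*), wf(*)          ! global-routine dummies are device arrays
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) wf(i) = 2.0 * coef(i)   ! placeholder computation
  end subroutine wavefun_kernel
end module dev_data

! Host side (schematic):
!   allocate(coef_d(n), wf_d(n))
!   coef_d = coef_host                          ! one HtoD copy, up front
!   do step = 1, nsteps
!      call wavefun_kernel<<<grid, tblock>>>(coef_d, wf_d, n)
!   end do
!   wf_host = wf_d                              ! one DtoH copy, at the end
```

If the inputs genuinely change every iteration the copies are unavoidable, but even then batching many small transfers into one larger one usually helps.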

  • Mat

Hello Mat,

Honestly, I do not know what to think, because I seem to have the same issue on the Quadro 5000. Let me try to explain what has been happening with my code. I have profiled a serial version with pgprof and gprof: several subroutines account for almost 99% of the total CPU time, and they are called constantly by the main code. The good news is that what these subroutines do is parallel in nature, so I modified my code to run them on the GPU. Profiling the CUDA-enabled executable with nvprof shows that a) the calculations on the GPU are done very fast (seconds), and b) the memcpy traffic in and out of the GPU is also fairly quick (~0.2 sec):

Opteron 6234 + GTX 460:


nvprof
Time(%) Time Calls Avg Min Max Name
64.87 4.78s 23581 202.83us 202.53us 1.00ms wavefundercuda
31.75 2.34s 23582 99.27us 75.81us 2.29ms primdercuda
2.51 184.73ms 188685 979ns 800ns 9.76us [CUDA memcpy HtoD]
0.88 64.62ms 47163 1.37us 1.31us 14.94us [CUDA memcpy DtoH]
0.00 15.97us 1 15.97us 15.97us 15.97us denmatcuda

top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32697 avolkov 20 0 76.1g 68m 20m R 100 0.1 0:46.17 denprop


Phenom II + Quadro 5000:


nvprof
Time(%) Time Calls Avg Min Max Name
69.80 6.18s 23581 262.26us 261.77us 1.31ms wavefundercuda
27.09 2.40s 23582 101.78us 81.93us 2.21ms primdercuda
2.24 198.78ms 188685 1.05us 864ns 9.98us [CUDA memcpy HtoD]
0.86 76.53ms 47163 1.62us 1.57us 17.60us [CUDA memcpy DtoH]
0.00 16.10us 1 16.10us 16.10us 16.10us denmatcuda

top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27488 avolkov 20 0 28.1g 27m 22m R 100 0.2 0:13.33 denprop

The executable and input files for these two runs were the same, but there is a huge difference in memory usage (virtual: 76 GB vs 28 GB; resident: 68 MB vs 27 MB; shared: 20 MB vs 22 MB). For comparison, the serial executable's memory usage is much smaller: 17 MB of virtual memory, 3 MB resident, and 1.4 MB shared:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29900 avolkov 20 0 17212 2668 1448 R 100 0.0 0:21.92 denprop

Note that in the CUDA version no extra arrays were allocated on the host compared to the serial version; all new arrays are allocated on the device. Why such a great difference in memory usage? It does not make sense to me.
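
One way to separate the host's virtual-memory accounting from what is actually allocated on the card is to ask the CUDA runtime directly. A small sketch using the cudafor module's cudaMemGetInfo (I believe this call is available in CUDA Fortran; the program name here is made up):

```
! Report how much device memory is actually in use.
program gpu_meminfo
  use cudafor
  implicit none
  integer :: istat
  integer(kind=cuda_count_kind) :: free_b, total_b
  istat = cudaMemGetInfo(free_b, total_b)
  if (istat /= cudaSuccess) stop 'cudaMemGetInfo failed'
  print '(a,f8.1,a,f8.1,a)', 'device memory: ',          &
        (total_b - free_b)/1024.0**2, ' MB in use of ',   &
        total_b/1024.0**2, ' MB total'
end program gpu_meminfo
```

If this reports only a modest amount in use, the tens of gigabytes that top shows are address-space reservation by the runtime rather than real allocations.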

Also, the overall elapsed time of both of these CUDA-enabled runs (~3.5 min) is much greater than that of even the serial version (~40 sec). I now believe this is related to the fact that the CUDA-enabled executable requests gigabytes of virtual memory; exactly how much seems to depend on the host hardware (76 GB on the Opteron 6234 and 28 GB on the Phenom II). I do not know how the graphics card is related to all that: all my Opteron 6xxx machines have a GTX 460, and Quadro 5000s are only available on Phenom II machines.
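
Since nvprof only accounts for time the GPU itself is busy, it may be worth timing the launch loop from the host to see where the remaining wall-clock time goes (per-launch overhead over ~23,000 launches, allocation, paging, etc.). A schematic fragment to wrap around the existing launch loop, assuming cudafor provides cudaDeviceSynchronize (older releases use cudaThreadSynchronize instead):

```
! Host-side wall-clock timing around the GPU section (fragment, not a
! complete program; drop it around your existing launch loop).
integer :: istat, t0, t1, rate
call system_clock(t0, rate)
! ... the loop that launches wavefundercuda / primdercuda ...
istat = cudaDeviceSynchronize()   ! wait until all queued kernels finish
call system_clock(t1)
print '(a,f8.2,a)', 'GPU section wall-clock: ', real(t1 - t0)/rate, ' s'
```

If this host-side number is close to the 3.5-minute total while nvprof reports only ~7 s of GPU time, the overhead sits in the launches and copies themselves rather than in the kernels.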

I know it may sound strange, but do you think you could possibly try running my example on one of your machines (preferably GTX 460 and OpenSUSE 12.1 x86_64)? I would really like to know whether there is a problem with my OpenSUSE installation, my hardware, or my programming.

Thanks,
Anatoliy

I know it may sound strange, but do you think you could possibly try running my example on one of your machines (preferably GTX 460 and OpenSUSE 12.1 x86_64)?

Not strange at all. Please send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me (Mat).

Hopefully I can reproduce the problem, since if there is a compiler issue we want to know about it. If I can’t, I’ll update my CUDA driver to match yours and see if I can get my IT guys to install OpenSuSE 12.1 on my GTX460 system.

  • Mat