Is a particular CUDA toolkit version suitable for a particular application?

Dear users and developers,

Currently I am using two Tesla K40m cards for my computational work with the Quantum ESPRESSO (QE) suite (http://www.quantum-espresso.org/). My GPU-enabled QE code runs much slower than the normal CPU version. My question is: will a particular application be fast only with certain versions of the CUDA toolkit, or is there some other reason (e.g. memory) hindering the GPU's performance?

(P.S.: We don't have an InfiniBand HCA adapter in the server.)

Current details of the server are:

Server: FUJITSU PRIMERGY RX2540 M2
CUDA version: 9.0
Open MPI version: 2.0.4 (with Intel MKL libraries)
QE-GPU version: 5.4.0

########### SERVER DETAILS ##############
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel® Xeon® CPU E5-2640 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 1200.375
CPU max MHz: 3400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4791.49
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

############# DEVICE DETAILS ############
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS

######## nvidia-smi output #############

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:02:00.0 Off |                    0 |
| N/A   40C    P0    76W / 235W |  11381MiB / 11439MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:81:00.0 Off |                    0 |
| N/A   43C    P0    76W / 235W |  11380MiB / 11439MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+

Thank you

Phanikumar

A GPU-accelerated application should see speed-up on a K40 vs a multi-core CPU. Double-check that you are building a release build, not a debug build (the performance difference can easily be 10x). The nvidia-smi output shown above suggests that the app is not actually running on the GPU, as a power consumption of 76W and the temperature of 43 deg C is not consistent with a compute application utilizing it. I would expect >= 150W and >= 65 deg C in that case. Check the application’s configuration settings.
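A quick way to check this while the job is running is to poll the GPUs from the command line. This is just a sketch using standard nvidia-smi query flags, guarded so it degrades gracefully if nvidia-smi is not on the PATH:

```shell
# Sketch: query utilization, power draw, and temperature while the job runs
# (standard nvidia-smi query flags). A compute-bound K40m should sit near
# 100% utilization, well above the ~76 W draw shown in the output above.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu,power.draw,temperature.gpu \
             --format=csv
else
  echo "nvidia-smi not on PATH -- check the driver installation"
fi
```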

To answer the question from the subject line: CUDA 9 supports all GPUs with compute capability >= 3.0; your GPUs have compute capability 3.5. If the application is incompatible with CUDA 9 (unlikely), you should either see a build failure or a run-time error. Your post indicates neither. You might want to seek guidance from the application vendor. They probably provide an online forum or mailing list for the purpose of supporting application users.

Thank you njuffa for your comments

Yes, I contacted the application users through the forum (pw-forum). They suggested the combination below (1+2+3) for QE-GPU, which they are using:

  1. Intel PSXE 2017

  2. CUDA 6.5 or 7.0

  3. Centos 7.1

Does the OS also make any difference?

And the most important question: you said "the app is not actually running on the GPU, as a power consumption of 76W and the temperature of 43 deg C is not consistent with a compute application utilizing it". Is this a hardware issue or a software issue?

The top command on my server shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20022 xxxxx 20 0 0.158t 425392 154152 R 100.7 0.3 1463:02 pw-gpu.x

What is this '0.158t'? This number looks strange compared to other servers.

Thank you once again

Phanikumar

Please use code blocks for program output; it becomes unreadable otherwise. The 0.158t could be 0.158 TB (terabytes), assuming it lines up with a memory measurement.

As I said: check the application's configuration settings. If it in fact offers GPU acceleration, there should be a way to specify "run with GPU acceleration". That might be a command-line switch, a setting in a configuration file, or a menu item you select in a GUI.

The people who can give you this information are the people who provide the application, not the CUDA users in these forums.

Here is some example data from nvidia-smi -q while my GPU is being used by a compute application:

Utilization
Gpu                         : 97 %
GPU Current Temp            : 82 C
GPU Shutdown Temp           : 101 C
GPU Slowdown Temp           : 96 C
Power Draw                  : 27.44 W
Power Limit                 : 39.50 W

Your numbers will be different, because you have a different GPU. You can see that the GPU utilization is close to 100% and that temperature and power consumption are >= 70% of maximum.
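If you want to automate that check, here is a minimal sketch. The 50% threshold is an arbitrary choice for "actively computing", and the canned printf data stands in for real nvidia-smi query output:

```shell
#!/bin/sh
# Sketch: flag GPUs whose utilization is below a threshold. Input lines have
# the form "index, utilization", as produced by:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
check_idle() {
  while IFS=', ' read -r idx util _; do
    if [ "$util" -lt 50 ]; then        # 50% is an arbitrary cutoff
      echo "GPU $idx looks idle (${util}% utilization)"
    fi
  done
}

# Canned example data standing in for live nvidia-smi output:
printf '0, 88\n1, 12\n' | check_idle   # reports GPU 1 as idle
```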

thank you njuffa for your quick reply

I will get back to you after I contact the application forum people

Phanikumar

Does the app come with documentation (e.g. a manual)? If so, it would be good to study that carefully before posting in app-relevant forums.

Yes, it has documentation, but there is no information regarding GPU installation. Here is the link: http://www.quantum-espresso.org/wp-content/uploads/Doc/user_guide.pdf

But they provided this on GitHub: https://github.com/fspiga/qe-gpu

Thank you
Phanikumar

Have you built the app using PGI Fortran, as indicated in the README file on GitHub?
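A quick way to tell is to check whether the PGI compiler driver is on the PATH at all. A small sketch (pgfortran is the usual PGI driver name; the exact path or module will depend on your install):

```shell
# Sketch: report whether a PGI Fortran compiler is available.
if command -v pgfortran >/dev/null 2>&1; then
  pgfortran --version    # print the PGI release that would build QE-GPU
else
  echo "pgfortran not found -- the QE-GPU build expects the PGI toolchain"
fi
```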

Thank you tera

I thought that it (the PGI compiler) would be installed along with the CUDA toolkit 9.0, so I didn't install it separately. If the compiler were not installed, I would expect to see some error, but in my case there wasn't one (my intuition). I concluded this because I previously contacted NVIDIA customer care: they asked me for a bug report and said there was no problem with the installation. If the PGI compiler installation matters, can you please explain its importance?

P.S.: I don't know what version of the PGI compiler I used when I installed.

Thank you

Phanikumar