Hi, I’m brand new to CUDA, so please forgive any errors.
We’re porting our custom CUDA app to a new server equipped with Tesla M2090 GPUs; this is the output of the nvidia-smi -q command:
GPU 0:3:0
Product Name : Tesla M2090
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324111048864
GPU UUID : GPU-c6fc1d7fa74cb72d-86faeca1-8177e5ea-2bfd9939-071f0a08b251a4465dfc398f
Inforom Version
OEM Object : 1.1
ECC Object : 2.0
Power Management Object : 4.0
PCI
Bus : 3
Device : 0
Domain : 0
Device Id : 109110DE
Bus Id : 0:3:0
Fan Speed : N/A
Memory Usage
Total : 5375 Mb
Used : 9 Mb
Free : 5365 Mb
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Temperature
Gpu : N/A
Power Readings
Power State : P12
Power Management : Supported
Power Draw : 32.74 W
Power Limit : 225 W
Clocks
Graphics : 50 MHz
SM : 101 MHz
Memory : 135 MHz
This is the output of the same command on the “old” server:
GPU 0000:1F:00.0
Product Name : Tesla M2070-Q
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324710079394
GPU UUID : GPU-9706f3ef-15ec-8d5a-0a3d-dcc70685820e
VBIOS Version : 70.00.4E.00.05
Inforom Version
OEM Object : 1.0
ECC Object : 1.0
Power Management Object : 1.0
PCI
Bus : 0x1F
Device : 0x00
Domain : 0x0000
Device Id : 0x06DF10DE
Bus Id : 0000:1F:00.0
Sub System Id : 0x084D10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Memory Usage
Total : 5375 MB
Used : 10 MB
Free : 5365 MB
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : 0
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : 0
Temperature
Gpu : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Clocks
Graphics : 270 MHz
SM : 540 MHz
Memory : 1566 MHz
Max Clocks
Graphics : 573 MHz
SM : 1147 MHz
Memory : 1566 MHz
Compute Processes : None
The problem we have is that a test (a Java application) runs in 25-30 seconds on the old server, while the same test takes 150-180 seconds on the new one, that is, 5-6 times slower.
Is there any configuration tweak that we forgot on the way?
Or do we have to re-compile the application?
M2070 and M2090 are both Fermi parts with sm_20 architecture, and the M2090 should be a tad faster.
Given that you appear to be timing at application level, it seems the first thing you would want to do is use a profiler on both CPU and GPU to determine where the additional time is spent. It may have nothing to do with the GPU at all.
FWIW, the differently-formatted output from nvidia-smi could be an indication that different CUDA versions are being run on the two machines. It would probably be best to minimize any differences between the two platforms while you are narrowing down the source of the performance differences.
The nvidia-smi output on both machines shows that persistence mode is enabled, so that is one configuration issue that can be excluded. Check for performance-relevant CUDA environment variables like CUDA_LAUNCH_BLOCKING or CUDA_FORCE_PTX_JIT that may be set.
I would suggest double checking the power connectors and cooling of the M2090 to make sure there is no unwanted power-capping kicking in (although the 5x-6x performance difference seems too large for that to be the root cause).
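If you want to take the Java layer out of the equation, a small standalone program timed with CUDA events will show whether raw kernel execution actually differs between the two GPUs. Here is a minimal sketch (the saxpy kernel and problem size are purely illustrative, not taken from your application):

// Minimal standalone timing sketch (illustrative kernel and sizes only):
// times one kernel launch with CUDA events so raw GPU execution speed can be
// compared between the two machines without the Java application in the loop.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 24;                      // ~16M elements, illustrative
    float *x = 0, *y = 0;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("saxpy kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

If this reports comparable times on both machines, the slowdown is likely on the host side (JVM, JNI, data transfers) rather than in the kernels themselves.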
Thanks njuffa for the quick answer!
I’ve checked on both machines, and the CUDA lib version is the same (4.0.17), at least as far as I can tell from the contents of the /usr/local/cuda/lib64 directory.
Persistence mode and exclusive compute mode were set by us on both machines, so we can exclude those as a cause of the problem.
Also, I’ve found no CUDA* environment variables.
Below is the output of a CUDA MFLOPS benchmark run on both servers (the old M2070-Q server first, then the new M2090 one):
####################################################
getDetails and MHz
Assembler CPUID and RDTSC
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206C2
Intel(R) Xeon(R) CPU E5649 @ 2.53GHz
Measured - Minimum 2533 MHz, Maximum 2533 MHz
Linux Functions
get_nprocs() - CPUs 24, Configured CPUs 24
get_phys_pages() and size - RAM Size 94.38 GB, Page Size 4096 Bytes
uname() - Linux, *****, 2.6.18-371.4.1.el5
#1 SMP Wed Jan 8 18:42:07 EST 2014, x86_64
##########################################
Linux CUDA 3.2 x64 64 Bits DP MFLOPS Benchmark 1.4 Tue Nov 18 12:43:31 2014
CUDA devices found
Device 0: Tesla M2070-Q with 14 Processors 112 cores
Global Memory 5249 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test | 8-Byte Words | Ops/Wd | Repeat Passes | Seconds | MFLOPS | First Results | All Same
Data in & out 100000 2 2500 2.807184 178 0.9294744580218 Yes
Data out only 100000 2 2500 1.731228 289 0.9294744580218 Yes
Calculate only 100000 2 2500 0.249827 2001 0.9294744580218 Yes
Data in & out 1000000 2 250 1.712072 292 0.9925431921162 Yes
Data out only 1000000 2 250 0.955676 523 0.9925431921162 Yes
Calculate only 1000000 2 250 0.058322 8573 0.9925431921162 Yes
Data in & out 10000000 2 25 1.529462 327 0.9992492055877 Yes
Data out only 10000000 2 25 0.880241 568 0.9992492055877 Yes
Calculate only 10000000 2 25 0.038980 12827 0.9992492055877 Yes
Data in & out 100000 8 2500 2.511802 796 0.9571642109917 Yes
Data out only 100000 8 2500 1.519311 1316 0.9571642109917 Yes
Calculate only 100000 8 2500 0.258445 7739 0.9571642109917 Yes
Data in & out 1000000 8 250 1.724740 1160 0.9955252302690 Yes
Data out only 1000000 8 250 0.958417 2087 0.9955252302690 Yes
Calculate only 1000000 8 250 0.059035 33878 0.9955252302690 Yes
Data in & out 10000000 8 25 1.530984 1306 0.9995496465632 Yes
Data out only 10000000 8 25 0.886000 2257 0.9995496465632 Yes
Calculate only 10000000 8 25 0.038900 51414 0.9995496465632 Yes
Data in & out 100000 32 2500 2.545866 3142 0.8903768345465 Yes
Data out only 100000 32 2500 1.577256 5072 0.8903768345465 Yes
Calculate only 100000 32 2500 0.317353 25209 0.8903768345465 Yes
Data in & out 1000000 32 250 1.743791 4588 0.9881014965491 Yes
Data out only 1000000 32 250 0.984076 8129 0.9881014965491 Yes
Calculate only 1000000 32 250 0.083059 96317 0.9881014965491 Yes
Data in & out 10000000 32 25 1.554134 5148 0.9987993043723 Yes
Data out only 10000000 32 25 0.899483 8894 0.9987993043723 Yes
Calculate only 10000000 32 25 0.059315 134873 0.9987993043723 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.011854 42179 0.9992492055877 Yes
Shared Memory 10000000 2 25 0.007393 67633 0.9992492055877 Yes
Calculate 10000000 8 25 0.020068 99660 0.9995496465632 Yes
Shared Memory 10000000 8 25 0.016334 122444 0.9995496465632 Yes
Calculate 10000000 32 25 0.055090 145217 0.9987993043723 Yes
Shared Memory 10000000 32 25 0.052313 152925 0.9987993043723 Yes
####################################################
getDetails and MHz
Assembler CPUID and RDTSC
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
Measured - Minimum 2100 MHz, Maximum 2100 MHz
Linux Functions
get_nprocs() - CPUs 24, Configured CPUs 24
get_phys_pages() and size - RAM Size 31.38 GB, Page Size 4096 Bytes
uname() - Linux, *****, 2.6.18-371.6.1.el5
#1 SMP Tue Feb 18 11:42:11 EST 2014, x86_64
##########################################
Linux CUDA 3.2 x64 64 Bits DP MFLOPS Benchmark 1.4 Tue Nov 18 04:54:08 2014
CUDA devices found
Device 0: Tesla M2090 with 16 Processors 128 cores
Global Memory 5249 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test | 8-Byte Words | Ops/Wd | Repeat Passes | Seconds | MFLOPS | First Results | All Same
Data in & out 100000 2 2500 2.463131 203 0.9294744580218 Yes
Data out only 100000 2 2500 1.194109 419 0.9294744580218 Yes
Calculate only 100000 2 2500 0.096801 5165 0.9294744580218 Yes
Data in & out 1000000 2 250 1.774478 282 0.9925431921162 Yes
Data out only 1000000 2 250 0.827365 604 0.9925431921162 Yes
Calculate only 1000000 2 250 0.037820 13220 0.9925431921162 Yes
Data in & out 10000000 2 25 1.858759 269 0.9992492055877 Yes
Data out only 10000000 2 25 0.778925 642 0.9992492055877 Yes
Calculate only 10000000 2 25 0.031232 16009 0.9992492055877 Yes
Data in & out 100000 8 2500 2.462919 812 0.9571642109917 Yes
Data out only 100000 8 2500 1.197777 1670 0.9571642109917 Yes
Calculate only 100000 8 2500 0.105909 18884 0.9571642109917 Yes
Data in & out 1000000 8 250 1.776754 1126 0.9955252302690 Yes
Data out only 1000000 8 250 0.826855 2419 0.9955252302690 Yes
Calculate only 1000000 8 250 0.038410 52070 0.9955252302690 Yes
Data in & out 10000000 8 25 1.857582 1077 0.9995496465632 Yes
Data out only 10000000 8 25 0.779036 2567 0.9995496465632 Yes
Calculate only 10000000 8 25 0.031213 64076 0.9995496465632 Yes
Data in & out 100000 32 2500 2.492104 3210 0.8903768345465 Yes
Data out only 100000 32 2500 1.253178 6384 0.8903768345465 Yes
Calculate only 100000 32 2500 0.161453 49550 0.8903768345465 Yes
Data in & out 1000000 32 250 1.792488 4463 0.9881014965491 Yes
Data out only 1000000 32 250 0.844827 9469 0.9881014965491 Yes
Calculate only 1000000 32 250 0.055455 144261 0.9881014965491 Yes
Data in & out 10000000 32 25 1.873346 4270 0.9987993043723 Yes
Data out only 10000000 32 25 0.793836 10078 0.9987993043723 Yes
Calculate only 10000000 32 25 0.044962 177927 0.9987993043723 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.011392 43890 0.9992492055877 Yes
Shared Memory 10000000 2 25 0.005632 88780 0.9992492055877 Yes
Calculate 10000000 8 25 0.015511 128940 0.9995496465632 Yes
Shared Memory 10000000 8 25 0.012508 159899 0.9995496465632 Yes
Calculate 10000000 32 25 0.042415 188612 0.9987993043723 Yes
Shared Memory 10000000 32 25 0.040270 198659 0.9987993043723 Yes
How old is CUDA lib version 4.0.17? The version numbering differs somewhat across platforms as far as I know, but this seems really old, since my Windows machine reports the CUDA driver version as 6.5.20 (software I installed about a month ago).
Have you had a chance to try profiling to confirm that the slowdown is due to the GPU and not some other component?
The driver versions on the two machines are probably different. You may want to make sure both machines have the same GPU driver; this can be checked with nvidia-smi as well.
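To compare programmatically, a short program using the standard cudaDriverGetVersion / cudaRuntimeGetVersion runtime calls will print what each machine actually loads; a minimal sketch:

// Minimal sketch: print the CUDA driver and runtime versions seen by the
// runtime API, as a cross-check against what nvidia-smi reports.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    // Values are encoded as 1000*major + 10*minor, e.g. 4000 means CUDA 4.0.
    printf("CUDA driver version : %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("CUDA runtime version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}

Compile with nvcc and run on both servers; a mismatch would point to different driver or toolkit installations.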
SOLVED!
The (third-party) test program we’re using is a Java test suite that interfaces to CUDA through JNI calls to a custom C++ library.
On the well-performing server we were using the Sun JDK, while on the other one we were using OpenJDK: once both were aligned to the Sun JDK, the performance aligned too.
The most likely cause is that the C++ library was written with the Sun JDK in mind; we’re not sure about that yet and are still investigating.
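For anyone who runs into something similar: the setup is the usual JNI pattern, where the Java side declares a native method and the C++ side does the CUDA work. A purely illustrative sketch of such a bridge (class, method and kernel names are invented, not the actual library):

// Purely illustrative JNI bridge: the Java side would declare
//     package com.example;
//     public class GpuTest { public static native void runKernel(float[] data); }
// and the native side below copies the array to the GPU, runs a kernel and
// copies the result back.
#include <jni.h>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

extern "C" JNIEXPORT void JNICALL
Java_com_example_GpuTest_runKernel(JNIEnv *env, jclass, jfloatArray data)
{
    jsize n = env->GetArrayLength(data);
    jfloat *host = env->GetFloatArrayElements(data, NULL);   // may copy or pin

    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<((int)n + 255) / 256, 256>>>(dev, (int)n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Write results back into the Java array (if we got a copy) and release it.
    env->ReleaseFloatArrayElements(data, host, 0);
}

If most of the time goes into the JNI array handling rather than the CUDA calls, the choice of JVM can matter a lot, which would be consistent with what we observed.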