CUDA test performance issue

Hi, I’m brand new to CUDA, so please forgive any errors.

We’re porting our custom CUDA app to a new server equipped with Tesla M2090 GPUs. This is the output of the nvidia-smi -q command:

GPU 0:3:0
    Product Name                : Tesla M2090
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324111048864
    GPU UUID                    : GPU-c6fc1d7fa74cb72d-86faeca1-8177e5ea-2bfd9939-071f0a08b251a4465dfc398f
    Inforom Version
        OEM Object              : 1.1
        ECC Object              : 2.0
        Power Management Object : 4.0
    PCI
        Bus                     : 3
        Device                  : 0
        Domain                  : 0
        Device Id               : 109110DE
        Bus Id                  : 0:3:0
    Fan Speed                   : N/A
    Memory Usage
        Total                   : 5375 Mb
        Used                    : 9 Mb
        Free                    : 5365 Mb
    Compute Mode                : Exclusive_Process
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
        Aggregate
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
    Temperature
        Gpu                     : N/A
    Power Readings
        Power State             : P12
        Power Management        : Supported
        Power Draw              : 32.74 W
        Power Limit             : 225 W
    Clocks
        Graphics                : 50 MHz
        SM                      : 101 MHz
        Memory                  : 135 MHz

This is the output of the same command on the “old” server:

GPU 0000:1F:00.0
    Product Name                : Tesla M2070-Q
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324710079394
    GPU UUID                    : GPU-9706f3ef-15ec-8d5a-0a3d-dcc70685820e
    VBIOS Version               : 70.00.4E.00.05
    Inforom Version
        OEM Object              : 1.0
        ECC Object              : 1.0
        Power Management Object : 1.0
    PCI
        Bus                     : 0x1F
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x06DF10DE
        Bus Id                  : 0000:1F:00.0
        Sub System Id           : 0x084D10DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 2
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Memory Usage
        Total                   : 5375 MB
        Used                    : 10 MB
        Free                    : 5365 MB
    Compute Mode                : Exclusive_Process
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : 0
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : 0
    Temperature
        Gpu                     : N/A
    Power Readings
        Power Management        : N/A
        Power Draw              : N/A
        Power Limit             : N/A
    Clocks
        Graphics                : 270 MHz
        SM                      : 540 MHz
        Memory                  : 1566 MHz
    Max Clocks
        Graphics                : 573 MHz
        SM                      : 1147 MHz
        Memory                  : 1566 MHz
    Compute Processes           : None

The problem is that a test (a Java application) takes 25-30 seconds on the old server, while the same test takes 150-180 seconds on the new one, that is, 5-6 times slower.

Is there any configuration tweak that we forgot on the way?
Or do we have to re-compile the application?

Thanks a lot!

M2070 and M2090 are both Fermi parts with sm_20 architecture, and the M2090 should be a tad faster.

Given that you appear to be timing at application level, it seems the first thing you would want to do is use a profiler on both CPU and GPU to determine where the additional time is spent. It may have nothing to do with the GPU at all.
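Before reaching for a full profiler, a coarse first pass is to time the phases of the test separately on both machines. The following is only a minimal sketch of that idea: the class name PhaseTimer, the phase names, and the pause() stand-in workloads are all hypothetical and would need to be replaced with the real setup code and the real calls into the native library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Coarse wall-clock timing of named phases (hypothetical helper, not part of any CUDA API). */
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the task and adds its elapsed wall-clock time to the named phase. */
    void time(String phase, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated milliseconds for a phase, or 0 if the phase was never timed. */
    double millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1e6;
    }

    /** One line per phase, in the order the phases were first timed. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%-12s %10.3f ms%n", e.getKey(), e.getValue() / 1e6));
        }
        return sb.toString();
    }

    /** Hypothetical stand-in workload so the sketch runs on its own. */
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Assumed phases: replace with the real host-side setup and the real native (JNI) calls.
        timer.time("host-setup", () -> pause(20));
        timer.time("native-call", () -> pause(40));
        System.out.print(timer.report());
    }
}
```

If one phase accounts for most of the extra time on the slow machine, that is where a real CPU or GPU profiler is worth pointing next.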

FWIW, the differently-formatted output from nvidia-smi could be an indication that different CUDA versions are being run on the two machines. It would probably be best to minimize any differences between the two platforms while you are narrowing down the source of the performance differences.

The nvidia-smi output on both machines shows that persistence mode is enabled, so that is one configuration issue that can be excluded. Check for performance-relevant CUDA environment variables like CUDA_LAUNCH_BLOCKING or CUDA_FORCE_PTX_JIT that may be set.
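Since the test is launched as a Java application, the environment the JVM actually sees can differ from the login shell (wrapper scripts, service managers and so on). A minimal sketch for listing CUDA-prefixed variables from inside the JVM follows; the class name CudaEnvCheck and the plain "CUDA" prefix filter are illustrative assumptions, not an official list of variables.

```java
import java.util.Map;
import java.util.TreeMap;

/** Lists environment variables whose names start with "CUDA" (hypothetical helper). */
final class CudaEnvCheck {

    /** Returns the entries of env whose names start with "CUDA", sorted by name. */
    static Map<String, String> cudaVariables(Map<String, String> env) {
        Map<String, String> found = new TreeMap<>();
        for (Map.Entry<String, String> e : env.entrySet()) {
            if (e.getKey().startsWith("CUDA")) {
                found.put(e.getKey(), e.getValue());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // System.getenv() is the environment of this JVM process, not of the login shell.
        Map<String, String> found = cudaVariables(System.getenv());
        if (found.isEmpty()) {
            System.out.println("No CUDA* environment variables are set for this JVM.");
        }
        for (Map.Entry<String, String> e : found.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Running it the same way the test suite is launched on each server makes the two environments directly comparable.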

I would suggest double checking the power connectors and cooling of the M2090 to make sure there is no unwanted power-capping kicking in (although the 5x-6x performance difference seems too large for that to be the root cause).

Thanks njuffa for the quick answer!
I’ve checked both machines, and the CUDA library version is the same (4.0.17), at least as far as I can see from the contents of the /usr/local/cuda/lib64 directory.
We set persistence mode and the exclusive-process compute mode ourselves on both machines, so we can exclude them as a cause of the problem.
Also, I found no CUDA* environment variables.

I ran the benchmark from http://www.roylongbottom.org.uk/linux_cuda_mflops.htm, and the results show a small improvement on the new server, where the M2090 is installed:

####################################################
  getDetails and MHz


  Assembler CPUID and RDTSC
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206C2
  Intel(R) Xeon(R) CPU           E5649  @ 2.53GHz
  Measured - Minimum 2533 MHz, Maximum 2533 MHz
  Linux Functions
  get_nprocs() - CPUs 24, Configured CPUs 24
  get_phys_pages() and size - RAM Size 94.38 GB, Page Size 4096 Bytes
  uname() - Linux, *****, 2.6.18-371.4.1.el5
  #1 SMP Wed Jan 8 18:42:07 EST 2014, x86_64

 ##########################################

  Linux CUDA 3.2 x64 64 Bits DP MFLOPS Benchmark 1.4 Tue Nov 18 12:43:31 2014


  CUDA devices found
  Device 0: Tesla M2070-Q  with 14 Processors 112 cores
  Global Memory 5249 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            8 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  2.807184      178   0.9294744580218  Yes
 Data out only    100000    2    2500  1.731228      289   0.9294744580218  Yes
 Calculate only   100000    2    2500  0.249827     2001   0.9294744580218  Yes

 Data in & out   1000000    2     250  1.712072      292   0.9925431921162  Yes
 Data out only   1000000    2     250  0.955676      523   0.9925431921162  Yes
 Calculate only  1000000    2     250  0.058322     8573   0.9925431921162  Yes

 Data in & out  10000000    2      25  1.529462      327   0.9992492055877  Yes
 Data out only  10000000    2      25  0.880241      568   0.9992492055877  Yes
 Calculate only 10000000    2      25  0.038980    12827   0.9992492055877  Yes

 Data in & out    100000    8    2500  2.511802      796   0.9571642109917  Yes
 Data out only    100000    8    2500  1.519311     1316   0.9571642109917  Yes
 Calculate only   100000    8    2500  0.258445     7739   0.9571642109917  Yes

 Data in & out   1000000    8     250  1.724740     1160   0.9955252302690  Yes
 Data out only   1000000    8     250  0.958417     2087   0.9955252302690  Yes
 Calculate only  1000000    8     250  0.059035    33878   0.9955252302690  Yes

 Data in & out  10000000    8      25  1.530984     1306   0.9995496465632  Yes
 Data out only  10000000    8      25  0.886000     2257   0.9995496465632  Yes
 Calculate only 10000000    8      25  0.038900    51414   0.9995496465632  Yes

 Data in & out    100000   32    2500  2.545866     3142   0.8903768345465  Yes
 Data out only    100000   32    2500  1.577256     5072   0.8903768345465  Yes
 Calculate only   100000   32    2500  0.317353    25209   0.8903768345465  Yes

 Data in & out   1000000   32     250  1.743791     4588   0.9881014965491  Yes
 Data out only   1000000   32     250  0.984076     8129   0.9881014965491  Yes
 Calculate only  1000000   32     250  0.083059    96317   0.9881014965491  Yes

 Data in & out  10000000   32      25  1.554134     5148   0.9987993043723  Yes
 Data out only  10000000   32      25  0.899483     8894   0.9987993043723  Yes
 Calculate only 10000000   32      25  0.059315   134873   0.9987993043723  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.011854    42179   0.9992492055877  Yes
 Shared Memory  10000000    2      25  0.007393    67633   0.9992492055877  Yes

 Calculate      10000000    8      25  0.020068    99660   0.9995496465632  Yes
 Shared Memory  10000000    8      25  0.016334   122444   0.9995496465632  Yes

 Calculate      10000000   32      25  0.055090   145217   0.9987993043723  Yes
 Shared Memory  10000000   32      25  0.052313   152925   0.9987993043723  Yes
####################################################
  getDetails and MHz


  Assembler CPUID and RDTSC
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
        Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  Measured - Minimum 2100 MHz, Maximum 2100 MHz
  Linux Functions
  get_nprocs() - CPUs 24, Configured CPUs 24
  get_phys_pages() and size - RAM Size 31.38 GB, Page Size 4096 Bytes
  uname() - Linux, *****, 2.6.18-371.6.1.el5
  #1 SMP Tue Feb 18 11:42:11 EST 2014, x86_64

 ##########################################

  Linux CUDA 3.2 x64 64 Bits DP MFLOPS Benchmark 1.4 Tue Nov 18 04:54:08 2014


  CUDA devices found
  Device 0: Tesla M2090  with 16 Processors 128 cores
  Global Memory 5249 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            8 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  2.463131      203   0.9294744580218  Yes
 Data out only    100000    2    2500  1.194109      419   0.9294744580218  Yes
 Calculate only   100000    2    2500  0.096801     5165   0.9294744580218  Yes

 Data in & out   1000000    2     250  1.774478      282   0.9925431921162  Yes
 Data out only   1000000    2     250  0.827365      604   0.9925431921162  Yes
 Calculate only  1000000    2     250  0.037820    13220   0.9925431921162  Yes

 Data in & out  10000000    2      25  1.858759      269   0.9992492055877  Yes
 Data out only  10000000    2      25  0.778925      642   0.9992492055877  Yes
 Calculate only 10000000    2      25  0.031232    16009   0.9992492055877  Yes

 Data in & out    100000    8    2500  2.462919      812   0.9571642109917  Yes
 Data out only    100000    8    2500  1.197777     1670   0.9571642109917  Yes
 Calculate only   100000    8    2500  0.105909    18884   0.9571642109917  Yes

 Data in & out   1000000    8     250  1.776754     1126   0.9955252302690  Yes
 Data out only   1000000    8     250  0.826855     2419   0.9955252302690  Yes
 Calculate only  1000000    8     250  0.038410    52070   0.9955252302690  Yes

 Data in & out  10000000    8      25  1.857582     1077   0.9995496465632  Yes
 Data out only  10000000    8      25  0.779036     2567   0.9995496465632  Yes
 Calculate only 10000000    8      25  0.031213    64076   0.9995496465632  Yes

 Data in & out    100000   32    2500  2.492104     3210   0.8903768345465  Yes
 Data out only    100000   32    2500  1.253178     6384   0.8903768345465  Yes
 Calculate only   100000   32    2500  0.161453    49550   0.8903768345465  Yes

 Data in & out   1000000   32     250  1.792488     4463   0.9881014965491  Yes
 Data out only   1000000   32     250  0.844827     9469   0.9881014965491  Yes
 Calculate only  1000000   32     250  0.055455   144261   0.9881014965491  Yes

 Data in & out  10000000   32      25  1.873346     4270   0.9987993043723  Yes
 Data out only  10000000   32      25  0.793836    10078   0.9987993043723  Yes
 Calculate only 10000000   32      25  0.044962   177927   0.9987993043723  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.011392    43890   0.9992492055877  Yes
 Shared Memory  10000000    2      25  0.005632    88780   0.9992492055877  Yes

 Calculate      10000000    8      25  0.015511   128940   0.9995496465632  Yes
 Shared Memory  10000000    8      25  0.012508   159899   0.9995496465632  Yes

 Calculate      10000000   32      25  0.042415   188612   0.9987993043723  Yes
 Shared Memory  10000000   32      25  0.040270   198659   0.9987993043723  Yes

Many thanks again!

Could you try running bandwidthTest from the samples folder and see what that says?

I’ve managed to run the bandwidthTest tool on both servers, and here are the results:

For the M2090:

[bandwidthTest] starting...

Running on...

 Device 0: Tesla M2090
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2736.4

 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2474.3

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     141302.8

[bandwidthTest] test results...
PASSED

For the M2070-Q:

[bandwidthTest] Starting...

Running on...

 Device 0: Tesla M2070-Q
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2956.3

 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2143.8

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     102680.5

[bandwidthTest] test results...
PASSED

Press ENTER to exit...

From what I can see, both GPUs deliver roughly the same bandwidth, so I don’t think the issue is here. Thanks for the hint, anyway.

How old is CUDA library version 4.0.17? As far as I know the version numbering differs somewhat across platforms, but this seems really old: my Windows machine reports the CUDA driver version as 6.5.20 (software I installed about a month ago).

Have you had a chance to try profiling to confirm that the slowdown is due to the GPU and not some other component?

The drivers on the two machines are probably different. You may want to make sure both machines have the same GPU driver. This can be checked with nvidia-smi as well.

SOLVED!
The (3rd-party) test program we’re using is a Java test suite that interfaces with CUDA through JNI calls to a custom C++ library.
On the well-performing server we’re using the Sun JDK, while on the other one we were using OpenJDK; once we aligned both to the Sun JDK, performance aligned too.
The most likely cause is that the C++ library was written with the Sun JDK in mind, but we’re not sure about that and are still investigating.
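For anyone comparing two machines in the same way, a minimal sketch for recording which JVM is actually running the test is shown here. It only reads standard Java system properties; the class name JvmInfo and the choice of properties are illustrative assumptions.

```java
/** Prints the JVM identity so two machines can be compared (hypothetical helper). */
final class JvmInfo {
    // Standard system property keys; the selection is an assumption about what is useful here.
    static final String[] KEYS = {
        "java.vendor", "java.version", "java.vm.name",
        "java.vm.vendor", "java.vm.version", "java.home", "os.arch"
    };

    /** Returns one "key = value" line per property in KEYS. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (String key : KEYS) {
            sb.append(key).append(" = ")
              .append(System.getProperty(key))
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it with the same java binary that launches the test suite and diffing the output between the servers shows a JDK mismatch right away.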

Thanks to all for your support!