Poor Memcpy Performance Copying To Pinned Memory On Host

I have run into an issue on our new K20X cards when transferring data from host to device that I have not seen with the K1000M in my laptop.

The problem

I am seeing poor performance when copying ~500MB from a non-pinned host buffer to a pinned host buffer. This ~500MB takes ~300ms to complete.

I have been told this strategy is called “staging”: you set up a pinned buffer, copy your data into the pinned buffer, then copy the data from the pinned buffer down to the device (and vice versa).
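Roughly, the staging pattern looks like this (a simplified sketch, not my actual code; names are placeholders and error checking is omitted):

#include <cuda_runtime.h>
#include <cstring>

// Simplified sketch of the "staging" pattern described above.
void stagedUpload(const char* p_pageable_src, char* d_dst, size_t bytes)
{
  char* p_pinned = NULL;
  cudaMallocHost((void**)&p_pinned, bytes);   // allocate the pinned staging buffer

  memcpy(p_pinned, p_pageable_src, bytes);    // pageable -> pinned (this is the slow step for me)
  cudaMemcpy(d_dst, p_pinned, bytes,
             cudaMemcpyHostToDevice);         // pinned -> device (~80ms for ~500MB)

  cudaFreeHost(p_pinned);
}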

Copying from pinned memory to the device is very quick, completing in ~80ms for ~500MB.

Copying from device to pinned memory exhibits the same performance as copying pinned to device, ~80ms for ~500MB.

Copying from pinned memory to a non-pinned buffer takes ~200ms (a bit faster than the other direction) to complete for ~500MB.

Things that didn’t work

  • I enabled persistence mode for the CUDA driver: sudo nvidia-smi -pm 1. No change.
  • I have 3x K20X cards; I have tried using cudaSetDevice() and running the program on each of the three cards. No change.
  • I tried using both cudaMemcpy() with cudaMemcpyHostToHost and regular C memcpy(). No change.
  • I tried registering the unpinned host buffers

Odd solution

  • I tried registering/unregistering the unpinned host buffers to turn them into pinned buffers. Registering takes about 15ms per buffer and unregistering ~20ms per buffer. The memcpy to the device then ran just as fast as the pinned memcpy above (~80ms). This reduced the total time of the transfer significantly without using the "staging" buffers. A minimal sketch of this pattern is shown below.
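A minimal sketch of what I mean (simplified; the real code checks every CUDA return value):

#include <cuda_runtime.h>

// Register the existing pageable buffer, copy at pinned speed, then unregister.
void registeredUpload(char* p_host, char* d_dst, size_t bytes)
{
  cudaHostRegister(p_host, bytes, cudaHostRegisterDefault); // ~15ms per buffer for me
  cudaMemcpy(d_dst, p_host, bytes, cudaMemcpyHostToDevice); // runs at pinned speed (~80ms for ~500MB)
  cudaHostUnregister(p_host);                               // ~20ms per buffer
}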

Question(s)

  • I was told that register/unregister was slow and should not be used, and that "staging" buffers are the preferred mechanism for memcpy between host and device. Why is this the case?
  • Why am I seeing better performance by registering/unregistering, and is this normal?
  • Should I be doing this a different way?

System Specs

$ sudo nvidia-smi                                                          
Mon Mar 31 23:47:25 2014                                                                     
+------------------------------------------------------+                                     
| NVIDIA-SMI 331.38     Driver Version: 331.38         |                                     
|-------------------------------+----------------------+----------------------+              
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |              
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |              
|===============================+======================+======================|              
|   0  Tesla K20Xm         On   | 0000:03:00.0     Off |                    0 |              
| N/A   28C    P8    17W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:04:00.0     Off |                    0 |              
| N/A   29C    P8    17W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+              
|   2  Tesla K20Xm         On   | 0000:84:00.0     Off |                    0 |              
| N/A   25C    P8    18W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+
$ cat /proc/cpuinfo                                                   
processor       : 0                                                                          
vendor_id       : GenuineIntel                                                               
cpu family      : 6                                                                          
model           : 45                                                                         
model name      : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz                                   
stepping        : 7                                                                          
cpu MHz         : 1200.000                                                                   
cache size      : 20480 KB                                                                   
physical id     : 0                                                                          
siblings        : 16                                                                         
core id         : 0                                                                          
cpu cores       : 8                                                                          
apicid          : 0                                                                          
initial apicid  : 0                                                                          
fpu             : yes                                                                        
fpu_exception   : yes                                                                        
cpuid level     : 13                                                                         
wp              : yes                                                                        
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid                                                                                            
bogomips        : 5400.27                                                                    
clflush size    : 64                                                                         
cache_alignment : 64                                                                         
address sizes   : 46 bits physical, 48 bits virtual                                          
power management:

I have been under the impression due to some tests a few years ago that there is no advantage to staging. In particular, it was faster for me to copy from unpinned->device than from unpinned->pinned->device. (I believe that the driver or toolkit is staging into its own pinned memory if your memory is not pinned, but I have never confirmed that and do not know the granularity of that staging.)

Have you timed a regular unpinned->device transfer? For your case, I would expect that unpinned->device transfers would take roughly 160ms for 500 MB. If so, that would be faster than staging.

I am surprised by how quickly you were able to register memory. I had tested that approach in the past and the register/unregister cycle was slow enough so that transferring from unpinned->device was still faster, but that does not seem to be the case for you. Did you register a block of memory that had all been initialized, or was it uninitialized? My guess is that something about the test is not quite realistic (e.g., maybe the buffer is zero-initialized and all pages map to a single zero page), but I would be happy to hear that register/unregister is now that fast.

I could imagine cases where staging offered an advantage because it allowed asynchronous memory transfers when using streams. However, if you are not using streams and asynchronous transfers, and cannot use pinned memory without staging, then I would just recommend making sure that the unpinned memory is on the same NUMA node as the target GPU (given that you have 3 K20s, I would guess that this is a dual socket board with two NUMA nodes) and perform unpinned->device transfers from that NUMA-local memory.
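For example, an overlap-friendly version of staging could look something like this rough, untested sketch (the chunk size, names, and two-buffer scheme are just placeholders):

#include <cuda_runtime.h>
#include <cstring>
#include <algorithm>

// Double-buffered staging: the CPU fills one pinned chunk while the previous
// chunk is copied to the device asynchronously on its own stream.
void stagedAsyncUpload(const char* src_pageable, char* d_dst, size_t bytes)
{
  const size_t CHUNK = 8 << 20;  // 8 MB staging chunks (tunable)
  char* stage[2];
  cudaStream_t stream[2];
  for (int i = 0; i < 2; ++i) {
    cudaMallocHost((void**)&stage[i], CHUNK);
    cudaStreamCreate(&stream[i]);
  }

  size_t i = 0;
  for (size_t off = 0; off < bytes; off += CHUNK, ++i) {
    const int b = (int)(i & 1);
    const size_t n = std::min(CHUNK, bytes - off);
    cudaStreamSynchronize(stream[b]);                   // wait until this staging buffer is free
    memcpy(stage[b], src_pageable + off, n);            // CPU copy into pinned staging buffer
    cudaMemcpyAsync(d_dst + off, stage[b], n,
                    cudaMemcpyHostToDevice, stream[b]); // overlaps with the next CPU copy
  }

  for (int j = 0; j < 2; ++j) {
    cudaStreamSynchronize(stream[j]);
    cudaStreamDestroy(stream[j]);
    cudaFreeHost(stage[j]);
  }
}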

tbenson,

I have tried performing an unpinned->device memcpy and got the same results as the unpinned->pinned host memcpy: ~300ms for ~500MB.

The memory that I register has already been initialized and is filled with data just before entering my CUDA function, which performs the registration and memcpy.

My plan WAS to move my code to async copies and kernel executions using the staging memory. However, with the poor unpinned->pinned memcpy performance, it takes longer to perform these copies than it does for me to register, memcpy, execute the kernel, memcpy, and unregister in a serial fashion. I would still like to do this if I can figure out why the unpinned->pinned memcpy is so slow.

How can I verify that my memory is on the same NUMA node as my target GPU? I have tried running my code on each GPU by calling cudaSetDevice() for all 3 GPUs.

sixgig,

You can get the PCIe bus ID from either nvidia-smi or from lspci -vvv. If the K20 has bus ID 0000:03, for example, then you can find the NUMA node for that device in /sys/class/pci_bus/0000:03/device/numa_node. You can also check local_cpulist in that same directory to get a list of CPU cores that are local to the NUMA node. Then, you can run your test code using taskset with the CPU mask corresponding to one or more of the cores in the local node to ensure that memory allocations will be local to the GPU with that bus ID. For example, if your local cpus are 0-7, then running:

taskset 0x2 ./your_test_code

will run the test code on CPU core ID 1. Your performance for unpinned to device transfers is only ~1.6GB/s, which is pretty low. What results do you get running bandwidthTest from the toolkit samples?
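If you want to check this programmatically, something along the lines of the following untested sketch should work on Linux. It combines cudaDeviceGetPCIBusId() with the sysfs path above; lowercasing the bus ID string is just a precaution in case the runtime returns uppercase hex digits:

#include <cuda_runtime.h>
#include <cstdio>
#include <fstream>
#include <string>
#include <algorithm>
#include <cctype>

int main()
{
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int dev = 0; dev < count; ++dev) {
    char busId[32] = {0};
    cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);    // e.g. "0000:03:00.0"
    std::string id(busId);
    std::transform(id.begin(), id.end(), id.begin(), ::tolower);
    std::ifstream f(("/sys/bus/pci/devices/" + id + "/numa_node").c_str());
    int node = -1;
    f >> node;
    std::printf("CUDA device %d (%s) is on NUMA node %d\n", dev, busId, node);
  }
  return 0;
}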

I have written a test program to check my regular memcpy performance and to test out NUMA bindings using numactl.

The code creates a 1GB array, memsets it, then creates another 1GB array and performs a memcpy between them. Timings of the different steps are printed out to stdout.

To build:

$ mkdir build
$ cd build
$ cmake ..
$ make

To run:

$ ./big_memcpy_test

Test Code:
main.cpp

#include <chrono>
#include <cstdint>
#include <cstdio>   // snprintf
#include <cstring>
#include <iostream>
#include <string>

class Timer
{
 public:
  Timer()
      : mStart(),
        mStop()
  {
    update();
  }

  void update()
  {
    mStart = std::chrono::high_resolution_clock::now();
    mStop  = mStart;
  }

  double elapsedMs()
  {
    mStop = std::chrono::high_resolution_clock::now();
    std::chrono::milliseconds elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(mStop - mStart);
    return elapsed_ms.count();
  }

 private:
  std::chrono::high_resolution_clock::time_point mStart;
  std::chrono::high_resolution_clock::time_point mStop;
};

std::string formatBytes(std::uint64_t bytes)
{
  static const int num_suffix = 5;
  static const char* suffix[num_suffix] = { "B", "KB", "MB", "GB", "TB" };
  double dbl_s_byte = static_cast<double>(bytes);
  int i = 0;
  // scale down by 1024 until the value fits the largest applicable suffix
  // (stop at num_suffix - 1 so suffix[i] can never go out of bounds)
  for (; dbl_s_byte >= 1024.0 && i < num_suffix - 1; ++i)
  {
    dbl_s_byte /= 1024.0;
  }

  const int buf_len = 64;
  char buf[buf_len];

  // use snprintf so there is no buffer overrun
  int res = snprintf(buf, buf_len,"%0.2f%s", dbl_s_byte, suffix[i]);
  
  // snprintf returns number of characters that would have been written if n had
  //       been sufficiently large, not counting the terminating null character.
  //       if an encoding error occurs, a negative number is returned.
  if (res >= 0)
  {
    return std::string(buf);
  }
  return std::string();
}

int main(int argc, char* argv[])
{
  std::uint64_t SIZE_BYTES = 1073741824; // 1GB
  
  if (argc > 1)
  {
    SIZE_BYTES = std::stoull(argv[1]);
    std::cout << "Using buffer size from command line: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }
  else
  {
    std::cout << "To specify a custom buffer size: big_memcpy_test  \n"
              << "Using built in buffer size: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }
  

  // big array to use for testing
  char* p_big_array = NULL;

  /////////////
  // malloc 
  {
    Timer timer;
  
    p_big_array = (char*)malloc(SIZE_BYTES * sizeof(char));
    if (p_big_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " returned NULL!"
                << std::endl;
      return 1;
    }
    
    std::cout << "malloc for " << formatBytes(SIZE_BYTES) << " took "
              << timer.elapsedMs() << "ms"
              << std::endl;
  }
  
  /////////////
  // memset
  {
    Timer timer;

    // set all data in p_big_array to 0
    memset(p_big_array, 0, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memset for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;
  }

  /////////////
  // memcpy 
  {
    char* p_dest_array = (char*)malloc(SIZE_BYTES);
    if (p_dest_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memcpy test"
                << " returned NULL!"
                << std::endl;
      return 1;
    }
    memset(p_dest_array, 0, SIZE_BYTES * sizeof(char));

    // time only the memcpy FROM p_big_array TO p_dest_array
    Timer timer;

    memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));
    
    double elapsed_ms = timer.elapsedMs();
    std::cout << "memcpy for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;

    // cleanup p_dest_array
    free(p_dest_array);
    p_dest_array = NULL;
  }

  // cleanup
  free(p_big_array);
  p_big_array = NULL;
  
  return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 2.4.0)
project(big_memcpy_test)

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

# create verbose makefiles that show each command line as it is issued
set( CMAKE_VERBOSE_MAKEFILE ON CACHE BOOL "Verbose" FORCE )
# release mode
set( CMAKE_BUILD_TYPE Release )
# grab in CXXFLAGS environment variable and append C++11 and -Wall options
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x -Wall" )
message( STATUS "CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}" )

# sources to build
set(big_memcpy_test_SRCS
  main.cpp
)

# create an executable file named "big_memcpy_test" from
# the source files in the variable "big_memcpy_test_SRCS".
add_executable(big_memcpy_test ${big_memcpy_test_SRCS})

I have compared timings of this program on my laptop to our server.

Results:

Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | NUMA nodes (numactl --hardware)
-----------------+-------------+-------------+-------------+--------------------------------
My Laptop        | 0           | 127         | 113         | 1
server113        | 0           | 306         | 301         | 2

As you can see, my laptop is much faster than our brand-new server.

I tried specifying NUMA affinity for both CPU and memory using the following:

$ numactl --cpunodebind=0 --membind=0 ./big_memcpy_test

Specifying NUMA affinity did not improve the results.

Operating System

$ cat /etc/redhat-release
Scientific Linux release 6.5 (Carbon) 
$ uname -a                      
Linux r113 2.6.32-431.1.2.el6.x86_64 #1 SMP Thu Dec 12 13:59:19 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

GPU Hardware

$ nvidia-smi
Tue Apr  1 12:25:54 2014                                                                     
+------------------------------------------------------+                                     
| NVIDIA-SMI 331.38     Driver Version: 331.38         |                                     
|-------------------------------+----------------------+----------------------+              
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |              
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |              
|===============================+======================+======================|              
|   0  Tesla K20Xm         On   | 0000:03:00.0     Off |                    0 |              
| N/A   24C    P8    18W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+              
|   1  Tesla K20Xm         On   | 0000:04:00.0     Off |                    0 |              
| N/A   30C    P8    18W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+              
|   2  Tesla K20Xm         On   | 0000:84:00.0     Off |                    0 |              
| N/A   24C    P8    17W / 235W |     13MiB /  5759MiB |      0%      Default |              
+-------------------------------+----------------------+----------------------+

NUMA Nodes

#### Device 0 ####
$ cat /sys/class/pci_bus/0000\:03/device/numa_node             
0

#### Device 1 ####
$ cat /sys/class/pci_bus/0000\:04/device/numa_node             
0

#### Device 2 ####
$ cat /sys/class/pci_bus/0000\:84/device/numa_node             
1

NUMA Hardware

$ numactl --hardware            
available: 2 nodes (0-1)                                                                     
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23                                         
node 0 size: 65501 MB                                                                        
node 0 free: 62607 MB                                                                        
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31                                   
node 1 size: 65536 MB                                                                        
node 1 free: 63857 MB                                                                        
node distances:                                                                              
node   0   1                                                                                 
  0:  10  21                                                                                 
  1:  21  10

bandwidthTest
#### Device 0 ####

$ ./bandwidthTest --device=0 --memory=pageable                                                                               
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 0: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2758.3                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2503.8                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169482.9                                                     
                                                                                             
Result = PASS 

#### CPU Affinity ####

taskset 0x2 ./bandwidthTest --device=0 --memory=pageable                                          
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 0: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2757.8                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2708.0                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169485.8                                                     
                                                                                             
Result = PASS

#### Device 1 ####

$ ./bandwidthTest --device=1 --memory=pageable                                                                   
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 1: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2713.5                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2427.6                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169415.4                                                     
                                                                                             
Result = PASS

#### CPU Affinity ####

$ taskset 0x2 ./bandwidthTest --device=1 --memory=pageable                                                                   
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 1: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2732.7                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2540.0                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169583.5                                                     
                                                                                             
Result = PASS

#### Device 2 ####

$ ./bandwidthTest --device=2 --memory=pageable                                                                              
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 2: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2749.2                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2943.6                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169896.1                                                     
                                                                                             
Result = PASS

#### CPU Affinity ####
# Note used 0x8 because this GPU device is on NUMA node 1 along with CPU 8

$ taskset 0x8 ./bandwidthTest --device=2 --memory=pageable                                                                   
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 1: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2834.0                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     2950.1                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PAGEABLE Memory Transfers                                                                   
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169990.0                                                     
                                                                                             
Result = PASS

Looks like poor performance for host->device and device->host for pageable memory.

Edit: added OS version

What is the host->device and device->host throughput reported by bandwidthTest with pinned memory? Using STREAM or a similar benchmark, what host->host copy throughput is reported?

From the data shown so far it appears that the root cause could be unrelated to the GPU or CUDA. It seems to be a case of slow host-to-host copies, which in turn affects the performance of copies to and from the GPU using pageable host memory, as these involve an intermediate host-to-host transfer to / from a pinned host buffer maintained by the CUDA driver.

Pinned Bandwidth Test

#### Device 0 ####
$ ./bandwidthTest --device=0 --memory=pinned                                                                                 
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 0: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     5881.5                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     6383.9                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169536.1                                                     
                                                                                             
Result = PASS

#### Device 1 ####
$ ./bandwidthTest --device=1 --memory=pinned                                                                                
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 1: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     5763.2                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     6399.8                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169546.1                                                     
                                                                                             
Result = PASS

#### Device 2 ####
$ ./bandwidthTest --device=2 --memory=pinned                                                                                
[CUDA Bandwidth Test] - Starting...                                                          
Running on...                                                                                
                                                                                             
 Device 2: Tesla K20Xm                                                                       
 Quick Mode                                                                                  
                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     5727.9                                                       
                                                                                             
 Device to Host Bandwidth, 1 Device(s)                                                       
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     6392.4                                                       
                                                                                             
 Device to Device Bandwidth, 1 Device(s)                                                     
 PINNED Memory Transfers                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                              
   33554432                     169478.6                                                     
                                                                                             
Result = PASS

njuffa, I think you are correct. I'm going to cross-post my question on Stack Overflow for further help.

@sixgig, the earlier H>D and D>H numbers look like standard PCIe 2.0 x8 bandwidth results.

Ah, but the "--pinned" results appear to be PCIe 2.0 x16.

They’re a little slower than I would’ve expected.

But as @njuffa pointed out, this probably has little to do with your H>H issue.

These transfer rates with pinned memory are in the expected range for a PCIe gen 2 x16 interface, i.e. around 6 GB/sec. Since the throughput when using pageable memory is substantially lower, and your original post indicates that host-to-host copies take 2.5x the time of device<->host copies, it seems the overall transfer rate with pageable host memory is completely limited by the performance of host->host copies.

I have no insights as to why host->host copies would be this slow, especially as the system appears to be of quite recent vintage based on the type of CPU used. As a sanity check, have you tried different transfer sizes to see whether you may be hitting a pathological case (TLB thrashing, etc.)?

[Later:] Given that this is a server, there aren’t by any chance other processes running on this machine that could be hammering the system memory?

[Even later:] You may want to review the system BIOS settings related to system memory.

As a quick test, try changing the memcpy() to a memmove().

I have a similar host available (CentOS 6.5, dual E5-2650 v2 Xeons) and see similar lower-than-expected memcpy performance. From what I can tell, this seems related to memcpy from libc and not the hardware. The STREAM benchmark reports ~13GB/s for copies and memtest86 reports near the expected DRAM bandwidth (I haven’t run it recently, but recall it being reasonable). Switching from memcpy() to memmove() nearly doubled the performance on my CentOS 6.5 system, so I suspect there is a performance issue with the particular implementation of memset/memcpy used. Other systems that I tested (Fedora 19, Ubuntu 12.04) did not have this issue. I will poke at it some more later this evening.

I am puzzled by this observation. memmove() allows for possible overlap while memcpy() assumes no overlap. So a memcpy() implementation has fewer restrictions placed upon it and should never be slower, and can potentially be faster, than memmove().

I would expect a four-channel server memory subsystem using DDR3 to provide between 9 GB/sec and 15 GB/sec of theoretical bandwidth per channel, depending on memory frequency, i.e. roughly 36 GB/sec to 60 GB/sec across all four channels. Since a copy operation involves both read and write streams, that would translate to roughly 14 GB/sec to 24 GB/sec of copy throughput if one assumes 80% efficiency vs theoretical.

Best results for multi-socket systems require controlling for NUMA effects, but per an earlier post in this thread the original poster already tried numactl to bind CPU and memory and it didn’t help.

@allanmac The sad part is that this is a brand new system with PCI-e 3.0

Chassis/Mobo: SuperMicro 1027GR-TRF (http://www.supermicro.com/products/system/1U/1027/SYS-1027GR-TRF.cfm)

@njuffa I have these servers to myself and there is nothing else running on them. I am planning on reviewing the BIOS settings; unfortunately, these servers are physically located in a different state, so I will have to wait until I visit our lab to review them.

@tbenson Interesting results: initially, changing memcpy to memmove produced the same results when built in release mode (~300ms). Building the code in debug mode caused memmove to run in ~160ms.

Release Results
$ ./big_memcpy_test
To specify a custom buffer size: big_memcpy_test
Using built in buffer size: 1.00GB
malloc for 1.00GB took 0ms
memset for 1.00GB took 320ms (3.12GB bytes/sec)
memcpy for 1.00GB took 298ms (3.36GB bytes/sec)
memmove for 1.00GB took 298ms (3.36GB bytes/sec)

Debug Results

$ ./big_memcpy_test                                                             
To specify a custom buffer size: big_memcpy_test                                 
Using built in buffer size: 1.00GB                                                           
malloc for 1.00GB took 0ms                                                                   
memset for 1.00GB took 323ms (3.10GB bytes/sec)                                              
memcpy for 1.00GB took 298ms (3.36GB bytes/sec)                                              
memmove for 1.00GB took 159ms (6.29GB bytes/sec)

It looks like gcc is optimizing the memmove into a memcpy because it “knows” that the pointers do not overlap. If I put the memmove() call inside a function and call that function, the memmove doesn't get translated into a memcpy() and the release-mode results match the debug-mode results.
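A minimal version of such a wrapper (a sketch of the idea; the gcc-specific noinline attribute is one way to keep the call from being inlined and re-optimized at the call site):

#include <cstring>

// Prevent gcc from seeing the non-overlapping pointers at the call site and
// silently turning this memmove into a memcpy.
__attribute__((noinline))
void* memmove_wrapper(void* dest, const void* src, size_t n)
{
  return memmove(dest, src, n);
}

// in the timing section of big_memcpy_test:
//   memmove_wrapper(p_dest_array, p_big_array, SIZE_BYTES);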

Function Wrapped memmove Release Mode

$ ./big_memcpy_test                                                             
To specify a custom buffer size: big_memcpy_test                                 
Using built in buffer size: 1.00GB                                                           
malloc for 1.00GB took 0ms                                                                   
memset for 1.00GB took 306ms (3.27GB bytes/sec)                                              
memcpy for 1.00GB took 298ms (3.36GB bytes/sec)                                              
memmove for 1.00GB took 159ms (6.29GB bytes/sec)

The interesting part is that memmove() on my laptop runs at exactly the same speed as on the server, but is slower than memcpy() on my laptop (the laptop is running the same version of Scientific Linux).

Function Wrapped memmove Laptop Release Mode

$ ./big_memcpy_test 
To specify a custom buffer size: big_memcpy_test  
Using built in buffer size: 1.00GB
malloc for 1.00GB took 0ms
memset for 1.00GB took 130ms (7.69GB bytes/sec)
memcpy for 1.00GB took 111ms (9.01GB bytes/sec)
memmove for 1.00GB took 159ms (6.29GB bytes/sec)

@njuffa I agree that it is puzzling. In looking around, I have found that the memcpy() implementation changed in glibc 2.12 (which is used in CentOS 6) to copying backwards (Glibc change exposing bugs [LWN.net]), but I’m not sure if that is related. I would be interested in seeing results from CentOS 6.4 or 6.3 – I don’t recall this issue in the past, but all of the CentOS machines to which I have access have been updated.

@sixgig I see similar results as you on a similar system, so I doubt that BIOS settings need to be changed. You can confirm your DRAM clock frequency via dmidecode -t memory. Perhaps getting a different memcpy/memset/memmove implementation and using LD_PRELOAD could be an interim solution.
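For example, an untested quick-and-dirty shim that routes memcpy() to libc's memmove() might look like this (build it as a shared library and preload it; -fno-builtin keeps gcc from rewriting the call inside the shim):

// memcpy_shim.cpp -- untested sketch of an LD_PRELOAD interposer.
// Build: g++ -O2 -shared -fPIC -fno-builtin -o libmemcpyshim.so memcpy_shim.cpp
// Run:   LD_PRELOAD=./libmemcpyshim.so ./big_memcpy_test
#include <stddef.h>

extern "C" void* memmove(void* dest, const void* src, size_t n);

extern "C" void* memcpy(void* dest, const void* src, size_t n)
{
  // memmove resolves to the libc implementation, since this library only
  // overrides memcpy.
  return memmove(dest, src, n);
}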

@sixgig: Are these throughput numbers measured for host-to-host copies between a pinned buffer and pageable memory, or simply inside pageable memory (like STREAM)? There could be a difference.

If you are using a pinned buffer, you might want to vary its size to see if that has any effect. A large pinned buffer may have a negative impact on system performance. The CUDA driver allocates a reasonably small pinned buffer and moves the data in chunks if necessary for large transfers.
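Conceptually, the chunked approach amounts to something like the following sketch (names and chunk size are arbitrary; error checking omitted):

#include <cuda_runtime.h>
#include <cstring>
#include <algorithm>

// Copy a large pageable buffer to the device through a small pinned staging
// buffer, one chunk at a time.
void chunkedUpload(const char* p_pageable_src, char* d_dst, size_t bytes,
                   size_t chunk_bytes = 4 << 20 /* 4 MB staging buffer */)
{
  char* p_stage = NULL;
  cudaMallocHost((void**)&p_stage, chunk_bytes);

  for (size_t off = 0; off < bytes; off += chunk_bytes)
  {
    const size_t n = std::min(chunk_bytes, bytes - off);
    memcpy(p_stage, p_pageable_src + off, n);                    // pageable -> pinned
    cudaMemcpy(d_dst + off, p_stage, n, cudaMemcpyHostToDevice); // pinned -> device
  }

  cudaFreeHost(p_stage);
}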

@tbenson The memory clock frequencies are exactly the same on the laptop and the server. This is what is concerning; I would expect the throughput to be in the same ballpark.

@njuffa These throughputs are measured on host-to-host copies between two buffers that are both allocated with malloc(). Neither of the buffers is pinned. Everything is being run with numactl to set CPU and memory-node affinity. Both my laptop and the server have 1600MHz memory. I also tried a brand new server (1 month old) that has 1866MHz memory and got exactly the same results as the server with 1600MHz memory.

Edit: changed “Neither of the buffers are paged” to “Neither of the buffers are pinned”. Thanks njuffa :)

I assume you meant “neither buffer is pinned” as malloc() allocates regular pageable memory.

Four channels of DDR3-1866 memory should provide 59.7 GB/sec of theoretical bandwidth per socket (1866 MT/sec x 8 bytes x 4 channels), and I would expect copy throughput of around 24 GB/sec (= combined read+write bandwidth of 48 GB/sec) if the memory subsystem is configured well.

In a recent case of unexpected system slowness with a remote server that was discussed in this forum, it turned out the server had triggered a power or thermal limit that caused it to switch to a fail-safe mode with very low performance.

SBIOS settings (in particular those pertaining to system memory) are also worth examining; some systems offer "performance" vs "efficiency" modes, for example. Since this appears to be a host-side issue, you may want to consider contacting the system vendor.

It would be nice if you could inform us of the root cause once you have identified it.