cudaMemcpyDeviceToHost 3x slower than cudaMemcpyHostToDevice

Hi guys. Recently I found there is a slowing down in my app. After nvvp profiling, it seems it is caused by memcpy from Device to Host. cudaMemcpyDeviceToHost is 3x slower than cudaMemcpyHostToDevice. I can use a small example to reproduce it but I don’t know why.

The machine that I am using is Threadripper 1950X + two 1080 Ti on Asus ROG X399. Driver is 410.79 on Linux 16.04 with Cuda 9.0. Below is the code:

#include <iostream>
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main()
    int32_t size = 100000000;
    float *output_cpu = nullptr;
    cudaMallocHost((void **)&output_cpu, size * sizeof(float));

    float *output_gpu;
    cudaMalloc((void **)&output_gpu, size * sizeof(float));

    cudaEvent_t e1, e2;
    float time;

    for (int32_t i = 0; i < 1; i++) {
      cudaMemcpyAsync(output_cpu, output_gpu, size * sizeof(float), cudaMemcpyDeviceToHost);
      cudaEventElapsedTime(&time, e1, e2);
      std::cout << "bandwidth:" << size * sizeof(float) * 1.0f / 1000000000 / (time / 1000) << "G/s" << std::endl;

      cudaMemcpyAsync(output_gpu, output_cpu, size * sizeof(float), cudaMemcpyHostToDevice);
      cudaEventElapsedTime(&time, e1, e2);
      std::cout << "bandwidth:" << size * sizeof(float) * 1.0f / 1000000000 / (time / 1000) << "G/s" << std::endl;
    return 0;

The result is like:

I checked the affinity as well:

~/cuda-workspace/Test/src$ nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	SYS	0-31
GPU1	SYS	 X 	0-31

I am not able to tell any useful information. So could you please provide some insight? I’ve been working on it for a couple of days but there is still no clue. I really appreciate it!

Other than that your program is missing an #include (to import the int32_t type) I see nothing wrong with the repro code, and when I ran it on my system it reported bandwidth for the two directions that was within 10% of each other, which is expected. In general, with throughput measurements it is best to repeat the measurements a few times, as the first measurement in particular may be affected by cold start issues. So what I would suggest to improve robustness is to put an outer loop around your current ‘i’-loop that runs a few repetitions.

I haven’t seen a case like this with a single-socket system before, but I suspect this may have something to do with the internal architecture of the Threadripper, which to my understanding comprises multiple compute clusters connected by a high-speed interconnect, where each cluster has its own memory controller, giving the whole Threadripper NUMA qualities.

What I would suggest as an experiment is to use numactl (or an equivalent tool) to control memory affinity and CPU affinity so GPU0 “talks” to the “nearest” cluster and the the “nearest” memory controller. I am not sure how to do that, but the goal would be to restrict interaction with the GPU to one half (or one quadrant) of the Threadripper, e.g. cores 0-3 only. I have no hands-on experience with Threadripper, maybe some other forum participant can provide insights.

Even considering (as a working hypothesis) that your observation boils down to some sort of NUMA issue, a throughput difference by a factor of 3x doesn’t make sense to me right now.

[Later:] A quick Google search tells me the Threadripper 1950X consists of two physical chips, each of which consists of two cluster with four cores each, for sixteen cores total. As there is hyperthreading support, this means the first cluster comprises cores 0-3 and threads 0-7. You would want to restrict affinity to that one quadrant for the experiment. I also read that AMD provides a utility for switching the processor between UMA and NUMA modes, so maybe play with that as well to see whether it has an impact on your application. Apparently the UMA mode somehow spreads memory activity across all available memory channels, although it is not cleat to me how the load distribution works in that configuration.