cudaMemcpyDeviceToHost 3x slower than cudaMemcpyHostToDevice

falefa · January 9, 2019, 8:36pm

Hi guys. Recently I found there is a slowing down in my app. After nvvp profiling, it seems it is caused by memcpy from Device to Host. cudaMemcpyDeviceToHost is 3x slower than cudaMemcpyHostToDevice. I can use a small example to reproduce it but I don’t know why.

The machine that I am using is Threadripper 1950X + two 1080 Ti on Asus ROG X399. Driver is 410.79 on Linux 16.04 with Cuda 9.0. Below is the code:

#include <iostream>
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main()
{
    int32_t size = 100000000;
    float *output_cpu = nullptr;
    cudaSetDevice(0);
    cudaMallocHost((void **)&output_cpu, size * sizeof(float));

    cudaSetDevice(0);
    float *output_gpu;
    cudaMalloc((void **)&output_gpu, size * sizeof(float));

    cudaEvent_t e1, e2;
    float time;
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);

    for (int32_t i = 0; i < 1; i++) {
      cudaEventRecord(e1);
      cudaMemcpyAsync(output_cpu, output_gpu, size * sizeof(float), cudaMemcpyDeviceToHost);
      cudaEventRecord(e2);
      cudaEventSynchronize(e2);
      cudaEventElapsedTime(&time, e1, e2);
      std::cout << "bandwidth:" << size * sizeof(float) * 1.0f / 1000000000 / (time / 1000) << "G/s" << std::endl;

      cudaEventRecord(e1);
      cudaMemcpyAsync(output_gpu, output_cpu, size * sizeof(float), cudaMemcpyHostToDevice);
      cudaEventRecord(e2);
      cudaEventSynchronize(e2);
      cudaEventElapsedTime(&time, e1, e2);
      std::cout << "bandwidth:" << size * sizeof(float) * 1.0f / 1000000000 / (time / 1000) << "G/s" << std::endl;
    }
    return 0;
}

The result is like:
bandwidth:4.64906G/s
bandwidth:13.1255G/s

I checked the affinity as well:

~/cuda-workspace/Test/src$ nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	SYS	0-31
GPU1	SYS	 X 	0-31

I am not able to tell any useful information. So could you please provide some insight? I’ve been working on it for a couple of days but there is still no clue. I really appreciate it!

njuffa · January 9, 2019, 9:52pm

Other than that your program is missing an #include (to import the int32_t type) I see nothing wrong with the repro code, and when I ran it on my system it reported bandwidth for the two directions that was within 10% of each other, which is expected. In general, with throughput measurements it is best to repeat the measurements a few times, as the first measurement in particular may be affected by cold start issues. So what I would suggest to improve robustness is to put an outer loop around your current ‘i’-loop that runs a few repetitions.

I haven’t seen a case like this with a single-socket system before, but I suspect this may have something to do with the internal architecture of the Threadripper, which to my understanding comprises multiple compute clusters connected by a high-speed interconnect, where each cluster has its own memory controller, giving the whole Threadripper NUMA qualities.

What I would suggest as an experiment is to use numactl (or an equivalent tool) to control memory affinity and CPU affinity so GPU0 “talks” to the “nearest” cluster and the the “nearest” memory controller. I am not sure how to do that, but the goal would be to restrict interaction with the GPU to one half (or one quadrant) of the Threadripper, e.g. cores 0-3 only. I have no hands-on experience with Threadripper, maybe some other forum participant can provide insights.

Even considering (as a working hypothesis) that your observation boils down to some sort of NUMA issue, a throughput difference by a factor of 3x doesn’t make sense to me right now.

[Later:] A quick Google search tells me the Threadripper 1950X consists of two physical chips, each of which consists of two cluster with four cores each, for sixteen cores total. As there is hyperthreading support, this means the first cluster comprises cores 0-3 and threads 0-7. You would want to restrict affinity to that one quadrant for the experiment. I also read that AMD provides a utility for switching the processor between UMA and NUMA modes, so maybe play with that as well to see whether it has an impact on your application. Apparently the UMA mode somehow spreads memory activity across all available memory channels, although it is not cleat to me how the load distribution works in that configuration.

Topic		Replies	Views
cudaMemcpyDeviceToHost speed how to improve speed CUDA Programming and Performance	3	12502	June 13, 2008
cudaMemcpyDeviceToHost taking much time? CUDA Programming and Performance	3	2653	July 15, 2009
cudaMemcpyDeviceToHost 200 x longer than cudaMemcpyHostToDevice ? CUDA Programming and Performance	2	1476	November 25, 2011
cudaMemcpyDeviceToHost - slow performance using pinned memory CUDA Programming and Performance	6	2806	June 24, 2016
Why cudaMemcpyDeviceToHost is too slowly? CUDA Programming and Performance	1	605	November 16, 2021
cudaMemcpy2DAsync a lot slower than cudaMemcpy normally CUDA Programming and Performance	6	113	August 22, 2024
`cudaMemcpyHostToDevice` is very slow CUDA Programming and Performance	8	1977	December 14, 2018
how to improve the memory allocation rate,data transfer rate from host to device and device to host CUDA Programming and Performance	9	5265	February 26, 2010
About CUDA CUDA Programming and Performance	2	4712	December 3, 2008
Zero-copy from host to device decreases cudaMemcpyAsync device to host performance CUDA Programming and Performance	0	554	January 27, 2020

cudaMemcpyDeviceToHost 3x slower than cudaMemcpyHostToDevice

Related topics