The time spent in cudaMemcpy when copying data from the CPU to the GPU occasionally fluctuates significantly

I have a program in which, every 500 microseconds, a float array of size 709 is generated on the CPU; the GPU then computes a result that is copied back to the CPU. The process looks like this:

float inference(std::vector<float> &input_h) {
    cudaMemcpy(input_d, input_h.data(), size * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<...>>>(input_d, output_d, size);
    cudaMemcpy(output_h.data(), output_d, size * sizeof(float), cudaMemcpyDeviceToHost);
    return output_h[0];  // simplified sketch; the real code returns the computed result
}

When running it, I found that the time spent in inference occasionally increases by more than tenfold. I profiled with nsys and discovered that the time spent in the cudaMemcpy call fluctuates.
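In case it is relevant, the nsys invocation was essentially of this form (the report name and the --trace selection below are just what I typically use, not necessarily significant; exact flags may differ by nsys version):

# trace CUDA API calls and OS runtime calls, write a report file
nsys profile -o report --trace=cuda,osrt ./test
# summarize the recorded CUDA API call durations
nsys stats report.nsys-rep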

To further analyze this issue, I tested the time spent on cudaMemcpy. The program is as follows:

#include <cuda_runtime.h>
#include <iostream>
#include <random>

#define BLOCK_SIZE 1024

#define CHECK_CUDA_ERROR(call)                                                 \
  do {                                                                         \
    cudaError_t err = call;                                                    \
    if (err != cudaSuccess) {                                                  \
      std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__ << ": "     \
                << cudaGetErrorString(err) << std::endl;                       \
      exit(1);                                                                 \
    }                                                                          \
  } while (0)

__global__ void demo_kernel(float *input, float *output, size_t size) {
  size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < size) {
    output[idx] = input[idx] * 3;
  }
}

void inference(float *input_h, float *input_d, float *output_d, size_t size) {
  cudaMemcpy(input_d, input_h, size * sizeof(float), cudaMemcpyHostToDevice);
  size_t num_blocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
  demo_kernel<<<num_blocks, BLOCK_SIZE>>>(input_d, output_d, size);
  // copying the output back to the CPU is omitted here
}

void testMemoryTransfer(size_t size) {
  float *input_h;
  float *input_d;
  float *output_d;
  CHECK_CUDA_ERROR(cudaMallocHost(&input_h, size * sizeof(float)));
  CHECK_CUDA_ERROR(cudaMalloc(&input_d, size * sizeof(float)));
  CHECK_CUDA_ERROR(cudaMalloc(&output_d, size * sizeof(float)));

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<> dis(0.0, 1.0);

  cudaEvent_t start, stop;
  CHECK_CUDA_ERROR(cudaEventCreate(&start));
  CHECK_CUDA_ERROR(cudaEventCreate(&stop));

  for (int i = 0; i < 10000; ++i) {
    // generate some random data
    for (size_t j = 0; j < size; ++j) {
      input_h[j] = dis(gen);
    }
    // test the inference time
    CHECK_CUDA_ERROR(cudaEventRecord(start, 0));
    inference(input_h, input_d, output_d, size);
    CHECK_CUDA_ERROR(cudaEventRecord(stop, 0));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float milliseconds = 0;
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&milliseconds, start, stop));
    std::cout << milliseconds << std::endl;
  }

  // Clean up
  CHECK_CUDA_ERROR(cudaEventDestroy(start));
  CHECK_CUDA_ERROR(cudaEventDestroy(stop));
  CHECK_CUDA_ERROR(cudaFree(input_d));
  CHECK_CUDA_ERROR(cudaFree(output_d));
  CHECK_CUDA_ERROR(cudaFreeHost(input_h));
}

int main() {
  CHECK_CUDA_ERROR(cudaSetDevice(0));
  testMemoryTransfer(709);
  return 0;
}

Run nvcc test.cu -o test && ./test > a.txt, and then use the following Python program to visualize the results:

import matplotlib.pyplot as plt
import numpy as np

data = np.loadtxt("a.txt")
plt.figure(figsize=(20, 6))
plt.plot(data)
plt.ylabel("time (ms)")
plt.yscale('log')
plt.show()
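For a quick numeric check without plotting, the largest samples can also be pulled straight out of a.txt with standard shell tools, for example:

# print the ten largest per-iteration times (in ms)
sort -g a.txt | tail -n 10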

I observed that cudaMemcpy indeed occasionally slows down significantly. Why is this happening? How can I avoid the significant fluctuations in data transfer times?

One possibility could be that the GPU is being used for something else from time to time, such as graphics.
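One way to check for that, assuming nvidia-smi is available on the machine, is to watch what else touches the GPU while the test runs, for example:

# the process table at the bottom lists compute (C) and graphics (G) clients
nvidia-smi
# print SM and memory utilization once per second while the test is running
nvidia-smi dmon -s u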

To generalize Robert Crovella’s point: timing differences can be caused by any kind of contention for a shared resource, be that the GPU, the PCIe interconnect, system memory, or the CPU. This can be exacerbated by the “stampeding herd” effect seen in time-sliced operating systems, i.e. various activities cluster at the start of each time slice.

Is this observed on a system with dual CPU sockets, or using a CPU built from multiple chiplets internally? If so, try using numactl or a similar tool to set the processor and memory affinity of the process such that the GPU is always communicating with “near” CPU cores and associated “near” memory.
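A possible sketch of that (the device and node numbers below are illustrative; check them against the actual machine): nvidia-smi reports the CPU affinity and NUMA node of each GPU, and numactl can then pin the process to that node.

# show the GPU/CPU topology matrix, including CPU affinity and NUMA node per GPU
nvidia-smi topo -m
# assuming GPU 0 is reported as attached to NUMA node 0, bind the process there
numactl --cpunodebind=0 --membind=0 ./test > a.txt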

I don’t know what kind of guarantees NVIDIA provides for nsys, but profiler artifacts are a possibility at least in principle, e.g. having to do with sampling frequencies.

Other effects resulting from dynamic CPU and GPU clocking could be in play here, but given that this activity occurs every 500 microseconds that seems unlikely to me.
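If one wanted to rule that out anyway, the clocks can be observed, and on GPUs that support it pinned, with nvidia-smi (locking usually requires root privileges, and the clock values below are only placeholders):

# report current, application, and maximum clocks
nvidia-smi -q -d CLOCK
# lock the GPU core clock to a fixed range on supported GPUs (placeholder values)
nvidia-smi --lock-gpu-clocks=1410,1410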

Here is my system info. I will try numactl, thank you for your help.

I tried the following numactl commands, but the time for cudaMemcpy still occasionally increases significantly.

➜  cuda_copy_survey numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 0 size: 1019858 MB
node 0 free: 776355 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254
node 1 size: 1032112 MB
node 1 free: 449028 MB
node distances:
node   0   1 
  0:  10  32 
  1:  32  10 
➜  cuda_copy_survey numactl --cpunodebind=0 --membind=0 ./a.out > a.txt
➜  cuda_copy_survey numactl --cpunodebind=1 --membind=1 ./a.out > a.txt

This EPYC CPU is certainly built from chiplets, where each chiplet provides a number of PCIe lanes and memory controllers. I think in this case it might be eight core complexes, each comprising 8 CPU cores, 16 PCIe Gen4 lanes, and one DDR4 memory channel (for a total of 64 cores, 128 PCIe lanes, and eight memory channels).

I am not familiar enough with the topology to assess whether these numactl bindings properly match each of the four GPUs in the system to the nearest core complex.
