Jetpack 4.6.1 Cuda memcopy schedule bug

locki.janos.m · July 27, 2022, 10:10pm

Hello!

During our upgrade to Jetpack 4.6.1 from Jetpack 4.4 , we noticed something strange around cudaMemcpyAsync calls.
A H2D copy can be blocked by an earlier issued D2H copy from another non blocking Cuda stream.
This can cause big delays if D2H copy is last call in a long computation sequence.

Here is a small reproduction of the problem. It reproduces the bug on Jetson Xavier NX with 100% success rate.

#include <thread>
#include <iostream>
#include <cuda_runtime_api.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

void kernel_call()
{
  int N = 1<<20;
  float *d_x;
  float *d_y;
  float *h_y;

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));
  cudaMallocHost(&h_y, N*sizeof(float));

  cudaStream_t stream;
  cudaEvent_t event;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  cudaEventCreateWithFlags(&event, cudaEventBlockingSync | cudaEventDisableTiming);

  for(size_t i = 0; i < 10000; ++i)
  {
    for(size_t j = 0; j < 50000; ++j)
    {
      saxpy<<<(N+255)/256, 256,0, stream>>>(N, 2.0f, d_x, d_y);
      if( cudaPeekAtLastError() != cudaSuccess)
        std::exit(1);
    }
    cudaMemcpyAsync(h_y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
  }

  cudaFree(d_x);
  cudaFree(d_y);
  cudaFreeHost(h_y);
}

void upload_call(cudaStream_t stream)
{
  constexpr size_t N = 20'000'000;
  uint8_t *pinned_mem;
  uint8_t *device_mem;
  cudaMalloc(&device_mem, N*sizeof(uint8_t));
  cudaMallocHost(&pinned_mem, N*sizeof(uint8_t));

  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventBlockingSync | cudaEventDisableTiming);


  for(size_t i = 0; i < 10000; ++i)
  {
    auto t0 = std::chrono::steady_clock::now();
    auto line_size = 1024 * 3 * sizeof(uint8_t);
    auto transfer_size = 1024 * line_size;
    auto device_mem_start = device_mem + (i % 4) * transfer_size;
    auto pinned_mem_start = pinned_mem + (i % 4) * transfer_size;

    auto err = cudaMemcpy2DAsync(device_mem_start, line_size, pinned_mem_start, line_size, line_size, 1024, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
    auto t1 = std::chrono::steady_clock::now();
    if(err != cudaSuccess)
      std::exit(1);
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count() <<std::endl;

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  cudaFree(device_mem);
  cudaFreeHost(pinned_mem);
}


int main()
{
  cudaStream_t upload_stream;
  cudaStreamCreateWithFlags(&upload_stream, cudaStreamNonBlocking);

  std::thread upload(upload_call, upload_stream);
  std::thread kernel(kernel_call);

  kernel.join();
  upload.join();

  cudaStreamDestroy(upload_stream);
}

I hope you can create some bug fix patch for this in Jatpack 4.6.
Thanks for your help,

János

AastaLLL · July 28, 2022, 4:11am

Hi,

Thanks for your reporting.
We are checking this internally and will give you more information later.

Thanks.

AastaLLL · July 28, 2022, 4:21am

Hi,

Just want to confirm first.

Do you indicate the peak (ex. 238891) among the loop?
And this peak won’t happen on the JetPack 4.4+XavierNX?

Thanks.

locki.janos.m · July 28, 2022, 6:18am

Yeah , that peak is resulted by the wrong block between the copy commands.

You can change j counter end from 50000 to 5000 in this section:

It will results that every 9th h2d copy will be blocked by the other stream’s d2h copies.

If you change from 50000 to 500, then every h2d copy will be blocked. It can be seen totally clear in NshightSystem also.

I tested on Jetpack 4.4 and also on my x86 machine but I didn’t find this kind of defection.

AastaLLL · August 1, 2022, 4:45am

Thanks for the update.

We are checking this issue internally.
Will share more information with you later.

AastaLLL · August 2, 2022, 2:23am

Hi,

We have tested this on XavierNX+JetPack4.4.
The same issue still occurs.

Could you double-confirm if the same behavior can be seen on both JetPack versions?

Thanks.

AastaLLL · August 2, 2022, 3:11am

Hi,

Please try to increase the compute channel for XavierNX.
For example, we don’t see the blocking issue if we set the channel to 32.

$ export CUDA_DEVICE_MAX_CONNECTIONS=32
$ ./out

You can find more information about the environment variable in our document:

Thanks.

locki.janos.m · August 2, 2022, 7:36am

Hello,

I double checked, and it also occurs on XavierNX+JetPack4.4… My bad, sorry.

Setting CUDA_DEVICE_MAX_CONNECTIONS=32 solves the issue. And it works perfectly with value 4, 8 (default), 16 also until I export the variable.

Thanks for the help!

system · August 24, 2022, 2:53am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.