When would cudaMemPrefetchAsync exhibit synchronous behavior?

Hi, I found a strange behavior that sometimes cudaMemPrefetchAsync seems to be synchronous with a simple test program below. This program issues API with a sequence Prefetch/Prefetch/Prefetch/Prefetch/KernelLaunch/KernelLaunch/KernelLaunch/KernelLaunch (PPPPKKKK).

#include <cuda.h>
#include <cuda_profiler_api.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

// sleep_cycles >= 1e7 or it would cause strange behavior
__global__ void gpu_sleep(long long *d_o, long long sleep_cycles) {
  long long start = clock64();
  long long cycles_elapsed;
  do {
    cycles_elapsed = clock64() - start;
  } while (cycles_elapsed < sleep_cycles);
  if (d_o != NULL) {
    *d_o = cycles_elapsed;
  }
}

cudaStream_t kernel_stream;
const long long GPU_SLEEP_TIME = (1 << 24);
const long long DATA_SIZE = (1 << 30); // large data

#define BLOCK_NUM 4
long long *d_ptr[BLOCK_NUM];
long long d_ptr_size[BLOCK_NUM] = {DATA_SIZE, DATA_SIZE, DATA_SIZE, DATA_SIZE};

int main() {
  for (int i = 0; i < BLOCK_NUM; i++)
    cudaMallocManaged(&d_ptr[i], d_ptr_size[i]);

  cudaStreamCreate(&kernel_stream);

  for (int i = 0; i < BLOCK_NUM; i++) {
    cudaMemPrefetchAsync(d_ptr[i], d_ptr_size[i], 0, kernel_stream);
  }
  for (int i = 0; i < BLOCK_NUM; i++) {
    gpu_sleep<<<1, 1, 0, kernel_stream>>>(d_ptr[i], GPU_SLEEP_TIME);
  }

  for (int i = 0; i < BLOCK_NUM; i++)
    cudaFree(d_ptr[i]);

  cudaDeviceSynchronize();
}

With nvvp this would look like,
https://ibb.co/QpzQJdC

I have tested this program on both GeForce RTX 2080 Ti and TESLA-V100 with CUDA 10.2.

If we profile this program with nvprof, we could see some of cudaMemPrefetchAsync seems to execute synchronously regarded to the host.

If we change the API issue sequence from PPPPKKKK to PKPKPKPK, we could found the first cudaMemPrefetchAsync would also seem to be synchronous.

// change 33~38 lines to
for (int i = 0; i < BLOCK_NUM; i++) {
  cudaMemPrefetchAsync(d_ptr[i], d_ptr_size[i], 0, kernel_stream);
  gpu_sleep<<<1, 1, 0, kernel_stream>>>(d_ptr[i], GPU_SLEEP_TIME);
}

With nvvp this would look like,
https://ibb.co/hD6DXWq

Why there would be this strange behavior?

Thanks a lot for your help.

You’re putting all the copies and kernel in the same stream. Operations in a stream run in-order, operation in multiple streams run out-of-order. I believe you want four.

Checkout https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

Hi,

I understand that issuing the cudaMemPrefetchAsync and Kernel in the same stream will causing in-order execution, and the kernel and copies would not overlap with respect to the GPU side.

However, I am more curious about synchronous behavior with respect to the host side. I drew the below figure to illustrate that some cudaMemPrefetchAsync do not return asynchronously.

https://i.ibb.co/Y2XbXr1/GEN-01.jpg
https://ibb.co/sjCtCBN

This phenomenon especially occurs when I try to call cudaMemPrefetchAsync when there is no kernel within the stream. I cannot figure out the reason.

Thanks.

I’m still not understanding the question, but let me try to explain it another way. I’m guessing because you are seeing some API calls take much longer than the others they are running synchronously??? That’s not how it works. When you’re looking at the CUDA API row in the profiler, you should see all commands serially. Because that’s how they appear in host code. Host code runs serially. Those commands are sending instructions to the device. If you are using asynchronous calls with multiple streams, the device code then has the ability to run commands in parallel (if resources are available).

If you look at your image, https://ibb.co/hD6DXWq, and zoom in after the cudaMemPrefetchAsync, you should see all API calls serially. This is expected behaviour.

Hi,

Sorry that I might not state my question clearly. Let me explain it again.

I think both cudaMemPrefetch and KernelLaunch should return immediately when I call it.

I just want to know, why some of cudaMemPrefetch take a longer time to return?

I don’t care how these calls run on the GPU.

In my example, all cudaMemPrefetch should send the same instructions to the device, while some of
them take a longer time to return.

I am working on a much more complex project that this unexpected calling overhead will cause some problems. So if it is possible, I want to ask when would async calls take a longer time to return?

Thanks.

cudaMemPrefetchAsync can have at least 2 modalities under the hood.

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

begin reading starting at:

“There are specific rules on how prefetching interacts with CUDA streams.”