When would cudaMemPrefetchAsync exhibit synchronous behavior?

eee4017 · February 12, 2020, 7:23am

Hi, I found a strange behavior that sometimes cudaMemPrefetchAsync seems to be synchronous with a simple test program below. This program issues API with a sequence Prefetch/Prefetch/Prefetch/Prefetch/KernelLaunch/KernelLaunch/KernelLaunch/KernelLaunch (PPPPKKKK).

#include <cuda.h>
#include <cuda_profiler_api.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

// sleep_cycles >= 1e7 or it would cause strange behavior
__global__ void gpu_sleep(long long *d_o, long long sleep_cycles) {
  long long start = clock64();
  long long cycles_elapsed;
  do {
    cycles_elapsed = clock64() - start;
  } while (cycles_elapsed < sleep_cycles);
  if (d_o != NULL) {
    *d_o = cycles_elapsed;
  }
}

cudaStream_t kernel_stream;
const long long GPU_SLEEP_TIME = (1 << 24);
const long long DATA_SIZE = (1 << 30); // large data

#define BLOCK_NUM 4
long long *d_ptr[BLOCK_NUM];
long long d_ptr_size[BLOCK_NUM] = {DATA_SIZE, DATA_SIZE, DATA_SIZE, DATA_SIZE};

int main() {
  for (int i = 0; i < BLOCK_NUM; i++)
    cudaMallocManaged(&d_ptr[i], d_ptr_size[i]);

  cudaStreamCreate(&kernel_stream);

  for (int i = 0; i < BLOCK_NUM; i++) {
    cudaMemPrefetchAsync(d_ptr[i], d_ptr_size[i], 0, kernel_stream);
  }
  for (int i = 0; i < BLOCK_NUM; i++) {
    gpu_sleep<<<1, 1, 0, kernel_stream>>>(d_ptr[i], GPU_SLEEP_TIME);
  }

  for (int i = 0; i < BLOCK_NUM; i++)
    cudaFree(d_ptr[i]);

  cudaDeviceSynchronize();
}

With nvvp this would look like,
https://ibb.co/QpzQJdC

I have tested this program on both GeForce RTX 2080 Ti and TESLA-V100 with CUDA 10.2.

If we profile this program with nvprof, we could see some of cudaMemPrefetchAsync seems to execute synchronously regarded to the host.

If we change the API issue sequence from PPPPKKKK to PKPKPKPK, we could found the first cudaMemPrefetchAsync would also seem to be synchronous.

// change 33~38 lines to
for (int i = 0; i < BLOCK_NUM; i++) {
  cudaMemPrefetchAsync(d_ptr[i], d_ptr_size[i], 0, kernel_stream);
  gpu_sleep<<<1, 1, 0, kernel_stream>>>(d_ptr[i], GPU_SLEEP_TIME);
}

With nvvp this would look like,
https://ibb.co/hD6DXWq

Why there would be this strange behavior?

Thanks a lot for your help.

mnicely · February 12, 2020, 4:41pm

You’re putting all the copies and kernel in the same stream. Operations in a stream run in-order, operation in multiple streams run out-of-order. I believe you want four.

Checkout https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

eee4017 · February 13, 2020, 2:16am

Hi,

I understand that issuing the cudaMemPrefetchAsync and Kernel in the same stream will causing in-order execution, and the kernel and copies would not overlap with respect to the GPU side.

However, I am more curious about synchronous behavior with respect to the host side. I drew the below figure to illustrate that some cudaMemPrefetchAsync do not return asynchronously.

https://i.ibb.co/Y2XbXr1/GEN-01.jpg
https://ibb.co/sjCtCBN

This phenomenon especially occurs when I try to call cudaMemPrefetchAsync when there is no kernel within the stream. I cannot figure out the reason.

Thanks.

mnicely · February 13, 2020, 1:59pm

I’m still not understanding the question, but let me try to explain it another way. I’m guessing because you are seeing some API calls take much longer than the others they are running synchronously??? That’s not how it works. When you’re looking at the CUDA API row in the profiler, you should see all commands serially. Because that’s how they appear in host code. Host code runs serially. Those commands are sending instructions to the device. If you are using asynchronous calls with multiple streams, the device code then has the ability to run commands in parallel (if resources are available).

If you look at your image, 200212-1146-47 — ImgBB, and zoom in after the cudaMemPrefetchAsync, you should see all API calls serially. This is expected behaviour.

eee4017 · February 13, 2020, 3:56pm

Hi,

Sorry that I might not state my question clearly. Let me explain it again.

I think both cudaMemPrefetch and KernelLaunch should return immediately when I call it.

I just want to know, why some of cudaMemPrefetch take a longer time to return?

I don’t care how these calls run on the GPU.

In my example, all cudaMemPrefetch should send the same instructions to the device, while some of
them take a longer time to return.

I am working on a much more complex project that this unexpected calling overhead will cause some problems. So if it is possible, I want to ask when would async calls take a longer time to return?

Thanks.

Robert_Crovella · February 13, 2020, 11:33pm

cudaMemPrefetchAsync can have at least 2 modalities under the hood.

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

begin reading starting at:

“There are specific rules on how prefetching interacts with CUDA streams.”

Topic		Replies	Views
How to use cudaMemPrefetchAsync with sync kernels ? CUDA Programming and Performance	0	485	May 17, 2017
Performance Issue using cudaMemPrefetchAsync CUDA Programming and Performance	7	2536	May 24, 2017
is kernel in stream 0 asynchronous? CUDA Programming and Performance	10	3745	April 23, 2011
performance variation when using asynchronous calls CUDA Programming and Performance	1	638	February 11, 2011
Kernel execution blocks CPU code CUDA Programming and Performance	9	3995	September 8, 2009
Async questions Kernels appear to stall host threads CUDA Programming and Performance	3	2279	January 20, 2008
Is cudaMemset actually "asynchronous"? CUDA Programming and Performance	5	8021	January 5, 2016
Kernel Synchronization in CUDA not fully explained in programming guild CUDA Programming and Performance	1	10652	February 25, 2010
asyncAPI sample question CUDA Programming and Performance	9	5074	December 18, 2007
cudaMemcpy during kernel execution asynchronous kernel launch CUDA Programming and Performance	2	3103	July 20, 2007

When would cudaMemPrefetchAsync exhibit synchronous behavior?

Related topics