Kernel lunch overhead increases significantly (10x) when using unified memory on TK1 and TX1

Hi,

I have a task to modify cuda based image processing code to use unified memory on both TK1 and TX2. I started by locating all buffers that are required to be copied to the GPU memory using cudaMemcpy and changed that to bufferes allocated using cudaMallocmanaged. I made the required changes, and the code is running correctly. The issue is that it is running slower on both platforms. The profile (obtained by nvprof) on both shows that the problem is the kernel lunch api (cudaLaunch). Its execution time increased significantly (from 709 ms for all kernels to 10.9366 s on TX2). I read in some documents that in older GPUs, all data has to be migrated before kernel launch even if it is not used. However, in TX2 and TK1, both the GPU and CPU are using the same physical memory and it is expected that this migration, if required, will not have that significant overhead.

Can any help explaining what is the difference from the run-time prospective, between launching the same kernel on data allocated using cudaMallocManaged and cudaMalloc?

Thanks

Hi,

This blog can give you more information:
https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

As you know, the complicated page fault mechanism of unified memory may introduce some overhead.
But the value of unified memory is to automatically sync-up data and release user from the data-copying.

Thanks.

Hi

Thank you very much for your answer. The blog page you gave me is helpful.

The issue is that it describes techniques for the general CPU/GPU architecture, where each processor has its own physical memory.

In Jetson TK1, as I understand it, the CPU and GPU physically shares the memory. So the zero copy here can be achieved with minimal overhead. The issue is that is not the case, and the overhead is high.

The other issue is that pre-fetching is not supported on TK1 and so the techniques presented in the blog page will not work.

So to make my question clear, is the memory on TK1 and TX2 physically unified? and if so, why there is a significant overhead when migrating from the CPU to the GPU and back?

Thanks in advance

Hi,

On Jetson, the physical memory is shared for the whole system, including CPU and GPU.

There are three possible mechanism mentioned here:
1. cudaMalloc()
A typical malloc function for GPU memory
2. cudaHostAlloc() with cudaHostAllocMapped
This allocates pinned CPU memory and it is accessible for GPU(since it won’t be swap-out)
3. cudaMallocmanaged()
A unified memory can be access by both CPU and GPU.

There is two issue you should know:

A. Zero-copy (2. and 3.) and separate memory (1.)
This is trade-off.

Zero-copy memory release you from memory copy but cost you some overhead when launching.
The different comes from the access time of the allocated memory.
cudaMalloc always assign the data to the best access place.

But cudaMalloc will require you to copy memory each time. (and you will need to handle this)
If you have a large data(ex. image) or the data is frequently used by CPU/GPU, you may pay a lot of time in data transmission.

B. Different between pinned memory and unified memory
Usually, unified memory should have better performance.
Here is the explanation from our document
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-introduction
Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Thanks.

Hi

Thank you for you explanation. However,it is clear in our case that the overhead of unified memory is significant on TK1 and TX2. It is even more significant the memory copy itself. Consider the code example below. In this example I’m using the 3 different mechanisms to compare them. Depending on the value of the variable mode, I select which mechanism to use, with some variations to the managed mode.

Mode = 0: Managed with CPU initialization
Mode = 1: Memory copy
Mode = 2: Pinned (zero copy)
Mode = 4: Managed with GPU initialization
Mode = 5: Managed with pre-fetching and memory advising

It is always Mode 1 gives the best performance, in all platforms. Because in my example, the initialisation is performed before the kernel actually accesses the data, in mode 0, the kernel launch takes significantly longer time compared to mode 1 and even mode 2. When, I perform the initialisation on the GPU (mode 4), the kernel launch time decreases to become close to mode 1. Mode 2 always slower because of the cache issue. Mode 5 didn’t really help, it seems I’m not using the pre-fetching correctly.

I can not attach photos to this thread to show the profiling result on TX2 which will explain more my point.

I wish you can help me explain what I’m missing in this code example to improve the performance of the managed memory. But just using nvprof to get total execution time for each API, here is sum for cudaLaunch:

Mode 0: 120.17 ms
Mode 1: 11 ms
Mode 2: 52 ms

#include <iostream>
#include <math.h>
 
// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// CUDA kernel to add elements of two arrays
__global__
void initialize(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
  {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}

int DoAddManaged (int mode)
{
  int N = 1<<20;
  float *x, *y;
 
  // Allocate Unified Memory -- accessible from CPU or GPU
  
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  
 #ifndef TK1 
  if (mode==5)
  {
    cudaMemAdvise(x, N*sizeof(float),cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId  );
    cudaMemAdvise(y, N*sizeof(float),cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId  );
  
    cudaMemPrefetchAsync (x,N*sizeof(float),cudaCpuDeviceId);
    cudaMemPrefetchAsync (y,N*sizeof(float),cudaCpuDeviceId);
  }
#endif
  
  //cudaMemPrefetchAsync (x,N/2,0);
  //cudaMemPrefetchAsync (y,N/2,0);
  
  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // Calculate the block size and number of blocks
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // initialize x and y arrays on the host
  
  if (mode==4)
  {
    initialize<<<numBlocks, blockSize>>>(N, x, y);

  }
  else
  {
    for (int i = 0; i < N; i++) 
    {
      x[i] = 1.0f;
      y[i] = 2.0f;
    }
  }

  #ifndef TK1
  if (mode==5)
  {
    cudaMemPrefetchAsync (x,N*sizeof(float),0);
    cudaMemPrefetchAsync (y,N*sizeof(float),0); 
  }
  #endif

// Launch kernel on 1M elements on the GPU
  add<<<numBlocks, blockSize>>>(N, x, y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
 
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
  //if (mode==4)
  //add<<<numBlocks, blockSize>>>(N, x, y);

  // Free memory
  cudaFree(x);
  cudaFree(y);
 
  return maxError;
}

int DoAddCopy()
{
  int N = 1<<20;
  
  // Allocate host memory
  float *x = new float[N], *y = new float[N];
 
  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // Device arrays
  float *d_x, *d_y;

// Allocate device memory, only accessable from the GPU
  cudaMalloc((void **) &d_x,  N*sizeof(float));
  cudaMalloc((void **) &d_y,  N*sizeof(float));

  //std::cout << "Device X pointer = " << std::hex << d_x << " Device Y pointer = " << std::hex << d_y << std::endl;
 
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy array contents of input from the host (CPU) to the device (GPU)
  cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice); 

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, d_x, d_y);
 
  // Wait for GPU to finish before accessing on host
  //cudaDeviceSynchronize(); // May be I don't need it here
 
  // Copy result back
  cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFree(d_x);
  cudaFree(d_y);
 
  delete x;
  delete y;
  return maxError;
}

int DoAddZeroCopy()
{
 int N = 1<<20;
  
  // Declare empty pointers
  float *x = NULL, *y = NULL;

  // Allocate host memory
  cudaHostAlloc((void **)&x, N*sizeof(float), cudaHostAllocMapped);
  cudaHostAlloc((void **)&y, N*sizeof(float), cudaHostAllocMapped);

  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Device pointers
  float *d_x, *d_y;
 
  // Get device pointer from host memory. No allocation or memcpy
  cudaHostGetDevicePointer((void **)&d_x, (void *) x , 0);
  cudaHostGetDevicePointer((void **)&d_y, (void *) y, 0); 
  
  //std::cout << "Device X pointer = " << std::hex << d_x << " Device Y pointer = " << std::hex << d_y << std::endl;

  // no need to memory copy

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, d_x, d_y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize(); // May be I don't need it here
 
  // No need to copy back

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFreeHost(x);
  cudaFreeHost(y);

return maxError;
}
 
int main(int argc, char **argv)
{ 
  int mode = 0;
  int loop = 10;
  int i = 0;
  int rval = 0;
  if (argc==3)
  {
	mode = atoi(argv[1]);
	loop = atoi(argv[2]);
  }

  int device_count=0;

  cudaGetDeviceCount	(&device_count); 	

std::cout << "Mode = " << mode << " , Loop = " << loop << std::endl;
  std::cout << "The number of cuda devices = " << device_count << std::endl;

  switch(mode)
  {
    case 0:
    case 4:	
	for (i=0;i<loop;i++)#include <iostream>
#include <math.h>
 
// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// CUDA kernel to add elements of two arrays
__global__
void initialize(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
  {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}

int DoAddManaged (int mode)
{
  int N = 1<<20;
  float *x, *y;
 
  // Allocate Unified Memory -- accessible from CPU or GPU
  
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  
 #ifndef TK1 
  if (mode==5)
  {
    cudaMemAdvise(x, N*sizeof(float),cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId  );
    cudaMemAdvise(y, N*sizeof(float),cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId  );
  
    cudaMemPrefetchAsync (x,N*sizeof(float),cudaCpuDeviceId);
    cudaMemPrefetchAsync (y,N*sizeof(float),cudaCpuDeviceId);
  }
#endif
  
  //cudaMemPrefetchAsync (x,N/2,0);
  //cudaMemPrefetchAsync (y,N/2,0);
  
  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // Calculate the block size and number of blocks
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // initialize x and y arrays on the host
  
  if (mode==4)
  {
    initialize<<<numBlocks, blockSize>>>(N, x, y);

  }
  else
  {
    for (int i = 0; i < N; i++) 
    {
      x[i] = 1.0f;
      y[i] = 2.0f;
    }
  }

  #ifndef TK1
  if (mode==5)
  {
    cudaMemPrefetchAsync (x,N*sizeof(float),0);
    cudaMemPrefetchAsync (y,N*sizeof(float),0); 
  }
  #endif

// Launch kernel on 1M elements on the GPU
  add<<<numBlocks, blockSize>>>(N, x, y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
 
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
  //if (mode==4)
  //add<<<numBlocks, blockSize>>>(N, x, y);

  // Free memory
  cudaFree(x);
  cudaFree(y);
 
  return maxError;
}

int DoAddCopy()
{
  int N = 1<<20;
  
  // Allocate host memory
  float *x = new float[N], *y = new float[N];
 
  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // Device arrays
  float *d_x, *d_y;

// Allocate device memory, only accessable from the GPU
  cudaMalloc((void **) &d_x,  N*sizeof(float));
  cudaMalloc((void **) &d_y,  N*sizeof(float));

  //std::cout << "Device X pointer = " << std::hex << d_x << " Device Y pointer = " << std::hex << d_y << std::endl;
 
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy array contents of input from the host (CPU) to the device (GPU)
  cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice); 

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, d_x, d_y);
 
  // Wait for GPU to finish before accessing on host
  //cudaDeviceSynchronize(); // May be I don't need it here
 
  // Copy result back
  cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFree(d_x);
  cudaFree(d_y);
 
  delete x;
  delete y;
  return maxError;
}

int DoAddZeroCopy()
{
 int N = 1<<20;
  
  // Declare empty pointers
  float *x = NULL, *y = NULL;

  // Allocate host memory
  cudaHostAlloc((void **)&x, N*sizeof(float), cudaHostAllocMapped);
  cudaHostAlloc((void **)&y, N*sizeof(float), cudaHostAllocMapped);

  //std::cout << "X pointer = " << std::hex << x << " Y pointer = " << std::hex << y << std::endl;

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Device pointers
  float *d_x, *d_y;
 
  // Get device pointer from host memory. No allocation or memcpy
  cudaHostGetDevicePointer((void **)&d_x, (void *) x , 0);
  cudaHostGetDevicePointer((void **)&d_y, (void *) y, 0); 
  
  //std::cout << "Device X pointer = " << std::hex << d_x << " Device Y pointer = " << std::hex << d_y << std::endl;

  // no need to memory copy

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, d_x, d_y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize(); // May be I don't need it here
 
  // No need to copy back

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  //std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFreeHost(x);
  cudaFreeHost(y);

return maxError;
}
 
int main(int argc, char **argv)
{ 
  int mode = 0;
  int loop = 10;
  int i = 0;
  int rval = 0;
  if (argc==3)
  {
	mode = atoi(argv[1]);
	loop = atoi(argv[2]);
  }

  int device_count=0;

  cudaGetDeviceCount	(&device_count); 	

std::cout << "Mode = " << mode << " , Loop = " << loop << std::endl;
  std::cout << "The number of cuda devices = " << device_count << std::endl;

  switch(mode)
  {
    case 0:
    case 4:	
	for (i=0;i<loop;i++)
	 rval+=DoAddManaged (mode);
	break;
    case 1:
	for (i=0;i<loop;i++)
	 rval+=DoAddCopy();
	break;
    case 2:
	for (i=0;i<loop;i++)
	  rval+=DoAddZeroCopy();
	break;

    default:
	for (i=0;i<loop;i++)
	  rval+=DoAddManaged(mode);
	break;
  }

  return rval;
}

	 rval+=DoAddManaged (mode);
	break;
    case 1:
	for (i=0;i<loop;i++)
	 rval+=DoAddCopy();
	break;
    case 2:
	for (i=0;i<loop;i++)
	  rval+=DoAddZeroCopy();
	break;

    default:
	for (i=0;i<loop;i++)
	  rval+=DoAddManaged(mode);
	break;
  }

  return rval;
}

Hi,

Your CUDA kernel read/write data frequently.
The bottleneck may come from the access time of the different memory type.

Could you manage a kernel code with more calculation?
It can help us know more about the bottleneck from.

Thanks.