Memcpy between two CUDA "managed" addresses is much faster than between CUDA managed and non-managed memory

Sorry for the super basic question, but I had a C++ application where I performed a memcpy between memory allocated with cudaMallocHost and a regular new'ed array, and I noticed it took a lot longer than a memcpy between two memory locations that were both allocated with cudaMallocHost.

I then ran some experiments and found that if I simply do the memcpy a second time between the cudaMallocHost and normal new'ed memory locations, it is basically as fast as the copy between the two cudaMallocHost locations.

I also ran an experiment where I tried to page-lock the regular new'ed memory location (with mlock), and that didn't seem to help.

Do you folks know what would theoretically be causing this?

Note: This is being run on google colab.

%%writefile memcopytest.cu

#include <iostream>
#include <chrono>
#include <cstring>      // memcpy
#include "cuda_fp16.h"
#include <sys/mman.h>


using namespace std;
using namespace std::chrono;



#define TWO_GIGS (2U*1024U*1024U*1024U)

int main()
{
  size_t numOfFloats = TWO_GIGS / sizeof(float);

  // Regular pageable host allocation
  float *myValues = new float[numOfFloats];

  // Pinned (page-locked) host allocations
  float *myValues2 = nullptr;
  float *myValues3 = nullptr;
  cudaMallocHost(&myValues2, sizeof(float) * numOfFloats);
  cudaMallocHost(&myValues3, sizeof(float) * numOfFloats);

  /////////////////////////////////////////////////////
  // Time a memcpy between the two pinned arrays
  auto start = high_resolution_clock::now();
  memcpy(myValues2, myValues3, sizeof(float) * numOfFloats);
  auto end = high_resolution_clock::now();

  auto duration = duration_cast<microseconds>(end - start);
  cout << "Managed Memory Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////


  //////////////////////////////////////////////////////
  // Time a memcpy from a pinned array into the regular new'ed array
  start = high_resolution_clock::now();
  memcpy(myValues, myValues3, sizeof(float) * numOfFloats);
  end = high_resolution_clock::now();

  duration = duration_cast<microseconds>(end - start);
  cout << "Regular Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////


  //////////////////////////////////////////////////////
  // Repeat the same pinned-to-regular copy a second time
  start = high_resolution_clock::now();
  memcpy(myValues, myValues3, sizeof(float) * numOfFloats);
  end = high_resolution_clock::now();

  duration = duration_cast<microseconds>(end - start);
  cout << "Second Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////

  return 0;
}


Here is the code for the page-lock attempt.

%%writefile memcopytest.cu

#include <iostream>
#include <chrono>
#include <cstring>      // memcpy
#include "cuda_fp16.h"
#include <sys/mman.h>


using namespace std;
using namespace std::chrono;



#define TWO_GIGS (2U*1024U*1024U*1024U)

int main()
{
  size_t numOfFloats = TWO_GIGS / sizeof(float);

  // Regular pageable host allocation
  float *myValues = new float[numOfFloats];

  // Pinned (page-locked) host allocations
  float *myValues2 = nullptr;
  float *myValues3 = nullptr;
  cudaMallocHost(&myValues2, sizeof(float) * numOfFloats);
  cudaMallocHost(&myValues3, sizeof(float) * numOfFloats);

  /////////////////////////////////////////////////////
  // Time a memcpy between the two pinned arrays
  auto start = high_resolution_clock::now();
  memcpy(myValues2, myValues3, sizeof(float) * numOfFloats);
  auto end = high_resolution_clock::now();

  auto duration = duration_cast<microseconds>(end - start);
  cout << "Managed Memory Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////


  //////////////////////////////////////////////////////
  // Try mlock to see if page-locking the new'ed memory speeds up memcpy.
  // Note: the return value is not checked here; mlock can fail (e.g. if the
  // size exceeds RLIMIT_MEMLOCK), in which case the memory is not actually locked.
  mlock(myValues, sizeof(float) * numOfFloats);

  // Time a memcpy from a pinned array into the (hopefully locked) new'ed array
  start = high_resolution_clock::now();
  memcpy(myValues, myValues3, sizeof(float) * numOfFloats);
  end = high_resolution_clock::now();

  duration = duration_cast<microseconds>(end - start);
  cout << "Memlock Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////


  //////////////////////////////////////////////////////
  // Repeat the same pinned-to-regular copy a second time
  start = high_resolution_clock::now();
  memcpy(myValues, myValues3, sizeof(float) * numOfFloats);
  end = high_resolution_clock::now();

  duration = duration_cast<microseconds>(end - start);
  cout << "Regular Memcpy Duration: " << duration.count() / 1000 << " milliseconds" << endl;
  //////////////////////////////////////////////////////

  return 0;
}

Results for test 1:

==12191== NVPROF is profiling process 12191, command: ./memcopytest
Managed Memory Memcpy Duration: 406 milliseconds
Regular Memcpy Duration: 1410 milliseconds
Second Memcpy Duration: 401 milliseconds
==12191== Profiling application: ./memcopytest
==12191== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   99.99%  2.16435s         2  1.08218s  1.00506s  1.15929s  cudaHostAlloc
                    0.01%  166.68us       114  1.4620us     102ns  65.721us  cuDeviceGetAttribute
                    0.00%  10.923us         1  10.923us  10.923us  10.923us  cuDeviceGetName
                    0.00%  5.2240us         1  5.2240us  5.2240us  5.2240us  cuDeviceGetPCIBusId
                    0.00%  1.2260us         3     408ns     145ns     885ns  cuDeviceGetCount
                    0.00%     716ns         2     358ns     116ns     600ns  cuDeviceGet
                    0.00%     412ns         1     412ns     412ns     412ns  cuModuleGetLoadingMode
                    0.00%     397ns         1     397ns     397ns     397ns  cuDeviceTotalMem
                    0.00%     217ns         1     217ns     217ns     217ns  cuDeviceGetUuid

Results for test 2:

==11924== NVPROF is profiling process 11924, command: ./memcopytest
Managed Memory Memcpy Duration: 401 milliseconds
Memlock Memcpy Duration: 1413 milliseconds
Regular Memcpy Duration: 405 milliseconds
==11924== Profiling application: ./memcopytest
==11924== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   99.99%  2.15809s         2  1.07905s  999.56ms  1.15853s  cudaHostAlloc
                    0.01%  138.46us       114  1.2140us     103ns  61.015us  cuDeviceGetAttribute
                    0.00%  11.250us         1  11.250us  11.250us  11.250us  cuDeviceGetName
                    0.00%  5.3160us         1  5.3160us  5.3160us  5.3160us  cuDeviceGetPCIBusId
                    0.00%  1.6700us         3     556ns     149ns  1.3090us  cuDeviceGetCount
                    0.00%     792ns         2     396ns     149ns     643ns  cuDeviceGet
                    0.00%     566ns         1     566ns     566ns     566ns  cuDeviceTotalMem
                    0.00%     480ns         1     480ns     480ns     480ns  cuModuleGetLoadingMode
                    0.00%     248ns         1     248ns     248ns     248ns  cuDeviceGetUuid

Thanks for the help!!

When I run your first code example on an L40 GPU with CUDA 12.2, I get:

Managed Memory Memcpy Duration: 324 milliseconds
Regular Memcpy Duration: 1637 milliseconds
Second Memcpy Duration: 501 milliseconds

When you do new or malloc() in host code, modern operating systems may do lazy allocation. As a result, the first time you "touch" (actually read from or write to) that allocated space, the OS has additional work to do. This slows down the first access to those locations. The additional work here may not be once per allocation, but rather once per page, in a demand-paged virtual memory system typical of modern OSes.
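
To illustrate the once-per-page behavior (this is just a sketch, not your test code; the names, the 2 GiB size, and the 4 KiB page size are placeholders), touching one byte per page of a freshly new'ed buffer should incur most of the first-access cost, while a second pass over the same, now-resident pages is much faster:

#include <iostream>
#include <chrono>

int main()
{
  const size_t bytes = 2ULL * 1024 * 1024 * 1024;   // 2 GiB, as in the test above
  const size_t pageSize = 4096;                     // typical x86-64 page size

  char *buf = new char[bytes];

  // One write per page: enough to trigger a page fault on a not-yet-resident page.
  auto touchEveryPage = [&]() {
    for (size_t i = 0; i < bytes; i += pageSize)
      buf[i] = 1;
  };

  auto timeIt = [&](const char *label) {
    auto start = std::chrono::high_resolution_clock::now();
    touchEveryPage();
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << label << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
  };

  timeIt("First touch (pages faulted in)");
  timeIt("Second touch (pages already resident)");

  delete[] buf;
  return 0;
}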

I think this is probably the explanation for the bulk of the difference between the first two lines of the output, and why the third line is much closer to the first. I probably won't be able to explain the difference between the first and third lines. It's possible that something about the caching or paging characteristics of a pinned allocation makes it slightly faster to access than "ordinary" host memory.

It seems evident that the process of creating a pinned allocation must "force" the OS to complete the allocation steps. That is more or less the definition of pinned memory: pages that are always physically instantiated.
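
Related to your mlock attempt: if you want to pin an existing new'ed allocation through CUDA rather than mlock, cudaHostRegister is the relevant API, and registering the range also forces the pages to be physically instantiated. A minimal sketch (names and the 2 GiB size are just illustrative, error checking omitted):

#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  const size_t numFloats = (2ULL * 1024 * 1024 * 1024) / sizeof(float);
  float *buf = new float[numFloats];

  // Pin the existing allocation; this also forces its pages to be physically instantiated.
  cudaHostRegister(buf, sizeof(float) * numFloats, cudaHostRegisterDefault);

  // ... run the timed memcpy against buf here ...

  cudaHostUnregister(buf);
  delete[] buf;
  return 0;
}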

As an aside, CUDA has a fairly specific definition of what "managed" memory means (what cudaMallocManaged returns), and memory from cudaMallocHost isn't it; it is pinned (page-locked) host memory. So it may be confusing to some readers to describe it that way.

You, my friend, are a legend!! Thanks so much for the quick answer.

Yeah, lazy allocation does appear to be the cause. I just added a memset and, sure enough, the times are now similar. Thanks!!
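
For reference, the change was just something along these lines, inserted before the first timed copy involving myValues (paraphrasing my edit slightly):

  // Touch every page of the new'ed allocation once so it is already
  // instantiated before the timed memcpy.
  memset(myValues, 0, sizeof(float) * numOfFloats);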

I owe you a coffee if we ever meet IRL :D
