Memcpy() versus cudaMemCpy()

geohei · August 30, 2021, 3:31pm

Hi.

Out of interest …

cudaMallocManaged((void**)&d_myData, sizeof(myData));
cudaMemcpy(d_myData, myData, sizeof(myData), cudaMemcpyHostToDevice); // (1)
memcpy(d_myData, myData, sizeof(myData)); // (2)

Is there any advantage/disadvantage in using memcpy() (2) i.s.o. cudaMemCpy() (1) to copy data to UM?

Thanks,

Robert_Crovella · August 30, 2021, 4:52pm

My preference would be to use the managed allocation for myData as opposed to anything you are showing here.

Regarding what you have, the allocation corresponding to d_myData does not exist anywhere immediately after the cudaMallocManaged call. So it’s not clear to me why one would want to do a HostToDevice operation here. But I suppose it might possibly depend on what you are doing after this point.

For this particular example, my suggestion would be to use the memcpy realization and then, later, if you need the allocation to be resident on a particular device, use the provided migration function for that (cudaMemPrefetchAsync). I generally try to avoid mixing the cudaMemcpy API with managed allocations.

geohei · August 31, 2021, 1:21pm

Agreed. I changed myData into a managed variable.

struct MyData {
    uint32_t x[512];
	... } myData[];

cudaMallocManaged((void**)&myData, sizeof(MyData) * 1);
myData[] = {
//    { { 0 }, 0x0fd28ea855 },
    { { 0 }, 0x0123456789 }
};

But cudaMallocManaged() needs to know the memory size of myData[] device array, which can only be hard-coded afterwards. So I need to change the 1 in cudaMallocManaged() everytime I change the number of elements in myData[] device array. That makes 2 synchronized modifications for one change in # of array elements.

I tried to avoid this by using a myData[] host array, where the memory size is set automatically - then myData[] host array (memory size is known now) is copied to d_myData[] device array. This avoids 2 code modifications for one change.

But this is more C than CUDA.

Robert_Crovella · August 31, 2021, 2:16pm

Your code is not correct. You are correct this requires understanding of how to program in C.

When using a cuda dynamic allocator (e.g. cudaMalloc, cudaMallocManaged, cudaHostAlloc, etc.) the first argument should be the address of a pointer variable and nothing else. This is evident from the function prototype in the reference manual.

&myData does not fit that description. You do not take the address of a static host allocation, and use that as the first argument of a cuda allocator. In effect you are taking a struct allocated in host memory, and reusing the first 8 bytes of that struct as storage for a pointer. That isn’t right.

Beyond that, it would be important to know what you objective is. Let’s start out with storage. Do you want storage on the device for:

an array of uint32_t that has a length that may be different than 512?
an array of structs, where each struct has an internal array of uint32_t that is exactly 512 elements long?
something else?

geohei · September 2, 2021, 1:19pm

I simplified the code above (too much) and omitted the bracket pair behind the myData (after the struct definition). I corrected. It’s a pointer. The code as such works.

Let’s start again. I’d like to make struct myData[] (array of structs), available to host and device.

struct MyData {
    uint32_t x[512];
	uint64_t y
};

Usually (host only), you define a variable and allocate memory by initializing.

The host-only version:

struct MyData { // declare
    uint32_t x[512];
	uint64_t y
};
struct MyData myData[] = { // initialize
    { { 0 }, 0xffffffffffffffff }
};

For device code using cudaMallocManaged(), this is not possible since memory allocation initialization cannot be done in one step using the initialization syntax above.

That’s why I used the host array myData[] and memcpy() to first create the host variable, then transfer the data to the device variable d_myData[].

I guess this is probably not the best way to do it.

Robert_Crovella · September 2, 2021, 1:39pm

in C or C++, the things you are showing here are referred to as static allocations. They don’t depend on a dynamic allocator such as malloc or new in order to provide storage space.

Managed memory has both static and dynamic allocation methods. cudaMallocManaged is the dynamic allocator, whereas __managed__ is the decorator used to provide a static allocation (but still managed).

To adhere most closely syntactically to what you have shown, the natural choice is to use a static (managed) allocation scheme:

$ cat t1884.cu
#include <cstdio>
#include <cstdint>
struct MyData { // declare
    uint32_t x[512];
        uint64_t y;
};
__managed__ struct MyData myData[] = { // initialize
    { { 0 }, 0xffffffffffffffff }
};

__global__ void k(struct MyData *d){

  printf("%u, %lu\n", d[0].x[0], d[0].y);
}

int main(){
  printf("from host code:\n");
  printf("%u, %lu\n", myData[0].x[0], myData[0].y);
  printf("from device code:\n");
  k<<<1,1>>>(myData);
  cudaDeviceSynchronize();
}

$ nvcc -o t1884 t1884.cu
$ compute-sanitizer ./t1884
========= COMPUTE-SANITIZER
from host code:
0, 18446744073709551615
from device code:
0, 18446744073709551615
========= ERROR SUMMARY: 0 errors
$

Of course, we could realize something similar using cudaMallocManaged, but it will look a bit different, just as your host code would look different if you used a dynamic allocator to provide storage for myData.

One of many possible limitations of using the __managed__ decorator is that variables of that type must live at global scope. If that or any other limitation is problematic, then the usual advice is to switch to a dynamic allocation scheme using cudaMallocManaged. There are many possible realizations here, but it could look like this:

$ cat t1884.cu
#include <cstdio>
#include <cstdint>
struct MyData { // declare
    uint32_t x[512];
        uint64_t y;
};

__global__ void k(struct MyData *d){

  printf("%u, %lu\n", d[0].x[0], d[0].y);
}

int main(){
  int array_size = 1;
  struct MyData *myData;
  cudaMallocManaged(&myData, array_size*sizeof(MyData));
  myData[0] =  { { 0 }, 0xffffffffffffffff };
  printf("from host code:\n");
  printf("%u, %lu\n", myData[0].x[0], myData[0].y);
  printf("from device code:\n");
  k<<<1,1>>>(myData);
  cudaDeviceSynchronize();
}

$ nvcc -o t1884 t1884.cu
$ compute-sanitizer ./t1884
========= COMPUTE-SANITIZER
from host code:
0, 18446744073709551615
from device code:
0, 18446744073709551615
========= ERROR SUMMARY: 0 errors
$

geohei · September 29, 2021, 4:41pm

Many thanks.
I missed the __manage__ decorator so far. Works exactly as desired.
Your second code block (using cudaMallocManaged()) is closest to my initial code I used.
Both ways, it works!
Well explained!

system · April 3, 2024, 6:42am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
CudaMallocManaged and CudaMemcpy CUDA Programming and Performance	2	18821	May 29, 2016
a question about cudaMallocManaged（） CUDA Programming and Performance	4	630	November 17, 2018
Difference between cudaMallocManaged and cudaMallocHost CUDA Programming and Performance cuda	3	16112	March 30, 2022
Function abnormal when cudaMallocManaged without memcpy CUDA NVCC Compiler	0	107	June 11, 2024
Managed memory vs cudaHostAlloc - TK1 Jetson TK1	6	2130	February 15, 2016
Memcopy between two cuda managed addresses much faster than between cuda managed and not managed CUDA Programming and Performance	3	107	March 22, 2025
''cudaMemcpy'' failed to copy from device memory dynamically allocate using ''malloc'' CUDA Programming and Performance	5	567	October 25, 2022
CUDA kernel 10x slower when operating on cudaMallocManaged memory even when prefetched CUDA Programming and Performance	2	447	March 24, 2023
Managed memory vs cudaHostAlloc - TK1 CUDA Programming and Performance	10	6286	February 22, 2016
uncached memory created by cudaHostAlloc and cudaMemcpyAsync issues on TX1 Jetson TX1	3	1823	July 15, 2016

Memcpy() versus cudaMemCpy()

Related topics