Memcpy() versus cudaMemCpy()

Hi.

Out of interest …

cudaMallocManaged((void**)&d_myData, sizeof(myData));
cudaMemcpy(d_myData, myData, sizeof(myData), cudaMemcpyHostToDevice); // (1)
memcpy(d_myData, myData, sizeof(myData)); // (2)

Is there any advantage/disadvantage in using memcpy() (2) i.s.o. cudaMemCpy() (1) to copy data to UM?

Thanks,

My preference would be to use the managed allocation for myData as opposed to anything you are showing here.

Regarding what you have, the allocation corresponding to d_myData does not exist anywhere immediately after the cudaMallocManaged call. So it’s not clear to me why one would want to do a HostToDevice operation here. But I suppose it might possibly depend on what you are doing after this point.

For this particular example, my suggestion would be to use the memcpy realization and then, later, if you need the allocation to be resident on a particular device, use the provided migration function for that (cudaMemPrefetchAsync). I generally try to avoid mixing the cudaMemcpy API with managed allocations.

1 Like

Agreed. I changed myData into a managed variable.

struct MyData {
    uint32_t x[512];
	... } myData[];

cudaMallocManaged((void**)&myData, sizeof(MyData) * 1);
myData[] = {
//    { { 0 }, 0x0fd28ea855 },
    { { 0 }, 0x0123456789 }
};

But cudaMallocManaged() needs to know the memory size of myData[] device array, which can only be hard-coded afterwards. So I need to change the 1 in cudaMallocManaged() everytime I change the number of elements in myData[] device array. That makes 2 synchronized modifications for one change in # of array elements.

I tried to avoid this by using a myData[] host array, where the memory size is set automatically - then myData[] host array (memory size is known now) is copied to d_myData[] device array. This avoids 2 code modifications for one change.

But this is more C than CUDA.

Your code is not correct. You are correct this requires understanding of how to program in C.

When using a cuda dynamic allocator (e.g. cudaMalloc, cudaMallocManaged, cudaHostAlloc, etc.) the first argument should be the address of a pointer variable and nothing else. This is evident from the function prototype in the reference manual.

&myData does not fit that description. You do not take the address of a static host allocation, and use that as the first argument of a cuda allocator. In effect you are taking a struct allocated in host memory, and reusing the first 8 bytes of that struct as storage for a pointer. That isn’t right.

Beyond that, it would be important to know what you objective is. Let’s start out with storage. Do you want storage on the device for:

  1. an array of uint32_t that has a length that may be different than 512?
  2. an array of structs, where each struct has an internal array of uint32_t that is exactly 512 elements long?
  3. something else?

I simplified the code above (too much) and omitted the bracket pair behind the myData (after the struct definition). I corrected. It’s a pointer. The code as such works.

Let’s start again. I’d like to make struct myData[] (array of structs), available to host and device.

struct MyData {
    uint32_t x[512];
	uint64_t y
};

Usually (host only), you define a variable and allocate memory by initializing.

The host-only version:

struct MyData { // declare
    uint32_t x[512];
	uint64_t y
};
struct MyData myData[] = { // initialize
    { { 0 }, 0xffffffffffffffff }
};

For device code using cudaMallocManaged(), this is not possible since memory allocation initialization cannot be done in one step using the initialization syntax above.

That’s why I used the host array myData[] and memcpy() to first create the host variable, then transfer the data to the device variable d_myData[].

I guess this is probably not the best way to do it.

in C or C++, the things you are showing here are referred to as static allocations. They don’t depend on a dynamic allocator such as malloc or new in order to provide storage space.

Managed memory has both static and dynamic allocation methods. cudaMallocManaged is the dynamic allocator, whereas __managed__ is the decorator used to provide a static allocation (but still managed).

To adhere most closely syntactically to what you have shown, the natural choice is to use a static (managed) allocation scheme:

$ cat t1884.cu
#include <cstdio>
#include <cstdint>
struct MyData { // declare
    uint32_t x[512];
        uint64_t y;
};
__managed__ struct MyData myData[] = { // initialize
    { { 0 }, 0xffffffffffffffff }
};

__global__ void k(struct MyData *d){

  printf("%u, %lu\n", d[0].x[0], d[0].y);
}

int main(){
  printf("from host code:\n");
  printf("%u, %lu\n", myData[0].x[0], myData[0].y);
  printf("from device code:\n");
  k<<<1,1>>>(myData);
  cudaDeviceSynchronize();
}

$ nvcc -o t1884 t1884.cu
$ compute-sanitizer ./t1884
========= COMPUTE-SANITIZER
from host code:
0, 18446744073709551615
from device code:
0, 18446744073709551615
========= ERROR SUMMARY: 0 errors
$

Of course, we could realize something similar using cudaMallocManaged, but it will look a bit different, just as your host code would look different if you used a dynamic allocator to provide storage for myData.

One of many possible limitations of using the __managed__ decorator is that variables of that type must live at global scope. If that or any other limitation is problematic, then the usual advice is to switch to a dynamic allocation scheme using cudaMallocManaged. There are many possible realizations here, but it could look like this:

$ cat t1884.cu
#include <cstdio>
#include <cstdint>
struct MyData { // declare
    uint32_t x[512];
        uint64_t y;
};

__global__ void k(struct MyData *d){

  printf("%u, %lu\n", d[0].x[0], d[0].y);
}

int main(){
  int array_size = 1;
  struct MyData *myData;
  cudaMallocManaged(&myData, array_size*sizeof(MyData));
  myData[0] =  { { 0 }, 0xffffffffffffffff };
  printf("from host code:\n");
  printf("%u, %lu\n", myData[0].x[0], myData[0].y);
  printf("from device code:\n");
  k<<<1,1>>>(myData);
  cudaDeviceSynchronize();
}

$ nvcc -o t1884 t1884.cu
$ compute-sanitizer ./t1884
========= COMPUTE-SANITIZER
from host code:
0, 18446744073709551615
from device code:
0, 18446744073709551615
========= ERROR SUMMARY: 0 errors
$
1 Like