Passing libcu++'s arrays between the host and the device

Hi, I’m new using CUDA and I’ve got a hard time understanding how to use cuda::std::array which is supposed to handle being passed between the host and the device. Surprisingly, I found close to no information on the web, no one seems to be using libcu++'s arrays.

Here is a MWE where I create an array of two floats x that I initialise with the value 1.0f. I then pass this array to a kernel assigning the value 2.0f to the first element of the array.

#include <iostream>
#include <algorithm>
#include <cuda/std/array>

__global__ 
void foo(cuda::std::array<float, 2> x)
{
    x[0] = 2.0f;
}

int main(void)
{
    cuda::std::array<float, 2> x;
    // Initialises the array with the value 1.0f
    std::transform(x.begin(), x.end(), x.begin(), [](...) { return 1.0f; });

    // Displays the array
    std::cout << "Before: ";
    std::for_each(x.begin(), x.end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    // Calls the kernel with the arrray
    foo<<<1,1>>>(x);

    // Waits for synchronisation
    cudaDeviceSynchronize();

    // Displays the array again
    std::cout << "After:  ";
    std::for_each(x.begin(), x.end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";
}

Unfortunately, the resulting array seems to be unchanged according to the output I get:

Before: 1 1
After:  1 1

I made sure to check any CUDA error returned by cudaDeviceSynchronize() but there is none.

Also, I’m relatively new to C++ and I checked the array’s implementation in order to get a better understanding of how the memory is managed. But I found that it just used the implicitly-defined constructor for aggregate types. Where is the Unified Memory involved? There must be some call to cudaMallocManaged somewhere, right?

Note that I have a GeForce 1070Ti whose compute capability is 6.1 and I use CUDA 12.5.

Edit: I think that the kernel is, in fact, not even run. I tried putting invalid statements like accessing invalid x locations but cudaDeviceSynchronize() still does not return any error.

You did pass the array correctly between host and device and you were able to use the array in the kernel. So the array works as expected.

However, cuda::std::array does not allocate heap storage for its elements. It is a fixed-size stack array just like its host-only equivalent std::array. If that is passed to the kernel, a copy is created, i.e. the kernel modifies a copy of the array and leaves the original untouched. (Your kernel is probably optimized away to a no-op because it does not change observable state).

You could, of course, allocate managed memory yourself and put an array inside, see code below.


#include <iostream>
#include <algorithm>
#include <cuda/std/array>
#include <cassert>

using MyArray = cuda::std::array<float, 2>;

__global__
void foo(MyArray* x)
{
    (*x)[0] = 2.0f;
}

int main(void)
{
    cudaError_t status = cudaSuccess;
    MyArray* x_managed;

    //allocate memory accessible by both host and device which is large enough to store MyArray
    status = cudaMallocManaged(&x_managed, sizeof(MyArray));
    assert(status == cudaSuccess);

    //construct an instance of MyArray in that memory
    new (x_managed) MyArray();

    // Initialises the array with the value 1.0f
    std::transform(x_managed->begin(), x_managed->end(), x_managed->begin(), [](...) { return 1.0f; });

    // Displays the array
    std::cout << "Before: ";
    std::for_each(x_managed->begin(), x_managed->end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    // Calls the kernel with the arrray
    foo<<<1,1>>>(x_managed);
    status = cudaGetLastError();
    assert(status == cudaSuccess);

    // Waits for synchronisation
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);

    // Displays the array again
    std::cout << "After:  ";
    std::for_each(x_managed->begin(), x_managed->end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    //destroy the MyArray instance
    x_managed->~MyArray();

    //free the memory
    status = cudaFree(x_managed);
    assert(status == cudaSuccess);
}


Before: 1 1 
After:  2 1 

Thank you! It was indeed a pretty basic misunderstanding.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.