Passing libcu++'s arrays between the host and the device

antoine59 · June 12, 2024, 4:25pm

Hi, I’m new using CUDA and I’ve got a hard time understanding how to use cuda::std::array which is supposed to handle being passed between the host and the device. Surprisingly, I found close to no information on the web, no one seems to be using libcu++'s arrays.

Here is a MWE where I create an array of two floats x that I initialise with the value 1.0f. I then pass this array to a kernel assigning the value 2.0f to the first element of the array.

#include <iostream>
#include <algorithm>
#include <cuda/std/array>

__global__ 
void foo(cuda::std::array<float, 2> x)
{
    x[0] = 2.0f;
}

int main(void)
{
    cuda::std::array<float, 2> x;
    // Initialises the array with the value 1.0f
    std::transform(x.begin(), x.end(), x.begin(), [](...) { return 1.0f; });

    // Displays the array
    std::cout << "Before: ";
    std::for_each(x.begin(), x.end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    // Calls the kernel with the arrray
    foo<<<1,1>>>(x);

    // Waits for synchronisation
    cudaDeviceSynchronize();

    // Displays the array again
    std::cout << "After:  ";
    std::for_each(x.begin(), x.end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";
}

Unfortunately, the resulting array seems to be unchanged according to the output I get:

Before: 1 1
After:  1 1

I made sure to check any CUDA error returned by cudaDeviceSynchronize() but there is none.

Also, I’m relatively new to C++ and I checked the array’s implementation in order to get a better understanding of how the memory is managed. But I found that it just used the implicitly-defined constructor for aggregate types. Where is the Unified Memory involved? There must be some call to cudaMallocManaged somewhere, right?

Note that I have a GeForce 1070Ti whose compute capability is 6.1 and I use CUDA 12.5.

Edit: I think that the kernel is, in fact, not even run. I tried putting invalid statements like accessing invalid x locations but cudaDeviceSynchronize() still does not return any error.

striker159 · June 12, 2024, 5:52pm

You did pass the array correctly between host and device and you were able to use the array in the kernel. So the array works as expected.

However, cuda::std::array does not allocate heap storage for its elements. It is a fixed-size stack array just like its host-only equivalent std::array. If that is passed to the kernel, a copy is created, i.e. the kernel modifies a copy of the array and leaves the original untouched. (Your kernel is probably optimized away to a no-op because it does not change observable state).

You could, of course, allocate managed memory yourself and put an array inside, see code below.


#include <iostream>
#include <algorithm>
#include <cuda/std/array>
#include <cassert>

using MyArray = cuda::std::array<float, 2>;

__global__
void foo(MyArray* x)
{
    (*x)[0] = 2.0f;
}

int main(void)
{
    cudaError_t status = cudaSuccess;
    MyArray* x_managed;

    //allocate memory accessible by both host and device which is large enough to store MyArray
    status = cudaMallocManaged(&x_managed, sizeof(MyArray));
    assert(status == cudaSuccess);

    //construct an instance of MyArray in that memory
    new (x_managed) MyArray();

    // Initialises the array with the value 1.0f
    std::transform(x_managed->begin(), x_managed->end(), x_managed->begin(), [](...) { return 1.0f; });

    // Displays the array
    std::cout << "Before: ";
    std::for_each(x_managed->begin(), x_managed->end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    // Calls the kernel with the arrray
    foo<<<1,1>>>(x_managed);
    status = cudaGetLastError();
    assert(status == cudaSuccess);

    // Waits for synchronisation
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);

    // Displays the array again
    std::cout << "After:  ";
    std::for_each(x_managed->begin(), x_managed->end(), [](auto x) { std::cout << x << " "; });
    std::cout << "\n";

    //destroy the MyArray instance
    x_managed->~MyArray();

    //free the memory
    status = cudaFree(x_managed);
    assert(status == cudaSuccess);
}

Before: 1 1 
After:  2 1

antoine59 · June 12, 2024, 8:07pm

Thank you! It was indeed a pretty basic misunderstanding.

system · June 26, 2024, 8:08pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Passing a dynamic array in C++ to CUDA kernel CUDA Programming and Performance	5	5520	December 4, 2013
Copying allocated arrays to an array of arrays/pointers CUDA Programming and Performance	2	108	May 12, 2025
How to implement a generic array in unified memory? CUDA Programming and Performance	7	2120	July 2, 2017
passing array values change CUDA Programming and Performance	0	2947	May 9, 2008
Using unified memory for 2Dim and 3 Dim array CUDA Programming and Performance	2	575	November 18, 2018
passing an array to a kenel ? CUDA Programming and Performance	9	13731	June 10, 2009
need some help with cudaMemcpy/cudamemcpy2D CUDA Programming and Performance	2	2052	June 9, 2010
CUDA array host to device CUDA Programming and Performance	11	6178	February 19, 2008
2d array testing in very simple code using CUDA CUDA Programming and Performance	29	30581	November 15, 2010
Passing Pointers between .cpp and .cu files CUDA Programming and Performance	1	3134	April 21, 2011

Passing libcu++'s arrays between the host and the device

Related topics