Array pointing to multiple arrays on the device

I have a CPU code that I am trying to set up on the GPU which involves an array of pointers where each address in the array points to one of two different arrays. Very succinctly,

double ** ptrArray_h = new double * [size];

ptrArray_h[i] = &array1_h[some index];
ptrArray_h[i+1] = &array2_h[some other index];

Then, I can do this when using the ptrArray:

value = *ptrArray[index];

I would like to do this on the device and am aware that my pointer array cannot simply be copied as is, because the device arrays (array1_d and array2_d) have different memory addresses. Typically I would advocate against pointer arrays, but I’m not certain how I could go about doing this without this type of indirection. If there is a cleaner way to do this, I’m open to it.
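For concreteness, here is a minimal CPU-only sketch of that pattern; the sizes, indices, and values are made up purely for illustration:

#include <cstdio>

int main (void)
{
    /* made-up sizes and contents, purely for illustration */
    double array1_h[4] = { 10.0, 11.0, 12.0, 13.0 };
    double array2_h[3] = { 20.0, 21.0, 22.0 };

    const int size = 2;
    double **ptrArray_h = new double * [size];
    ptrArray_h[0] = &array1_h[2];   /* points at an element of array1_h */
    ptrArray_h[1] = &array2_h[1];   /* points at an element of array2_h */

    for (int i = 0; i < size; i++) {
        printf ("%f\n", *ptrArray_h[i]);   /* prints 12.0, then 21.0 */
    }
    delete [] ptrArray_h;
    return 0;
}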

It is not clear what the question is. CUDA device code supports a large subset of C++, so

(1) You can create the same kind of data structures on the device that you create on the host, including an array of pointers to arrays of elements of type T.
(2) You can copy data structures like an array of pointers to arrays of elements of type T just like you normally do in C++, by performing a deep copy.

Some people like to represent 2D matrices as an array of pointers to 1D vectors of T (not something I would advise, for performance reasons). Here is an example of how such a matrix can be moved between host and device. Your use case seems closely related.

#include <cstdio>
#include <cstdlib>
#include <complex>
#include "cuComplex.h"

#define N  (2)
#define M  (3)

typedef std::complex<float> T;

__global__ void print_device_matrix (cuComplex** mat)
{
    printf ("matrix on device:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", cuCrealf (mat[i][j]), cuCimagf (mat[i][j]));
        }
        printf ("\n");
    }
}

int main (void)
{
    /* allocate host "matrix" */
    T **mat = (T**)malloc (N * sizeof (mat[0]));
    for (int i = 0; i < N; i++) {
        mat[i] = (T *)malloc (M * sizeof (mat[0][0]));
    }
    
    /* fill in host "matrix" */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            mat[i][j] = T (float(i)+1, float(j)+1);
        }
    }

    /* print host "matrix" */
    printf ("matrix on host:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", real(mat[i][j]), imag(mat[i][j]));
        }
        printf ("\n");
    }

    /* allocate device "matrix" */
    T **tmp = (T**)malloc (N * sizeof (tmp[0]));
    for (int i = 0; i < N; i++) {
        cudaMalloc ((void **)&tmp[i], M * sizeof (tmp[0][0]));
    }
    cuComplex **matD = 0;
    cudaMalloc ((void **)&matD, N * sizeof (matD[0]));

    /* copy "matrix" from host to device */
    cudaMemcpy (matD, tmp, N * sizeof (matD[0]), cudaMemcpyHostToDevice);
    for (int i = 0; i < N; i++) {
        cudaMemcpy (tmp[i], mat[i], M * sizeof (matD[0][0]), cudaMemcpyHostToDevice);
    }
    free (tmp);

    /* print device "matrix" */
    print_device_matrix<<<1,1>>> (matD);

    /* free host "matrix" */
    for (int i = 0; i < N; i++) {
        free (mat[i]);
    }
    free (mat);
    
    /* free device "matrix" */
    tmp = (T**)malloc (N * sizeof (tmp[0]));
    cudaMemcpy (tmp, matD, N * sizeof (matD[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        cudaFree (tmp[i]);
    }
    free (tmp);
    cudaFree (matD);

    return EXIT_SUCCESS;
}

I’ll have to study your code. Thank you for the response. I’m just unsure how this works with the references to two different arrays. If I can create the data structure on the device then fair enough - I can do that.

In my example, each row of the 2D matrix is stored in a separate array of T. The matrix itself is represented by an array of row pointers. Unless I overlooked something, your use case is a 2D matrix with two rows called array1 and array2, and your array of row pointers is called ptrArray.

This may be of interest.

I think maybe I’m doing a poor job of explaining what I’m doing. I also may just not be following both of your recommendations very well. Allow me to try again.

array1_h is of size Ni
array2_h is of size Np

My ptrArray is pointing to particular addresses in each of these separately allocated arrays:

ptrArray_h[i] = &array1_h[some index];
ptrArray_h[i+1] = &array2_h[some other index];

So ptrArray[i] holds the address of a single element of either array1_h or array2_h. When I access ptrArray[index] I get a memory address inside one of those two arrays, and when I dereference it, i.e. *ptrArray[i], I get the value stored there. To be literal about the code above: *ptrArray_h[i] gives me the value of array1_h[some index], and *ptrArray_h[i+1] gives me the value of array2_h[some other index].

I guess I’m not entirely sure that I have just a plain 2D array here. I have two separately allocated arrays that need to be accessed conditionally based on this data structure. I need to be able to access these two different arrays, allocated on my device, through this ptrArray, which in my mind is just a clever way of avoiding an if statement on the device:

if (index == i) val = array1_h[some index];
else            val = array2_h[some other index];

Does this make some sense? If it does and your recommendations still hold, I will study them and see if I can make sense of them. Thanks for the quick replies.

Hi,
It seems more like a C question than a CUDA one.

Here is one idea:

Assuming that both arrays use the same data type, just allocate a single array of size Ni + Np.
Then use an index-mapping array, i.e. an array that contains indices into the unified array.
Now you can do any kind of arbitrary mapping, as sketched below.
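A minimal sketch of that idea, assuming both arrays hold doubles; the names (unified, idxMap, gather), sizes, and values are invented for illustration, and error checking is omitted:

#include <cstdio>

__global__ void gather (const double *unified, const int *idxMap, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = unified[idxMap[i]];   /* arbitrary mapping into the unified array */
    }
}

int main (void)
{
    const int Ni = 4, Np = 3, n = 2;
    /* conceptually, array1 (Ni elements) followed by array2 (Np elements) */
    double unified_h[Ni + Np] = { 10, 11, 12, 13, 20, 21, 22 };
    int    idxMap_h[n]        = { 2, Ni + 1 };   /* array1[2] and array2[1] */

    double *unified_d = 0, *out_d = 0;
    int *idxMap_d = 0;
    cudaMalloc ((void **)&unified_d, (Ni + Np) * sizeof (unified_d[0]));
    cudaMalloc ((void **)&idxMap_d,  n * sizeof (idxMap_d[0]));
    cudaMalloc ((void **)&out_d,     n * sizeof (out_d[0]));
    cudaMemcpy (unified_d, unified_h, (Ni + Np) * sizeof (unified_d[0]), cudaMemcpyHostToDevice);
    cudaMemcpy (idxMap_d,  idxMap_h,  n * sizeof (idxMap_d[0]), cudaMemcpyHostToDevice);

    gather<<<1,32>>> (unified_d, idxMap_d, out_d, n);

    double out_h[n];
    cudaMemcpy (out_h, out_d, n * sizeof (out_d[0]), cudaMemcpyDeviceToHost);
    printf ("%f %f\n", out_h[0], out_h[1]);   /* expect 12 and 21 */

    cudaFree (unified_d);
    cudaFree (idxMap_d);
    cudaFree (out_d);
    return 0;
}

The caveat is that this requires both arrays to live in a single allocation; the reply further down notes that they have to stay separate here because of MPI.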

It would have been good to mention that at the start. So my understanding was mistaken: this is not a 2D matrix-like data structure. But it does not matter what it is. Any data structure you can construct using pointers on the host, you can likewise construct on the device, and you can copy between the two using an appropriate deep-copy process. A list can be copied, a tree can be copied, a DAG can be copied.
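To make that concrete for this particular case, here is a minimal sketch (error checking omitted; the sizes, offsets, and the deref kernel are invented for illustration). The essential point is that ptrArray_d has to be filled with device addresses, computed from array1_d and array2_d, not with the host addresses stored in ptrArray_h:

#include <cstdio>
#include <cstdlib>

__global__ void deref (double **ptrArray, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = *ptrArray[i];   /* reads from either array1_d or array2_d */
    }
}

int main (void)
{
    const int Ni = 4, Np = 3, size = 2;
    double array1_h[Ni] = { 10, 11, 12, 13 };
    double array2_h[Np] = { 20, 21, 22 };

    /* plain (flat) copies of the two data arrays */
    double *array1_d = 0, *array2_d = 0;
    cudaMalloc ((void **)&array1_d, Ni * sizeof (array1_d[0]));
    cudaMalloc ((void **)&array2_d, Np * sizeof (array2_d[0]));
    cudaMemcpy (array1_d, array1_h, Ni * sizeof (array1_d[0]), cudaMemcpyHostToDevice);
    cudaMemcpy (array2_d, array2_h, Np * sizeof (array2_d[0]), cudaMemcpyHostToDevice);

    /* build the pointer array on the host, but out of device addresses */
    double **tmp = (double **)malloc (size * sizeof (tmp[0]));
    tmp[0] = array1_d + 2;   /* mirrors ptrArray_h[i]   = &array1_h[some index]       */
    tmp[1] = array2_d + 1;   /* mirrors ptrArray_h[i+1] = &array2_h[some other index] */

    /* copy the pointer array itself to the device */
    double **ptrArray_d = 0;
    cudaMalloc ((void **)&ptrArray_d, size * sizeof (ptrArray_d[0]));
    cudaMemcpy (ptrArray_d, tmp, size * sizeof (ptrArray_d[0]), cudaMemcpyHostToDevice);
    free (tmp);

    double *out_d = 0;
    cudaMalloc ((void **)&out_d, size * sizeof (out_d[0]));
    deref<<<1,32>>> (ptrArray_d, out_d, size);

    double out_h[size];
    cudaMemcpy (out_h, out_d, size * sizeof (out_d[0]), cudaMemcpyDeviceToHost);
    printf ("%f %f\n", out_h[0], out_h[1]);   /* expect 12 and 21 */

    cudaFree (array1_d);
    cudaFree (array2_d);
    cudaFree (ptrArray_d);
    cudaFree (out_d);
    return EXIT_SUCCESS;
}

This is the same deep-copy pattern as the matrix example above; the only differences are that the two "rows" have different lengths and that the stored pointers can land in the middle of them rather than at their start.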

The arrays need to be kept separate because they are communicated differently via MPI_Send and MPI_Recv.

This is what I’m going to need to do a bit of reading on. I want to just check one last thing before I give the check-mark to your solution.

I have array1_h, which I then (deep?) copy to the device as array1_d.
I have array2_h, which I then (deep?) copy to the device as array2_d.

I have ptrArray_h, which I need to deep copy to the device as ptrArray_d.

If I want *ptrArray_d[i] to give me the value of array1_d[some index] then deep copying array(1/2)_h to array(1/2)_d is also required?

I really appreciate your help pointing me to the terms needed to solve this problem.