Copying allocated arrays to an array of arrays/pointers

Hi, I’m new to CUDA and have been writing code where a kernel should generate and store values in an empty array of arrays - no copying of data from the host.

Most examples I’ve seen allocate device arrays (with cudaMalloc) in a for loop and store the pointers to these in a host array of pointers - after that they cudaMemcpy this array to a (previously allocated) device array of pointers. I guess these extra steps might have something to do with compute capability < 5.0

I wanted to try a slightly different approach: first allocating a device array of pointers (d_histRx) using cudaMallocManaged and then, in a for loop, allocating arrays on the device, storing each pointer in a helper pointer variable, and cudaMemcpy-ing it to the appropriate index of d_histRx. I tried copying the size of the whole array, passing a reference (and some other variations I could think of), but they all throw the same error: cudaErrorInvalidValue (1)
Here is that part of my code:

unsigned int histlen = 3;
unsigned int particle_N_ = 16;
float** d_histRx;
cudaError_t errstat;
float* helper;
errstat = cudaMallocManaged((void**)&d_histRx, histlen * sizeof(float*));
for (size_t i = 0; i < histlen; i++)
    {
        errstat = cudaMallocManaged((void**)&helper, particle_N_ * sizeof(float));
        assert(errstat == cudaSuccess);
        errstat = cudaMemcpy(d_histRx[i], helper, particle_N_ * sizeof(float), cudaMemcpyDeviceToDevice);
        assert(errstat == cudaSuccess); //throws the error
    }

I know this is not useful code per se - I also did cudaMallocManaged directly into d_histRx[i] and it works, like any other C malloc would.
But the examples I mentioned confused me, and now I really want to know why my first approach did not work - even more so in case I ever need to copy a device array into an array of arrays. What am I missing?

When you allocate space here:

    errstat = cudaMallocManaged((void**)&d_histRx, histlen * sizeof(float*));

You are allocating space for an array of pointers. That is storage space to hold pointer values. There are no values in there after the allocation, or if you prefer you could imagine it is initially an array of null pointers - all zeros. (It’s not all zeros, but neither is it any useful value, so the behavior is equivalent to the behavior as if it were all zeros.) Bottom line, d_histRx points to “empty”, uninitialized space. So let’s pretend it is all zeros.

Now what happens when you do this:

    errstat = cudaMemcpy(d_histRx[i], helper, particle_N_ * sizeof(float), cudaMemcpyDeviceToDevice);

That says: copy float data of length particle_N_ from the location pointed to by the value held in the helper pointer variable, to the location pointed to by the value held in the d_histRx[i] pointer variable. But d_histRx[i] (for every i) is empty, uninitialized space. As if it were zero. So you can’t copy to the location pointed to by a null pointer (the pointer whose numerical value is zero). And if you don’t like calling it zero, then imagine it is a pointer whose numerical value is a random, garbage number. You can’t (in the general case) copy to that location either.

The cudaMemcpy API call evidently checks the pointers you pass it, to make sure they are in a space corresponding to their intent (source or destination) given the specified transfer direction cudaMemcpyDeviceToDevice.

What you might be able to do (I haven’t tried it, but I think it would work) is to just copy&& the pointer value itself:

    errstat = cudaMemcpy(d_histRx+i, &helper, sizeof(float*), cudaMemcpyHostToDevice);

I should point out for future readers that I wouldn’t recommend this allocation scheme anyway, for reasons that I believe are widely discussed: pointer chasing is not the most efficient way to index into a multidimensional array, and furthermore this creates an unwieldy group of sub-allocations that have to be allocated, used&, and freed independently. There are simpler and more efficient approaches. But I gather your intent with this question is to study and learn, which is fine, of course.

(& Yes, some usage patterns won’t appear to be independent. But relatively simple usage patterns like trying to index through an array using pointer arithmetic, for example to go from one row to the next, may not work.)

(&& in fact, since both locations are accessible from host code, you could do:

memcpy(d_histRx+i, &helper, sizeof(float*));

)

Thank you very much for the detailed answer!
My initial attempt was actually passing the data (pointer) with just “sizeof(float*)”, but forgetting that cudaMemcpy needs the addresses (not the data) from and to which to copy - in this case from &helper to d_histRx+i, as you wrote.
Not sure why I got stuck on that misconception - I am familiar with nullptr and know it can’t be written through - but at least I got to try out all possible mistakes with these functions at once while tweaking.

Thanks also for the resources on multidimensional arrays - I’ll look into them in detail, as I was not aware of some seemingly newer functions (like cudaMallocPitch). The book I’ve read only mentions pointer arithmetic with linear memory. My idea here was actually to gather a history of changes to a 1D array, with an unpredictable number of “logs”, before copying it to the host in a batch - but considering the limited device memory, I guess it is wiser to predict a certain number of “logs” in accordance with the available memory, trigger an (asynchronous) copy to the host when the buffer is full, then free and re-allocate it.