Hi, I'm new to CUDA and have been writing code in which a kernel should generate values and store them in an initially empty array of arrays, with no data copied from the host.
Most examples I've seen allocate the device arrays (with cudaMalloc) in a for loop and store the resulting pointers in a host array of pointers. After that, they cudaMemcpy this host array to a previously allocated device array of pointers. I guess these extra steps might have something to do with compute capability below 5.0.
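For reference, the pattern those examples follow can be sketched roughly as below. This is only a minimal sketch: the function name allocDeviceRows, its parameters, and the h_rows staging vector are hypothetical names chosen for illustration, not taken from any particular example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from any library): allocates `rows` device buffers
// of `cols` floats each and returns a device array holding their addresses.
// `h_rows` receives the same addresses on the host so they can be freed later.
// Returns nullptr if any CUDA call fails.
float** allocDeviceRows(size_t rows, size_t cols, std::vector<float*>& h_rows)
{
    h_rows.assign(rows, nullptr);
    float** d_rows = nullptr;
    bool ok = true;

    // Step 1: allocate each row on the device; the pointers land in a host array.
    for (size_t i = 0; i < rows && ok; ++i)
        ok = cudaMalloc((void**)&h_rows[i], cols * sizeof(float)) == cudaSuccess;

    // Step 2: allocate the device array of pointers.
    ok = ok && cudaMalloc((void**)&d_rows, rows * sizeof(float*)) == cudaSuccess;

    // Step 3: copy the pointer values (not the row contents) host -> device.
    ok = ok && cudaMemcpy(d_rows, h_rows.data(), rows * sizeof(float*),
                          cudaMemcpyHostToDevice) == cudaSuccess;

    if (!ok) {
        for (float* p : h_rows) cudaFree(p);  // cudaFree(nullptr) is a no-op
        cudaFree(d_rows);
        h_rows.clear();
        return nullptr;
    }
    return d_rows;
}
```

Note that step 3 transfers rows * sizeof(float*) bytes, i.e. the addresses themselves, and that the source of the copy is ordinary host memory.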
I wanted to try a slightly different approach: first allocate the device array of pointers (d_histRx) with cudaMallocManaged, then, in a for loop, allocate each array on the device, store its pointer in a helper pointer variable, and cudaMemcpy it to the appropriate index of d_histRx. I tried copying the size of the whole array, passing a reference, and a few other variations I could think of, but they all fail with the same error: cudaErrorInvalidValue (1).
Here is that part of my code:
unsigned int histlen = 3;
unsigned int particle_N_ = 16;
float** d_histRx;
cudaError_t errstat;
float* helper;
errstat = cudaMallocManaged((void**)&d_histRx, histlen * sizeof(float*));
for (size_t i = 0; i < histlen; i++)
{
    errstat = cudaMallocManaged((void**)&helper, particle_N_ * sizeof(float));
    assert(errstat == cudaSuccess);
    errstat = cudaMemcpy(d_histRx[i], helper, particle_N_ * sizeof(float), cudaMemcpyDeviceToDevice);
    assert(errstat == cudaSuccess); // throws the error
}
I know this code is not useful per se. I also tried calling cudaMallocManaged directly on d_histRx[i], and that works, just as an ordinary C malloc would.
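The direct variant that works for me looks roughly like the sketch below. The names makeManagedRows and fillRows, and the values the kernel writes, are hypothetical and only there to illustrate a kernel filling the rows without any data coming from the host.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one column and writes a value
// into that column of every row.
__global__ void fillRows(float** rows, unsigned int nRows, unsigned int nCols)
{
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= nCols) return;
    for (unsigned int r = 0; r < nRows; ++r)
        rows[r][col] = static_cast<float>(r * nCols + col);
}

// Hypothetical helper: allocates the array of pointers and every row as
// managed memory. Returns nullptr if any allocation fails.
float** makeManagedRows(unsigned int nRows, unsigned int nCols)
{
    float** rows = nullptr;
    if (cudaMallocManaged((void**)&rows, nRows * sizeof(float*)) != cudaSuccess)
        return nullptr;

    for (unsigned int i = 0; i < nRows; ++i) {
        // rows[i] lives in managed memory, so the runtime can store the new
        // address there directly, with no helper pointer and no cudaMemcpy.
        if (cudaMallocManaged((void**)&rows[i], nCols * sizeof(float)) != cudaSuccess) {
            for (unsigned int j = 0; j < i; ++j) cudaFree(rows[j]);
            cudaFree(rows);
            return nullptr;
        }
    }
    return rows;
}
```

With histlen = 3 and particle_N_ = 16 as above, a launch such as fillRows<<<1, 32>>>(rows, 3, 16) followed by cudaDeviceSynchronize() should let the host read rows[r][c] directly afterwards.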
Still, the examples mentioned above confused me, and now I really want to know why my first approach did not work, especially in case I ever need to copy a device array into an array of arrays. What am I missing?