A problem with double pointers

I ran into a problem using a double pointer in CUDA. The input is a stack of 2D arrays (i.e. a 3D array) of size Rows*Cols*n_A. I want to store it on the device as a set of 2D arrays (each flattened into linear memory). My code fragment looks like this:

// on device memory
cuComplex** d_ptrA;
cuComplex** d_ptrB;
.....

// evaluate h_A (data in host memory; a stack of 2D arrays, i.e. a 3D array of size Rows*Cols*n_A)
....

// allocate the double pointer
cudaMalloc((void**) &d_ptrA, n_A*sizeof(cuComplex*));
checkCudaError("at cudaMalloc1");

// allocate a set of 2D CUDA arrays to store the first matrix stack
for (int i = 0; i < n_A; i++)
{
    cudaMalloc((void**)d_ptrA[i], Rows*Cols * sizeof(cuComplex));
    checkCudaError("at cudaMalloc2");
    cudaMemcpy(d_ptrA[i], h_A+i*Rows*Cols, Rows*Cols*sizeof(cuComplex), cudaMemcpyHostToDevice);
    checkCudaError("at cudaMemcpy1");
}

But I always get a segmentation violation, which crashes MATLAB. Can anyone help me out?

Based on the first call to cudaMalloc(), you almost certainly want

// on device memory
cuComplex* d_ptrA;
cuComplex* d_ptrB;

Given that d_ptrA is a pointer to device memory, dereferencing such a pointer in host code will tend to cause a crash:

cudaMalloc((void**)d_ptrA[i], ...)

Thank you for your reply, but I think you misunderstood me. I allocate d_ptrA to store many pointers; that is, its elements are themselves pointers. That is why it is declared cuComplex** d_ptrA.

Yes, I misunderstood. However, dereferencing a pointer to device memory on the host side still will not work. The cudaMalloc() call delivers a pointer to device memory into host-side storage, and d_ptrA[i] is device-side storage.

So you mean I should store those pointers on the host side?

I don't know whether this makes sense:

cuComplex** ptrA;
ptrA = (cuComplex**)malloc(sizeof(cuComplex*)*n_A);

// allocate a set of 2D CUDA arrays to store the first matrix stack
for (int i = 0; i < n_A; i++)
{
    cudaMalloc((void**)d_ptrA, Rows*Cols * sizeof(cuComplex));
    checkCudaError("at cudaMalloc3");
    cudaMemcpy(d_ptrA, h_A+i*Rows*Cols, Rows*Cols*sizeof(cuComplex), cudaMemcpyHostToDevice);
    checkCudaError("at cudaMemcpy1");
    ptrA[i] = d_ptrA;
}

I would say you have to store these pointers in device memory. But then you need a kernel to fill the array, as your host cannot access it.

But wouldn't it be simpler to have just one contiguous piece of memory? Why scatter it over many small parts?

Oh, there is a reason I am trying to use a set of arrays.

I have two sets of images as input. At first, I read the two sets into two 3D textures, then read a pair of images from the two 3D textures to do some rotation, FFT, etc.

But you know a 3D texture has a size limit of 2048×2048×2048, which means that if the number of images in a set is greater than 2048, an error occurs. So I want to try some other way.

I have tried using a contiguous piece of memory, reading out one image at a time and binding it to a 2D texture in a loop, but that is much slower than the 3D texture method.

So I want to try allocating a set of arrays or a set of 2D textures to store the set of images. But I need to store their addresses, which is why I use a double pointer… but it always fails.

Maybe I can try using a kernel to fill the arrays, just as you said.