3D array and 3D volume indice in CUDA indices mapping How to map linear CUDA indices back to origina

Greetings,

I would very much appreciate if anybody could help me understanding the following:

Dealing with volumes represented by 3D arrays is very common task in scientific (parallel) computing.

Unfortunately after reading CUDA docs and this forum I still don’t understand or can come up with a formula that allows for map host 3D ijk indices to 1d LINEAR pitched CUDA

memory allocation (using CudaMalloc3D(), CudaMalloc3DArray(), or cudaMallocPitch() ).

Suppose I have a 3D array allocated on host like this:

[codebox]/* allocate a 3-d array */

void ***Alloc3 (size_t n1, size_t n2, size_t n3, size_t size)

{

    size_t i3,i2;

    void ***p;

if ((p=(void***)malloc(n3*sizeof(void**)))==NULL)

            return NULL;

    if ((p[0]=(void**)malloc(n3*n2*sizeof(void*)))==NULL) {

            free(p);

            return NULL;

    }

    if ((p[0][0]=(void*)malloc(n3*n2*n1*size))==NULL) {

            free(p[0]);

            free(p);

            return NULL;

    }

for (i3=0; i3<n3; i3++) {

            p[i3] = p[0]+n2*i3;

            for (i2=0; i2<n2; i2++)

                    p[i3][i2] = (char*)p[0][0]+size*n1*(i2+n2*i3);

    }

    return p;

}

[/codebox]

where size os sizeof(float) for example.

Suppoze I have small volume nx=2, ny=3, nz=4.

I allocate v = Alloc3Float(nz,nx,ny) - > calls Alloc3 above.

I initialize, just for example to :

[codebox]

for (int i = 0; i < ny; i++ )

{

    for (int j = 0; j < nx; j++ )

    {

           for (int k = 0; k < nz; k++ )

           {

                v  [/i][j][k] = 2+k;

           }

    }

}

[/codebox]

So on host I have:

[codebox]

for (int i = 0; i < ny; i++ )

{

        for (int j = 0; j < nx; j++ )

        {

           for (int k = 0; k < nz; k++ )

           {

           fprint (log,"value %.2f address %d ", v[/i][j][k], &v[/i][j][k]);

}

   }

   fprint (log,"\n ");

   fflush(log);

}

[/codebox]

Output:

[i]value 2.00 address 43869328 value 3.00 address 43869332 value 4.00 address 43869336 value 5.00 address 43869340 value 2.00 address 43869344 value 3.00 address 43869348 value 4.00 address 43869352 value 5.00 address 43869356

value 2.00 address 43869360 value 3.00 address 43869364 value 4.00 address 43869368 value 5.00 address 43869372 value 2.00 address 43869376 value 3.00 address 43869380 value 4.00 address 43869384 value 5.00 address 43869388

value 2.00 address 43869392 value 3.00 address 43869396 value 4.00 address 43869400 value 5.00 address 43869404 value 2.00 address 43869408 value 3.00 address 43869412 value 4.00 address 43869416 value 5.00 address 43869420

[/i]

In this case I see that this is a range of contiguous addresses. Let’s just say that I do not have any influence over how the 3D array is allocated.

However, as a side note, had the 3D array been allocated like this for example:

[codebox]

void alloc_init ()

{

h_a3d  = (float ***)malloc(WX * sizeof(float **));

for(int i = 0; i < WX; i++)

{

            h_a3d[/i] = (float **)malloc(HY * sizeof(float *));

            if (h_a3d[/i] == NULL){perror("Malloc1 failed!"); exit (1); }

            for(int j = 0; j < HY; j++)

            {

                    h_a3d[/i][j] = (float *)malloc(DZ * sizeof(float));

                    if (h_a3d[/i][j] == NULL){perror("Malloc2 failed!"); exit (1); }

            }

}

}

[/codebox]

addresses would not have been contiguous.

Regardless, what I need is a formula (code) that would allow me, using cudaMalloc3D(Array), cudaPitchedPtr, cudaMemcpy3D(), etc, copy 3D array that was allocated for me to device, make necessary computations and copy it back, so that it would “fall directly back” into addresses on host as they are referenced by host “ijk” indices. That is, so that I could use 3D array w/o remapping indices immediately after array is copied back from device.

I had some success with using slice, slicePitch, and using float3 type. I.e. I got correct results, but then I need to remap float3 type row by row to the original 3D ijk array.

That is float3 row[x/y/z].[x/y/z] reference different memory locations than float 3darray[i][j][k].

Needless to say, if I have to remap indices to values and copy values from cudaPitchedPtr to the original 3D array every time after call to CUDA kernel - hardly makes sense to use CUDA at all.

On this forum I have seen some incomplete code excerpts showing how to do this w/o use of any 3D CUDA calls.

But was anybody successful in 3D volume processing using CUDA 3D calls w/o the need to remap from 1D Pitched CUDA ptr back to the original indices/values?

Again, I have no control of how original 3D array is allocated. I need to take it, calculate new values - and return back the in same memory layout.

Any help will be greatly appreciated.

Thank You.

See here:
http://forums.nvidia.com/index.php?showtop…mp;#entry493821

I saw that post.

Like I said above, the solutions that I saw there do not use cuda3D () functions and data structures ( cudaMemcpy3DParms, cudaMalloc3DArray, cudaPitchedPtr, etc…) as recommended by different CUDA publications.

Despite being recommended I have yet to see a complete working example.

This is what I am looking for.

May be I missed something in the post you referring, but I did not see that there.

Sorry, I guess I haven’t thoroughly followed your post.

If you find anything, please do share.

I don’t know if this would help, but I gave up on trying to manage indices like you want to do. I do the following from now on:

Allocate memory on device:

device float phi[meshmax][meshmax][meshmax];
device float rho[meshmax][meshmax][meshmax];

From you host program :

call your function with :

function(    .............                 float (*rho1)[meshmax][meshmax],  float (*phi1)[meshmax][meshmax]     ..................)

You can set your device memory to zero using this:

    unsigned char *d_addr = NULL;
    cudaGetSymbolAddress( (void**)&d_addr,"phi");
    cudaMemset( d_addr, 0.0f  , sizeof(phi));

And even copy your input data with:

cudaMemcpyToSymbol(rho,rho1,sizeof(rho));

then call yout kernel
calc_kernel <<< gridsize,N_THREADS_PER_BLOCK >>> (…);

and copy back your output data with:

cudaMemcpyFromSymbol(phi1,phi,sizeof(phi));

In your kernel, you can work wtih ijk

calc_kernel()
{
phi[i][j][k] = rho[i][j][k] ;

}

I don’t know if this is slower than with the 3D cuda functions but it works and I don’t have to play with index conversion.