3D memory allocation

Hi guys,
I have some code that manipulates 3D arrays (example: array[i][j][k]).
I now want to allocate these arrays on the device and do the manipulations there, to parallelize the process and reduce the run time.

Does anyone know how I am supposed to do this?

Thank you very much.

I shall interpret your question, especially the word “allocate”, rather liberally: I think you mean how to distribute the workload. Let me know if my interpretation is wrong and you actually want to know how to allocate. The answer to that is probably to use malloc and such.

Anyway the basic idea to distribute that workload is to do the following:

Or, in case the width is larger than the number of threads per block, extra blocks need to be used for the width:

BlockDim.Z = Depth;
BlockDim.Y = Height;

(Untested, but that’s my theory.)

Each block can execute on different processors.

Since each higher dimension ultimately comes down to a single 1D dimension, spreading the threads over the 1D dimension is probably OK.

This assumes the first dimension is the largest; otherwise the thread indexes would probably have to be spread over the higher dimensions as well.
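
A minimal host-side sketch of that mapping, assuming blocks are laid out over y and z and the threads of each block are spread along x, with extra blocks along x when the width exceeds the threads per block. It is plain C++ that only emulates the index arithmetic; `Launch`, `Coord`, `blocksAlongX` and `cellFor` are names made up for illustration. In an actual kernel the same values would come from the built-in `blockIdx`, `blockDim` and `threadIdx` variables.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a launch configuration; not a CUDA API.
struct Launch {
    std::size_t width, height, depth;  // array extents along x, y, z
    std::size_t threadsPerBlock;       // threads along x in one block
};

// A cell of the 3D array, plus a flag telling whether it really exists.
struct Coord {
    std::size_t x, y, z;
    bool inRange;
};

// Number of blocks needed along x so every column gets a thread (rounds up).
std::size_t blocksAlongX(const Launch& l) {
    assert(l.threadsPerBlock > 0);
    return (l.width + l.threadsPerBlock - 1) / l.threadsPerBlock;
}

// Cell handled by thread `threadX` of block (blockX, blockY, blockZ).
// Threads that fall past the end of a row are flagged so they can skip work.
Coord cellFor(const Launch& l, std::size_t blockX, std::size_t blockY,
              std::size_t blockZ, std::size_t threadX) {
    std::size_t x = blockX * l.threadsPerBlock + threadX;
    bool ok = x < l.width && blockY < l.height && blockZ < l.depth;
    return Coord{x, blockY, blockZ, ok};
}
```

The range flag matters whenever the width is not a multiple of the block size, because the last block along x then contains threads with no cell to work on.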

Another example: suppose the problem is small, then multiple blocks are not needed.

For example 8x8x8 could simply be done via:

Thank you very much Skybuck. I am not that far yet. My situation is that I have a 3D array (example: array[z][y]) in host memory, and I want to copy this data to device memory in 3D form. I tried using cudaMalloc3DArray and cudaMemcpy3D, but it does not seem to be working. Do you have any thoughts?

Thank you again for your kind help.

Allocate a 1-dimensional array which has the same total number of elements as the 3-dimensional array.

Use a memcpy for the 1d memory to copy it from host to device and later back again.
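
As a rough sketch of the host side of this, assuming a rectangular array stored as array[z][y][x] and a row-major layout where x varies fastest (both are assumptions, and the helper names `flatIndex`, `Grid3D` and `flatten` are my own, not part of any CUDA API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major offset of element (x, y, z); x varies fastest (an assumption).
std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                      std::size_t width, std::size_t height) {
    assert(x < width && y < height);
    return (z * height + y) * width + x;
}

// Nested host array indexed as a[z][y][x]; assumed rectangular (not jagged).
using Grid3D = std::vector<std::vector<std::vector<float>>>;

// Copy the nested array into one contiguous 1D buffer.
std::vector<float> flatten(const Grid3D& a) {
    std::size_t depth = a.size();
    std::size_t height = depth ? a[0].size() : 0;
    std::size_t width = height ? a[0][0].size() : 0;
    std::vector<float> flat(depth * height * width);
    for (std::size_t z = 0; z < depth; ++z)
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                flat[flatIndex(x, y, z, width, height)] = a[z][y][x];
    return flat;
}
```

The resulting buffer is contiguous, so its `data()` pointer and `size() * sizeof(float)` are the kind of arguments a single `cudaMalloc` and `cudaMemcpy` pair would take, in place of the 3D-specific calls.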

Then use Skybuck’s General Indexing formulas to distribute the workload.

Those formulas are still untested but should work.

They also include an example of how to convert 6D to 1D and 1D back to 3D, ensuring maximum scalability and flexibility.
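
The formulas themselves are not reproduced in this thread, so the following is only a generic sketch of the 1D-to-3D direction, assuming the usual row-major layout with x varying fastest. `Index3`, `unflatten` and `flattenIndex` are illustrative names, not the ones from the formulas mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// A 3D position inside the array.
struct Index3 {
    std::size_t x, y, z;
};

// Recover (x, y, z) from a row-major 1D offset, x varying fastest.
Index3 unflatten(std::size_t i, std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    Index3 r;
    r.x = i % width;              // position within a row
    r.y = (i / width) % height;   // row within a slice
    r.z = i / (width * height);   // slice
    return r;
}

// Inverse direction, handy for checking round trips.
std::size_t flattenIndex(const Index3& p, std::size_t width, std::size_t height) {
    return (p.z * height + p.y) * width + p.x;
}
```

Inside a kernel, `i` would typically be the global thread index, so each thread can work out which cell of the 3D array it owns.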

I am not yet sure how the formulas impact performance, but you could try it out and see how it works out for you.

Soon I will be able to test myself.
