I will interpret your question, especially the word “allocate”, fairly liberally: I think you mean how to distribute the workload. Let me know if my interpretation is wrong and you actually want to know how to “allocate” memory. The answer to that is probably to use malloc and such.
Anyway, the basic idea for distributing that workload is the following:
Or, in case the width is larger than the number of threads per block, extra blocks need to be used for the width:
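GridDim.Z = Depth; // number of blocks per dimension (in CUDA this is the grid size, not blockDim)
GridDim.Y = Height;
GridDim.X = (Width + Threads - 1) div Threads; // round up so a partial last block is not lost
BlockDim.X = Threads; // threads per block
Array[BlockIdx.Z][BlockIdx.Y][BlockIdx.X * Threads + ThreadIdx.X] // only if that last index is < Width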
(Untested, but that’s my theory.)
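To check the theory without a GPU, here is a rough CPU-side sketch in plain C++ that just replays the index arithmetic. The loop variables stand in for BlockIdx.Z/Y/X and the thread index, and the function names (`blocks_for_width`, `simulate_launch`, `every_element_visited_once`) are made up for this post. It assumes the block count along the width is rounded up and that threads past the end of a row are skipped, which is needed when the width is not a multiple of the thread count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocks needed along the width, rounded up so a partial last block is included.
int blocks_for_width(int width, int threads) {
    assert(width > 0 && threads > 0);
    return (width + threads - 1) / threads;
}

// Replays the launch on the CPU and counts how often each element is touched.
// bz/by/bx play the role of the block index, tx the thread index within a block.
std::vector<int> simulate_launch(int depth, int height, int width, int threads) {
    std::vector<int> visits(static_cast<std::size_t>(depth) * height * width, 0);
    const int grid_x = blocks_for_width(width, threads);
    for (int bz = 0; bz < depth; ++bz)
        for (int by = 0; by < height; ++by)
            for (int bx = 0; bx < grid_x; ++bx)
                for (int tx = 0; tx < threads; ++tx) {
                    const int x = bx * threads + tx;
                    if (x >= width) continue;  // guard: last block may overshoot the row
                    const std::size_t idx =
                        (static_cast<std::size_t>(bz) * height + by) * width + x;
                    ++visits[idx];
                }
    return visits;
}

// True if the mapping covers every element exactly once.
bool every_element_visited_once(int depth, int height, int width, int threads) {
    for (int v : simulate_launch(depth, height, width, threads))
        if (v != 1) return false;
    return true;
}
```

In a real kernel the four loops disappear, since the hardware supplies the block and thread indexes; only the computation of `x`, the guard, and the array access would remain.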
So the workload is divided up into blocks and threads.
Each block can execute on a different multiprocessor.
Since a multi-dimensional array is ultimately laid out as a single 1D range in memory, spreading the threads over just one dimension is probably OK.
This assumes the first dimension (X, the width) is the largest; otherwise the thread indexes would probably have to be spread over the higher dimensions as well.
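To make the "it all comes down to 1D" point concrete, here is a minimal sketch of how a 3D index collapses into a flat offset. It assumes a C-style row-major layout where the width is the fastest-varying dimension; `flat_index` is a hypothetical helper, not something from a library.

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of element [z][y][x] in a row-major array with the given height and width.
std::size_t flat_index(int z, int y, int x, int height, int width) {
    assert(z >= 0 && y >= 0 && y < height && x >= 0 && x < width);
    return (static_cast<std::size_t>(z) * height + y) * width + x;
}
```

With this layout, stepping x by one moves one element in memory, so threads with neighbouring indexes along the width touch neighbouring addresses, which is generally the access pattern you want on the GPU.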
As another example, suppose the problem is small; then multiple blocks are not needed.
For example, 8x8x8 could simply be done via: