The short answer is that there is no “easy” example. Your “3D” array is really a two-dimensional array of pointers, with each pointer holding the address of a single row of data. Allocating and copying data to the device in that form requires iterative cudaMalloc() and cudaMemcpy() calls, each allocating and copying a single row of data. At the end of it all, your device kernels will have to read through two levels of pointer indirection to get to your data (which is very slow), and none of the 3D API functions you have been asking about will work with data in that form anyway.
The arrays used by the 2D and 3D CUDA memory access functions are really flat, 1D spaces which are padded for alignment and optimal access performance by the GPU memory controller. You would be much better served by using a 1D array of size x*y*z on both the host and the device, indexed as data[i + j*x + k*x*y] in column major order (or the equivalent row major form).
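To make the flat layout concrete, here is a minimal host-side sketch in plain C. The Volume struct and the volume_alloc(), volume_index() and volume_free() helpers are hypothetical names made up for illustration, not part of CUDA, and float is just an assumed element type. The index function is the data[i + j*x + k*x*y] expression from above, with i varying fastest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container: one contiguous block holding x*y*z floats. */
typedef struct {
    size_t x, y, z;   /* extents; x varies fastest in memory */
    float *data;      /* x*y*z elements, no per-row pointers */
} Volume;

/* Map (i, j, k) to a flat offset: i + j*x + k*x*y. */
static size_t volume_index(const Volume *v, size_t i, size_t j, size_t k)
{
    assert(i < v->x && j < v->y && k < v->z);
    return i + j * v->x + k * v->x * v->y;
}

/* Allocate a zero-initialised volume; returns 0 on success, -1 on failure. */
static int volume_alloc(Volume *v, size_t x, size_t y, size_t z)
{
    v->x = x;
    v->y = y;
    v->z = z;
    v->data = calloc(x * y * z, sizeof *v->data);
    return v->data ? 0 : -1;
}

/* Release the single allocation backing the volume. */
static void volume_free(Volume *v)
{
    free(v->data);
    v->data = NULL;
}
```

Because the whole volume lives in one allocation, it can be moved to the device with a single cudaMalloc() and a single cudaMemcpy() of x*y*z*sizeof(float) bytes, and a kernel can use the same index expression on the device pointer with no pointer indirection.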