2D Arrays in device code

I have an algorithm that uses many 2D arrays, and I execute it entirely in device code. Allocating memory with cudaMalloc() is slow, and most CUDA functions for 2D array allocation work only in host code. Is there any function or other way to allocate a 2D array inside a device function?

Thanks in advance,
Nimish