Allocating memory in a __device__ function: how?

Can I allocate memory during GPU calculations, from inside a device function?

Device functions have the same restrictions as global functions, so all global memory has to be allocated from your host code.

That means allocating very large amounts of memory for some problems :(
For example, I need to do some operations on large matrices, and I don't know in advance how large the resulting matrix will be, i.e. I need to keep allocating additional memory blocks for the resulting matrix. Do you know how to resolve this problem?

Either preallocate a very large amount of GPU memory, or figure out how much memory you're going to need in one kernel, return that value to the host, allocate the appropriate amount of memory, and then launch another kernel.
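A rough sketch of the second option (count in one kernel, allocate on the host, fill in a second kernel). The example problem (keeping only the nonzero entries of an array) and the names count_nonzero, write_nonzero and compact_nonzero are made up for illustration, and error checking is left out to keep it short:

```cuda
#include <cuda_runtime.h>

// Pass 1: count how many output elements will be produced
// (in this sketch: the number of nonzero inputs).
__global__ void count_nonzero(const float *in, int n, unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0.0f)
        atomicAdd(count, 1u);
}

// Pass 2: write the outputs into the exactly sized buffer.
// `cursor` hands out slots atomically, so the output order is not deterministic.
__global__ void write_nonzero(const float *in, int n, float *out,
                              unsigned int *cursor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0.0f) {
        unsigned int slot = atomicAdd(cursor, 1u);
        out[slot] = in[i];
    }
}

// Hypothetical host driver. Returns the number of elements written to
// *d_out_ptr, a device buffer the caller must release with cudaFree().
unsigned int compact_nonzero(const float *d_in, int n, float **d_out_ptr)
{
    *d_out_ptr = nullptr;
    if (n <= 0)
        return 0;

    unsigned int *d_count = nullptr;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    count_nonzero<<<blocks, threads>>>(d_in, n, d_count);

    // Return the size to the host (this copy waits for the kernel to finish).
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    if (h_count > 0) {
        // Allocate exactly what is needed, then launch the second kernel.
        cudaMalloc(d_out_ptr, h_count * sizeof(float));
        cudaMemset(d_count, 0, sizeof(unsigned int)); // reuse as write cursor
        write_nonzero<<<blocks, threads>>>(d_in, n, *d_out_ptr, d_count);
        cudaDeviceSynchronize();
    }
    cudaFree(d_count);
    return h_count;
}
```

The extra cost compared to a single kernel is one 4-byte device-to-host copy, one cudaMalloc() and a second launch, in exchange for an output buffer that is never larger than needed.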

Preallocating a very large amount of GPU memory doesn't suit my case: it uses memory very inefficiently, and for very large matrices there may not be enough memory.

Returning values and launching another kernel takes a long time :( In that case, implementing it on the CPU would be more effective.

Preallocating is a good solution. Instead of issuing multiple cudaMalloc() calls for all your arrays (whose sizes I’m guessing may change), issue a single, very large cudaMalloc() call. Then you can do with the block of memory anything you wish, split it up any way you like, and allocate from it in device code (depending on your needs, you can even use global memory atomics to create a full, robust allocator).
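Here is a minimal sketch of that idea, assuming the simplest possible allocator: a "bump" allocator that only moves an offset forward with a global memory atomic and never frees individual blocks. DevicePool, pool_create, pool_alloc and fill_rows are hypothetical names invented for this example, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

// Hypothetical pool descriptor: one big cudaMalloc'd block plus a bump offset.
struct DevicePool {
    unsigned char *base;         // start of the preallocated block
    unsigned long long size;     // capacity in bytes
    unsigned long long *offset;  // device-side counter: bytes handed out so far
};

// Host side: the single large cudaMalloc() call, plus a small counter.
DevicePool pool_create(unsigned long long bytes)
{
    DevicePool pool;
    pool.size = bytes;
    cudaMalloc(&pool.base, bytes);
    cudaMalloc(&pool.offset, sizeof(unsigned long long));
    cudaMemset(pool.offset, 0, sizeof(unsigned long long));
    return pool;
}

// Device side: carve `bytes` out of the pool. Returns nullptr when the pool
// is exhausted. There is no free(); the whole pool is recycled by resetting
// *offset to 0 from the host between kernels.
__device__ void *pool_alloc(DevicePool pool, unsigned long long bytes)
{
    bytes = (bytes + 15ull) & ~15ull;  // round up so blocks stay 16-byte aligned
    unsigned long long start = atomicAdd(pool.offset, bytes);
    if (start + bytes > pool.size)
        return nullptr;
    return pool.base + start;
}

// Example kernel: each thread allocates a row whose length is only known
// at run time (here simply i % 4 + 1 floats) and fills it with its index.
__global__ void fill_rows(DevicePool pool, float **rows, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    int len = i % 4 + 1;
    float *row = static_cast<float *>(pool_alloc(pool, len * sizeof(float)));
    if (row != nullptr)
        for (int k = 0; k < len; ++k)
            row[k] = static_cast<float>(i);
    rows[i] = row;  // nullptr tells the host that the pool ran out
}
```

A launch would look like fill_rows<<<blocks, threads>>>(pool, d_rows, n), where d_rows is a device array of n pointers. A robust allocator with per-block free would need more bookkeeping (free lists, size headers), but the atomicAdd() on a shared offset is the core of allocating from device code without any host round trip.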