How to run more "logic code" on the device side

I’m a new CUDA user. In my project I use CUDA for the matrix calculations, while the rest of the code (if/else logic, memory allocation with malloc, etc.) runs on the CPU.

The CPU code looks like this:

My problem is this: the CUDA calculation itself is fast, but there are many data transfers between the CPU and the GPU, and those transfers take almost all of the run time.

So my question is:
1. Is it possible to run some of this “logic code”, such as malloc, inside CUDA (for example, in a thread)?
2. Are there any topics or blog posts that show how to do this?
3. Or is there another solution?