How to run more "logic code" on the device side

Hi,
I'm a new CUDA user. In my project I use CUDA for the matrix calculations, and the rest of the code (if/else logic, memory allocation with malloc, etc.) runs on the CPU.

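The CPU code looks like this (pseudocode):
malloc(matrix1);      // allocate on the CPU
cuda_calc1(matrix1);  // calculate on the GPU
malloc(matrix2);      // allocate on the CPU
cuda_calc2(matrix2);  // calculate on the GPU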

The problem is this: the CUDA calculation itself is fast, but there are many data transfers between the CPU and the GPU, and those transfers take almost all of the time.
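
To make the pattern concrete, here is a minimal sketch of what each step roughly does. It is not my real code: cuda_calc, scale_kernel, the matrix size, and the scaling factors are hypothetical placeholders, and I am assuming each calculation step copies its matrix to the GPU, runs one kernel, and copies the result back.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel (assumption): the real calculation is different.
// This one just multiplies every element by a factor.
__global__ void scale_kernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical stand-in for cuda_calc1 / cuda_calc2:
// copy host -> device, run the kernel, copy device -> host.
// Returns 0 on success, 1 on any CUDA error.
int cuda_calc(float *host_matrix, int n, float factor) {
    float *dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;

    // Transfer 1: CPU -> GPU
    cudaError_t err = cudaMemcpy(dev, host_matrix, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(dev, n, factor);
        err = cudaGetLastError();  // catches launch errors
    }
    if (err == cudaSuccess) {
        // Transfer 2: GPU -> CPU (this copy also waits for the kernel to finish)
        err = cudaMemcpy(host_matrix, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err == cudaSuccess ? 0 : 1;
}

int main(void) {
    const int n = 1024 * 1024;  // placeholder size

    float *matrix1 = (float *)malloc(n * sizeof(float));
    if (!matrix1) return 1;
    for (int i = 0; i < n; ++i) matrix1[i] = 1.0f;
    if (cuda_calc(matrix1, n, 2.0f) != 0) { free(matrix1); return 1; }

    // CPU-side logic between the two GPU steps: allocate and fill matrix2.
    float *matrix2 = (float *)malloc(n * sizeof(float));
    if (!matrix2) { free(matrix1); return 1; }
    for (int i = 0; i < n; ++i) matrix2[i] = matrix1[i];
    if (cuda_calc(matrix2, n, 3.0f) != 0) { free(matrix1); free(matrix2); return 1; }

    printf("matrix2[0] = %f\n", matrix2[0]);
    free(matrix1);
    free(matrix2);
    return 0;
}
```

In this sketch every calculation step pays for two full copies of the matrix, and the data has to come back to the CPU just so the CPU-side logic can run before the next step.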

So my questions are:
1. Is it possible to run some "logic code", such as malloc, inside CUDA (in a device thread)?
2. Are there any topics or blog posts that show how to do this?
3. Or is there another solution?