With this approach, I need to transfer data between the CPU and GPU frequently. My data is large (30 MB) and I need the whole algorithm to finish in a very short time (20 ms).
Right now, data transfer takes most of the time. Is there any way to speed up the transfer, or even to let the CPU access GPU data directly?
You can use pinned (page-locked) host memory to speed up memory copies. You could also partition your calculation into smaller pieces (which are still large enough to saturate the GPU) and then process one partition while you are transferring another.
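A minimal sketch of both suggestions, assuming the data is an array of floats and the work can be split into independent chunks. The kernel `scaleKernel` and the helper `processInChunks` are made-up placeholders for illustration, not part of the original algorithm. The host buffer has to be allocated with `cudaMallocHost` (or `cudaHostAlloc`), otherwise `cudaMemcpyAsync` will not overlap with kernel execution.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Hypothetical placeholder kernel: doubles each element in place.
__global__ void scaleKernel(float* data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes hostPinned[0..n) in nChunks pieces, one CUDA stream per chunk.
// Assumption: hostPinned was allocated with cudaMallocHost/cudaHostAlloc.
void processInChunks(float* hostPinned, size_t n, int nChunks) {
    float* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));

    std::vector<cudaStream_t> streams(nChunks);
    for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    const size_t chunk = (n + nChunks - 1) / nChunks;  // ceiling division
    const unsigned threads = 256;

    for (int c = 0; c < nChunks; ++c) {
        const size_t offset = static_cast<size_t>(c) * chunk;
        if (offset >= n) break;
        const size_t count = std::min(chunk, n - offset);
        const size_t bytes = count * sizeof(float);

        // Upload, compute, and download are queued on the same stream, so
        // they run in order for this chunk but may overlap with other chunks.
        CUDA_CHECK(cudaMemcpyAsync(dev + offset, hostPinned + offset, bytes,
                                   cudaMemcpyHostToDevice, streams[c]));
        const unsigned blocks =
            static_cast<unsigned>((count + threads - 1) / threads);
        scaleKernel<<<blocks, threads, 0, streams[c]>>>(dev + offset, count);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(hostPinned + offset, dev + offset, bytes,
                                   cudaMemcpyDeviceToHost, streams[c]));
    }

    // Wait for all chunks to finish, then release resources.
    for (auto& s : streams) {
        CUDA_CHECK(cudaStreamSynchronize(s));
        CUDA_CHECK(cudaStreamDestroy(s));
    }
    CUDA_CHECK(cudaFree(dev));
}
```

To use it, allocate the host buffer with `cudaMallocHost(&h, n * sizeof(float))`, fill it, call `processInChunks(h, n, 4)`, and release it with `cudaFreeHost(h)`. How much the copies and kernels actually overlap depends on the device and on the chunk size, so the number of chunks is something to tune by measurement.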
If you need to transfer the full 30 MB before and after each of the steps, and cannot subdivide your data so that you copy to and from the device in parallel, you would need a PCIe bandwidth in excess of 6 * 30 MB / 0.020 s = 9 GB/s, even with no time spent on processing the data. So you have to find ways to minimize the data copied and to (at least partially) overlap copying to the device, copying from the device, and processing the data.
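Spelled out, the estimate above works as follows. The factor 6 is the number of full 30 MB transfers counted in this answer, and 1 GB is taken as 1000 MB:

```latex
\[
  B_{\min}
  = \frac{6 \times 30\ \text{MB}}{0.020\ \text{s}}
  = \frac{180\ \text{MB}}{0.020\ \text{s}}
  = 9000\ \text{MB/s}
  = 9\ \text{GB/s}
\]
```

Each full transfer that can be dropped lowers the required bandwidth by 1.5 GB/s under the same 20 ms budget.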