Problem with cudaMemcpy

I have been programming in CUDA for some time and keep running into the same problem. The speedups quoted for the computations are misleading, because they do not take the data transfers into account.
In my case I cannot speed up the transfer I need. It is a transfer of 10 arrays of doubles, each with a different number of elements.
I did it with 10 cudaMemcpy calls, but the transfers take over 80% of the computation time.
I tried merging the arrays into one and using an offset, but the improvement does not even reach 8%.
The next option I am considering is whether something like cudaMemcpy2D or cudaMemcpyArrayToArray might improve the transfer.
Can anyone shed some light on this? Or do I simply have to accept that transfers in CUDA are very slow?

I have been programming in CUDA and I keep stumbling upon the same problem. I can't seem to get the speedups mentioned in the SDK and manuals, as they don't take the memory transfer times into account.
I am trying to transfer 10 different arrays of type double to the GPU. I have done this using 10 cudaMemcpy calls, which take over 80% of the calculation time.
I tried unifying the arrays into a single one and using an offset, but I only get an 8% improvement.
The next option I am looking into is whether another function, such as cudaMemcpy2D or cudaMemcpyArrayToArray, can improve the transfer.
Can anyone shed any light on this matter, or do I have to assume that memory transfer is simply a major drawback of CUDA?
This is an example of my code:

Call Kernel

A very similar thread can be found here:

To improve memory copy speed you can try using pinned memory and asynchronous transfers. Pinned (page-locked) host memory simply transfers faster, and asynchronous transfers let you copy and run kernels at the same time.
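To make the pinned + asynchronous suggestion concrete, here is a minimal sketch. It is not taken from the original poster's code: the kernel `scale`, the function name `pinned_async_demo`, and the chunking scheme are all made up for illustration. It assumes the work can be split into independent chunks and that the GPU supports doubles and copy/kernel overlap.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: doubles every element.
__global__ void scale(double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

// Splits n doubles into nStreams chunks. Each chunk is copied in, processed,
// and copied back on its own stream, so the copy of one chunk can overlap
// with the kernel of another. Returns true if the results check out.
bool pinned_async_demo(int n, int nStreams) {
    if (n <= 0 || nStreams <= 0 || n % nStreams != 0) return false;
    const int chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(double);

    double *h = nullptr, *d = nullptr;
    // Pinned (page-locked) host memory: required for truly asynchronous copies.
    if (cudaMallocHost((void **)&h, n * sizeof(double)) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d, n * sizeof(double)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = (double)i;

    std::vector<cudaStream_t> streams(nStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    for (int s = 0; s < nStreams; ++s) {
        const int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then verify on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0 * i);

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Calling `pinned_async_demo(1 << 20, 4)` from `main` exercises it. Whether the overlap actually helps depends on the kernel: if the kernel is very short compared to the copy, the copy time still dominates.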

I have read this thread already and I have exactly the same problem, but the information provided does not resolve it. The results published by NVIDIA aren't entirely honest if they only report the improvement in actual kernel execution, and not in the total time. If I always have to allocate memory and transfer the data to the GPU, why do they not include the time that takes? Because it reduces the gains to almost nothing. Please, someone prove me wrong and show me that CUDA is not crippled by memory management times.

It depends on how you use it. If you do one kernel execution per memory copy, then yes, it is rather held back by memory copy times.

HOWEVER - both the big CUDA projects I’ve worked on have been able to execute many kernels between memory copies. One of them was reasonably often run for several hours without a single memory copy (overall speedup of 2 orders of magnitude). The other has a memory copy overhead of 1-5% of the GPU runtime. Worth mentioning, but hardly very significant.
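As a rough illustration of that usage pattern (not code from either of those projects), the sketch below copies the data to the GPU once, runs many kernel launches on it while it stays resident in device memory, and copies the result back once. The kernel `add_one` and the function `resident_data_demo` are hypothetical names chosen for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1.0 to every element.
__global__ void add_one(double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0;
}

// One host-to-device copy, `iterations` kernel launches, one device-to-host
// copy. The transfer cost is paid once and amortized over all launches.
bool resident_data_demo(int n, int iterations) {
    if (n <= 0 || iterations < 0) return false;
    std::vector<double> h(n, 0.0);
    const size_t bytes = n * sizeof(double);

    double *d = nullptr;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return false;

    // Single upload.
    bool ok = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Many kernels operate on the same device buffer; no copies in between.
    for (int it = 0; ok && it < iterations; ++it)
        add_one<<<(n + 255) / 256, 256>>>(d, n);

    // Single download; this blocking copy also waits for the kernels to finish.
    ok = ok && cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    for (int i = 0; ok && i < n; ++i) ok = (h[i] == (double)iterations);
    return ok;
}
```

For example, `resident_data_demo(1 << 20, 100)` performs two copies for 100 launches. A workload that needs fresh input for every launch cannot be restructured this way, which is the case the original poster describes.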

The quoted results seem to be assuming the second usage pattern. For me that makes more sense, though I can see it might not for everybody.

OK, my project is of the type with large data transfers. But can anyone tell me if it can be improved?

Other than what I said before about pinned & asynchronous (both of which could really help, though asynchronous depends on your kernel), the only other options I can think of are reducing the amount of data you transfer, or reducing the number of copy calls.
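Reducing the number of copy calls could look roughly like the sketch below: several host arrays are packed into one pinned staging buffer and sent with a single cudaMemcpy, with per-array device pointers computed from offsets. The function name `upload_packed` and the use of `std::vector` for the host arrays are assumptions for the example, not the original poster's layout. Note that the host-side packing has a cost of its own, so the net gain may be modest.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Copies several host arrays to the device with a single cudaMemcpy.
// On success, devPtrs[k] points at array k inside ONE device allocation;
// the caller releases everything with cudaFree(devPtrs[0]).
bool upload_packed(const std::vector<std::vector<double>> &arrays,
                   std::vector<double *> &devPtrs) {
    devPtrs.clear();
    size_t total = 0;
    for (const auto &a : arrays) total += a.size();
    if (total == 0) return false;
    const size_t bytes = total * sizeof(double);

    double *staging = nullptr, *dev = nullptr;
    if (cudaMallocHost((void **)&staging, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        cudaFreeHost(staging);
        return false;
    }

    // Pack the arrays back to back and record where each one starts.
    size_t off = 0;
    for (const auto &a : arrays) {
        if (!a.empty())
            std::memcpy(staging + off, a.data(), a.size() * sizeof(double));
        devPtrs.push_back(dev + off);
        off += a.size();
    }

    // One transfer instead of one per array.
    cudaError_t err = cudaMemcpy(dev, staging, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(staging);
    if (err != cudaSuccess) {
        cudaFree(dev);
        devPtrs.clear();
        return false;
    }
    return true;
}
```

In a real program the staging buffer would be allocated once and reused, since allocating pinned memory on every call is expensive in itself. If the data can be produced directly in the pinned buffer, the extra host-side memcpy disappears as well.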

I am already using pinned memory, and I use both synchronous and asynchronous copies where needed, but I was wondering if there is any memcpy variant that is better or faster than the rest for what I want to do.
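One way to judge whether a different copy call could help at all is to measure what the current copies achieve and compare that with what the bus can deliver. The sketch below times a single host-to-device cudaMemcpy with CUDA events and reports GB/s. The function name `h2d_bandwidth_gbps` is made up for this example, and it measures only a plain contiguous copy, not the original poster's actual transfers.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Measures host-to-device bandwidth for one contiguous copy of `bytes` bytes,
// from either pinned or ordinary pageable host memory.
// Returns GB/s, or -1.0 on failure.
double h2d_bandwidth_gbps(size_t bytes, bool pinned) {
    if (bytes == 0) return -1.0;
    void *h = nullptr, *d = nullptr;
    if (pinned) {
        if (cudaMallocHost(&h, bytes) != cudaSuccess) return -1.0;
    } else {
        h = std::malloc(bytes);
        if (!h) return -1.0;
    }
    std::memset(h, 0, bytes);  // touch the pages before timing

    float ms = 0.0f;
    bool ok = cudaMalloc(&d, bytes) == cudaSuccess;
    if (ok) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // warm-up, not timed

        cudaEventRecord(start, 0);
        ok = cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
    }
    if (pinned) cudaFreeHost(h); else std::free(h);

    if (!ok || ms <= 0.0f) return -1.0;
    return (bytes / 1e9) / (ms / 1e3);
}
```

Comparing `h2d_bandwidth_gbps(64 << 20, true)` with `h2d_bandwidth_gbps(64 << 20, false)` shows the pinned versus pageable difference on a given machine. If the real transfers already run close to the pinned figure, the remaining options are the ones mentioned above: moving less data or overlapping the copies with computation.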