Memcpy takes over 80% of the time!

Good morning,
I am currently trying to optimize MATLAB code with CUDA, using mex-files that contain CUDA code. I have already obtained a slight improvement over pure MATLAB code, but after running the CUDA Visual Profiler I discovered that over 80% of the time is spent copying data to and from the GPU.

Profile_times.bmp (1.1 MB)

I know that using pinned memory can improve memory transfer speeds by roughly a factor of two, but I can't seem to get it to work with MATLAB mex-files.
I am not calling any additional MATLAB functions, so according to other posts I should be able to use pinned memory. Has anyone managed to use pinned memory, and if so, could you please post an example of pinned memory with mex-files?
It would be useful to see it used in mex-files, as I get the data via a pointer directly from MATLAB, so there is no need to reserve host memory separately.
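
To make the question concrete, here is a minimal sketch of one way this could look. It is not a confirmed solution. It assumes a CUDA toolkit recent enough to provide cudaHostRegister (4.0 or later), and it assumes that page-locking memory owned by MATLAB in place is acceptable, which I have not verified. Older toolkits may also require the pointer and size to be page-aligned, which MATLAB does not guarantee. The file name pinned_scale.cu, the kernel scale, and the factor 2.0 are hypothetical placeholders for the real computation.

```cuda
// pinned_scale.cu: hypothetical mex-file sketch, not a tested implementation.
#include "mex.h"
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work: out[i] = f * in[i].
__global__ void scale(double *d, size_t n, double f)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]))
        mexErrMsgIdAndTxt("pinnedSketch:input", "Expected one real double array.");

    size_t n     = mxGetNumberOfElements(prhs[0]);
    size_t bytes = n * sizeof(double);

    // Output array with the same shape as the input.
    plhs[0] = mxCreateNumericArray(mxGetNumberOfDimensions(prhs[0]),
                                   mxGetDimensions(prhs[0]),
                                   mxDOUBLE_CLASS, mxREAL);
    if (n == 0) return;  // nothing to pin or copy

    double *in  = static_cast<double *>(mxGetData(prhs[0]));
    double *out = static_cast<double *>(mxGetData(plhs[0]));
    double *dev = NULL;
    bool inPinned = false, outPinned = false;

    // Assumption: pinning MATLAB's own buffers in place, instead of
    // allocating a separate pinned staging buffer with cudaHostAlloc.
    cudaError_t err = cudaHostRegister(in, bytes, cudaHostRegisterDefault);
    inPinned = (err == cudaSuccess);
    if (err == cudaSuccess) {
        err = cudaHostRegister(out, bytes, cudaHostRegisterDefault);
        outPinned = (err == cudaSuccess);
    }
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&dev, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        unsigned int threads = 256;
        unsigned int blocks  = (unsigned int)((n + threads - 1) / threads);
        scale<<<blocks, threads>>>(dev, n, 2.0);
        err = cudaGetLastError();  // catches launch errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out, dev, bytes, cudaMemcpyDeviceToHost);

    // Clean up before reporting, because mexErrMsgIdAndTxt does not return.
    if (dev)       cudaFree(dev);
    if (outPinned) cudaHostUnregister(out);
    if (inPinned)  cudaHostUnregister(in);

    if (err != cudaSuccess)
        mexErrMsgIdAndTxt("pinnedSketch:cuda", "CUDA error: %s",
                          cudaGetErrorString(err));
}
```

From MATLAB this would be called as y = pinned_scale(x). Note that registering and unregistering memory has a cost of its own, so for a single transfer per call it may not be faster than a plain pageable copy; keeping the data on the GPU across several kernels may matter more than pinning.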
Any help would be great!
Thank you in advance,
David Lisin