Moving data to a CUDA subprogram - there has to be a better way!

There is something that I am very confused about in the Kirk and Hwu book “Programming Massively Parallel Processors”. In Chapter 7, the MRI case study, they talk about creating two kernels from one subprogram that manipulates some very large matrices. The logic is fine - in fact, I have a similar problem, which is good for me since I am still learning CUDA. But Chapter 7 never discusses how they got the data into the device matrices! They just assume it is there and go from there.

When I create cudaMalloc statements, followed by cudaMemcpy statements and finally cudaFree statements for matrices that have a large number of elements, and then run them, it takes a long time to carry out these statements.
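For concreteness, the pattern I mean looks roughly like this (`N` and `my_subprogram` are just placeholders for my own sizes and code):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Roughly the pattern in question: allocate, copy in, compute, copy out, free.
size_t bytes = N * N * sizeof(float);       // N is a placeholder
float *h_A = (float *)malloc(bytes);        // host matrix (pageable memory)
float *d_A;

cudaMalloc((void **)&d_A, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
my_subprogram(d_A, N);                      // still the original C code for now
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);
free(h_A);
```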

I have a C program that runs for several seconds on an Intel 2.5 GHz CPU. I isolated some bottleneck subroutines which I believe are ripe to be rewritten in CUDA, and I added a section of the above statements to the C program around the call to one of these subprograms, because that is how it’s done. Before continuing, I decided to just try running the program with these memory statements only, still calling the original C subprogram that has not yet been rewritten in CUDA.

Just getting the host matrices created, device memory allocated, data transferred to the GPU device and eventually freed takes an inordinate amount of time. A very inordinate amount of time! It increases the running time of the whole program by a factor of 10.

This is going the wrong way. There must be some way to get large amounts of data into a GPU subprogram in a timely manner. I am sure the logic inside the GPU subprogram is sound, but just getting the data into that program is a whole different issue.


The PCI-E bus has a theoretical maximum speed of 8 GB/s IIRC (and you never actually see that - expect more like 5-6 GB/s in practice). As a back-of-envelope figure, a 1 GB matrix takes on the order of 200 ms each way at that rate. If you do very little work on the data between transferring it to the GPU and back, this becomes the limiting factor.

Don’t despair though - there are a few tricks you could potentially use.

  1. Zero-copy. If your device allows for it, zero-copy skips the cudaMemcpy call entirely; the memory is instead fetched over the bus as the kernel reads it. You’re still limited by the PCI-E speed, but if you use each element only once, the transfer overlaps with the computation, so you are (effectively) bound only by the PCI-E speed rather than by transfer time plus compute time.
  2. Pinned memory. If you allocate your host arrays as pinned (page-locked) memory with cudaHostAlloc instead of malloc, the transfer speed is noticeably faster.
  3. Streams. If your device supports them, streams let you copy one portion of your data to the GPU while concurrently computing on another. A bit like zero-copy, but if you use each element more than once, streams will be more efficient.
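To illustrate point 1, a minimal zero-copy sketch (error checking omitted; `my_kernel`, `grid` and `block` are placeholders for your own code):

```cuda
#include <cuda_runtime.h>

// Sketch: map pinned host memory into the device address space so the
// kernel fetches it over PCI-E on demand, with no explicit cudaMemcpy.
float *h_buf, *d_ptr;
size_t bytes = 1 << 26;                         // e.g. 64 MB

cudaSetDeviceFlags(cudaDeviceMapHost);          // must come before other CUDA calls
cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_ptr, h_buf, 0);

// Fill h_buf on the host, then launch - the kernel reads host memory:
my_kernel<<<grid, block>>>(d_ptr);
cudaDeviceSynchronize();                        // results land back in h_buf
cudaFreeHost(h_buf);
```

You can check whether your device supports this by looking at the `canMapHostMemory` field of `cudaDeviceProp`.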
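For point 2, the only change on the host side is how the arrays are allocated - a sketch (error checking omitted):

```cuda
#include <cuda_runtime.h>

// Sketch: pinned (page-locked) host allocation. The GPU can DMA pinned
// pages directly, so cudaMemcpy runs noticeably faster than it does
// from ordinary malloc'd (pageable) memory.
float *h_buf, *d_buf;
size_t bytes = 1 << 26;

cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);  // not malloc()
cudaMalloc((void **)&d_buf, bytes);

cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
// ... kernel launches on d_buf ...
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

cudaFree(d_buf);
cudaFreeHost(h_buf);                                          // not free()
```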
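And for point 3, a sketch of chunked, overlapped transfer and compute. This needs pinned host memory for the async copies to actually be asynchronous, and a device with an async copy engine; `h_buf`, `d_buf`, `bytes`, `my_kernel`, `grid` and `block` are placeholders:

```cuda
#include <cuda_runtime.h>

// Sketch: split the array into chunks and let the copy of one chunk
// overlap with the kernel working on another, using CUDA streams.
const int nStreams = 4;
size_t chunkBytes  = bytes / nStreams;
size_t chunkElems  = chunkBytes / sizeof(float);
cudaStream_t stream[nStreams];

for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < nStreams; ++i) {
    size_t off = i * chunkElems;
    cudaMemcpyAsync(d_buf + off, h_buf + off, chunkBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    my_kernel<<<grid, block, 0, stream[i]>>>(d_buf + off);
    cudaMemcpyAsync(h_buf + off, d_buf + off, chunkBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish
```

Work queued in the same stream runs in order, so within each stream the copy-in, kernel and copy-out stay correctly sequenced while different streams overlap.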

Failing that, if PCI-E isn’t good enough for you, you have to find a way of cutting down on memory transfers. Perhaps you could port more of your application to the GPU? To be honest, applications which take ‘several seconds’ aren’t usually very good targets for CUDA optimisation…