There is something in the Kirk and Hwu book "Programming Massively Parallel Processors" that I am very confused about. In Chapter 7, the MRI case study, they create two kernels from one subprogram that manipulates some very large matrices. The logic is fine; in fact, I have a similar problem. But Chapter 7 never discusses how they got the data into the device matrices! They just assume it is there and go from there. As I said, I have a similar problem, which is good for me since I am still learning CUDA.
When I write cudaMalloc calls, followed by cudaMemcpy calls, and finally cudaFree calls for matrices with a large number of elements, and then run them, it takes a long time to execute these statements.
I have a C program that runs for several seconds on a 2.5 GHz Intel CPU. I isolated some bottleneck subroutines that I believe are ripe to be rewritten in CUDA, and I put a section of the above statements around the call to one of these subprograms, because that is how it's done. Before continuing, I decided to just try running the program with these memory statements plus the original C subprogram, which has not yet been rewritten in CUDA (roughly like the sketch below).
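For concreteness, here is a stripped-down sketch of what that section looks like. The element count N and the name bottleneck_subroutine are placeholders I made up for this post, not my actual code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define N (4096 * 4096)   /* placeholder element count, not my real size */

/* stand-in for one of my real CPU bottleneck subroutines */
static void bottleneck_subroutine(float *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] *= 2.0f;   /* dummy work */
}

int main(void)
{
    size_t bytes = (size_t)N * sizeof(float);

    /* host matrix */
    float *h_matrix = (float *)malloc(bytes);
    memset(h_matrix, 0, bytes);

    /* device matrix */
    float *d_matrix = NULL;
    cudaMalloc((void **)&d_matrix, bytes);

    /* copy host -> device */
    cudaMemcpy(d_matrix, h_matrix, bytes, cudaMemcpyHostToDevice);

    /* the subroutine itself still runs on the CPU for now */
    bottleneck_subroutine(h_matrix, N);

    /* copy device -> host (will matter once a kernel actually writes d_matrix) */
    cudaMemcpy(h_matrix, d_matrix, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_matrix);
    free(h_matrix);
    return 0;
}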
Just getting the host matrices created, allocated on the device, transferred to the GPU, and eventually freed takes an inordinate amount of time. It increases the running time of the whole program by a factor of 10.
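To pin down which step is eating the time, something like the following sketch should work (again with placeholder sizes; gettimeofday is POSIX-only, and note that the first CUDA call also pays the one-time cost of initializing the CUDA context, so cudaMalloc can look slower than it really is):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* plain wall-clock timer */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

int main(void)
{
    size_t bytes = (size_t)4096 * 4096 * sizeof(float);  /* placeholder size */
    float *h_matrix = (float *)malloc(bytes);
    float *d_matrix = NULL;
    memset(h_matrix, 0, bytes);

    double t0 = wall_seconds();
    cudaMalloc((void **)&d_matrix, bytes);   /* first CUDA call also creates the context */
    double t1 = wall_seconds();
    cudaMemcpy(d_matrix, h_matrix, bytes, cudaMemcpyHostToDevice);
    double t2 = wall_seconds();
    cudaFree(d_matrix);
    double t3 = wall_seconds();

    printf("cudaMalloc: %.4f s\n", t1 - t0);
    printf("cudaMemcpy: %.4f s\n", t2 - t1);
    printf("cudaFree:   %.4f s\n", t3 - t2);

    free(h_matrix);
    return 0;
}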
This is going the wrong way. There must be some way to get large amounts of data into a GPU subprogram in a timely manner. I am sure the logic in the GPU subprogram is sound, but just getting the data into that program is a whole different issue.
Newport_j