I have noticed that in the API there are two functions, cudaHostAlloc and cudaMallocHost. Can someone explain the difference between the two?
Also - I have noticed some threads discussing usage, and that these methods carry quite a bit of overhead. I understand the cost can be amortized by reusing the buffers.
How large is that overhead in practice? If I have, say, three large images (70 MB each) that I want to get to the GPU, would it be best to allocate three buffers with cudaMallocHost and read the images into those buffers directly? Presently, I read into a normal malloc() array and then copy to a device array.
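To make the approach I'm considering concrete, here is an untested sketch of what I mean by reading directly into pinned buffers (the read_image() call is a placeholder for my own file-loading code, and the buffer names are mine):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

#define NUM_IMAGES 3
#define IMG_BYTES (70u * 1024u * 1024u)  // ~70 MB per image

int main() {
    unsigned char* h_img[NUM_IMAGES];
    unsigned char* d_img[NUM_IMAGES];

    for (int i = 0; i < NUM_IMAGES; ++i) {
        // Pinned (page-locked) host buffer instead of malloc();
        // avoids the extra staging copy the driver otherwise does.
        cudaMallocHost((void**)&h_img[i], IMG_BYTES);
        cudaMalloc((void**)&d_img[i], IMG_BYTES);

        // read_image(i, h_img[i]);  // hypothetical: decode the file straight into the pinned buffer

        cudaMemcpy(d_img[i], h_img[i], IMG_BYTES, cudaMemcpyHostToDevice);
    }

    for (int i = 0; i < NUM_IMAGES; ++i) {
        cudaFree(d_img[i]);
        cudaFreeHost(h_img[i]);  // pinned memory must be freed with cudaFreeHost, not free()
    }
    return 0;
}
```

Is this the right shape, or is the pinned allocation cost only worth it if the buffers are reused across many transfers?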
Last question - if I instead used a single cudaMallocHost buffer and copied each image into it one at a time, is there a way to move an image to another array once it is on the device? That way I could allocate just one pinned host buffer, transfer all three images to the device through it, and still have all three resident on the device concurrently.
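In other words, something like the following untested sketch, where d_staging and d_img are my own names - is cudaMemcpyDeviceToDevice the right tool here?

```cpp
#include <cuda_runtime.h>

#define NUM_IMAGES 3
#define IMG_BYTES (70u * 1024u * 1024u)  // ~70 MB per image

int main() {
    unsigned char* h_pinned;             // single reusable pinned host buffer
    unsigned char* d_staging;            // single staging buffer on the device
    unsigned char* d_img[NUM_IMAGES];    // final per-image device arrays

    cudaMallocHost((void**)&h_pinned, IMG_BYTES);
    cudaMalloc((void**)&d_staging, IMG_BYTES);
    for (int i = 0; i < NUM_IMAGES; ++i)
        cudaMalloc((void**)&d_img[i], IMG_BYTES);

    for (int i = 0; i < NUM_IMAGES; ++i) {
        // read_image(i, h_pinned);  // hypothetical: load image i into the shared pinned buffer

        // Host -> device into the staging buffer...
        cudaMemcpy(d_staging, h_pinned, IMG_BYTES, cudaMemcpyHostToDevice);
        // ...then device -> device into the image's own array.
        cudaMemcpy(d_img[i], d_staging, IMG_BYTES, cudaMemcpyDeviceToDevice);
    }

    for (int i = 0; i < NUM_IMAGES; ++i)
        cudaFree(d_img[i]);
    cudaFree(d_staging);
    cudaFreeHost(h_pinned);
    return 0;
}
```

Or could I skip the staging buffer entirely and just cudaMemcpy from the pinned host buffer straight into each d_img[i]?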