For( i = 0; i < 500; i++ )
{
    Load in data
    Split data into two 1-D arrays
    Move ONE array to each of the two devices
    Execute kernel on each device
    Move ONE array from each device back to HOST
}
I looked at the SimpleMultiGPU example and came across this problem:
This solution works, but there is MAJOR LAG! Allocating and freeing “d_Input” for each device on every iteration really swamps the processing time.
Ideally, I want to set up a global device memory pointer/location for each device so that I only allocate the memory once and then overwrite it on each pass through the loop. What is the best approach to do this? Or is my line of thinking incorrect? Keep in mind that I need a global device memory location for EACH device, and it needs to be set up before the num_iterations loop begins.
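A minimal sketch of the allocate-once idea, assuming a runtime where a single host thread may switch devices with cudaSetDevice (CUDA 4.0 and later; on older runtimes you would do the same work in one host thread per GPU, as the reply below describes). The kernel, buffer names, and sizes here are illustrative, not from the simpleMultiGPU sample:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS 2
#define N        (1 << 20)   /* elements per device, illustrative */
#define NUM_ITER 500

/* Hypothetical kernel standing in for the real one. */
__global__ void myKernel(float *d_Input, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_Input[i] *= 2.0f;
}

int main(void)
{
    float *d_Input[NUM_GPUS];  /* one persistent device pointer per GPU */
    float *h_Data[NUM_GPUS];

    /* Allocate ONCE, before the iteration loop. */
    for (int dev = 0; dev < NUM_GPUS; dev++) {
        cudaSetDevice(dev);
        cudaMalloc(&d_Input[dev], N * sizeof(float));
        cudaMallocHost(&h_Data[dev], N * sizeof(float)); /* pinned host memory speeds copies */
    }

    for (int iter = 0; iter < NUM_ITER; iter++) {
        /* ... load the data and split it into h_Data[0] and h_Data[1] ... */
        for (int dev = 0; dev < NUM_GPUS; dev++) {
            cudaSetDevice(dev);
            cudaMemcpy(d_Input[dev], h_Data[dev], N * sizeof(float),
                       cudaMemcpyHostToDevice);  /* overwrite in place, no realloc */
            myKernel<<<(N + 255) / 256, 256>>>(d_Input[dev], N);
            cudaMemcpy(h_Data[dev], d_Input[dev], N * sizeof(float),
                       cudaMemcpyDeviceToHost);
        }
    }

    /* Free ONCE, after all iterations. */
    for (int dev = 0; dev < NUM_GPUS; dev++) {
        cudaSetDevice(dev);
        cudaFree(d_Input[dev]);
        cudaFreeHost(h_Data[dev]);
    }
    return 0;
}
```

The only per-iteration work on the device side is the two cudaMemcpy calls and the kernel launch; all allocation cost is paid once up front.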
Well, if you are using multiple GPUs, you need multiple host threads, because the resources allocated for a GPU are only valid within the context of the host thread that allocated them. For multi-GPU work, I currently use pthreads.
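A sketch of that one-host-thread-per-GPU pattern, under the constraint the reply describes (each device context bound to the thread that created it). Each worker allocates its device buffer once, reuses it across all iterations, and frees it at the end; the kernel and all names are illustrative:

```cuda
#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N        (1 << 20)   /* elements per device, illustrative */
#define NUM_ITER 500

/* Hypothetical kernel standing in for the real one. */
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

typedef struct {
    int    device;   /* GPU this worker owns */
    float *h_data;   /* the host half-array assigned to this GPU */
} WorkerArgs;

/* Each worker binds to one GPU, allocates once, loops, then frees. */
static void *worker(void *p)
{
    WorkerArgs *a = (WorkerArgs *)p;
    float *d_Input;

    cudaSetDevice(a->device);                 /* context lives in this thread */
    cudaMalloc(&d_Input, N * sizeof(float));  /* allocate ONCE per thread */

    for (int iter = 0; iter < NUM_ITER; iter++) {
        cudaMemcpy(d_Input, a->h_data, N * sizeof(float), cudaMemcpyHostToDevice);
        myKernel<<<(N + 255) / 256, 256>>>(d_Input, N);
        cudaMemcpy(a->h_data, d_Input, N * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_Input);
    return NULL;
}

int main(void)
{
    pthread_t  t[2];
    WorkerArgs args[2];

    for (int dev = 0; dev < 2; dev++) {
        args[dev].device = dev;
        args[dev].h_data = (float *)malloc(N * sizeof(float));
        pthread_create(&t[dev], NULL, worker, &args[dev]);
    }
    for (int dev = 0; dev < 2; dev++)
        pthread_join(t[dev], NULL);
    return 0;
}
```

Because the device pointer lives for the whole lifetime of its worker thread, this also solves the original allocate-every-iteration problem: the loop inside worker() only copies and launches, never reallocates.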