Multiple GPUs, device memory

Community,

I have a program that does this:

for( i = 0; i < 500; i++ )
{
    Load in data
    Split data into two 1-D arrays
    Move ONE array to each of the two devices
    Execute kernel
    Move ONE array from each device back to the host
}

I started from the SimpleMultiGPU example and ended up with this structure, which is where the problem comes in:

for( i = 0; i < num_iterations; i++ )
{
    // Launch one worker thread per GPU for this iteration
    for( int device_counter = 0; device_counter < num_gpus; device_counter++ )
    {
        threadID[device_counter] = cutStartThread( (CUT_THREADROUTINE)ComputeDevice, (void *)(deviceData + device_counter) );
    }

    // Wait for all of this iteration's worker threads to finish
    cutWaitForThreads( threadID, num_gpus );
}

where my CUT_THREADROUTINE is…

static CUT_THREADPROC ComputeDevice(TGPUdeviceData *deviceData)
{
    unsigned short *d_Input;

    // Bind this host thread to its device
    cutilSafeCall( cudaSetDevice(deviceData->device) );

    // Allocate device memory
    cutilSafeCall( cudaMalloc((void**)&d_Input, ...) );

    // Copy the input to the device
    cutilSafeCall( cudaMemcpy( d_Input, deviceData->h_data, ..., cudaMemcpyHostToDevice ) );

    << Call Kernel1 >>

    // Copy the results back to the host
    cutilSafeCall( cudaMemcpy( deviceData->h_data, d_Input, ..., cudaMemcpyDeviceToHost ) );

    cutilSafeCall( cudaFree(d_Input) );

    CUT_THREADEND;
}

This solution works, but there is major lag: allocating and freeing d_Input for every device on every single iteration swamps the actual processing time.

Ideally, I want to set up a global device memory pointer for EACH device so that I allocate the memory only once, before the num_iterations loop begins, and simply overwrite it on every pass through the loop. What is the best approach to do this? Or is my line of thinking incorrect?
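
Roughly what I am picturing (pseudocode only; the pointer array and NUM_GPUS are names I just made up):

unsigned short *d_Input[NUM_GPUS];   // one persistent device pointer per GPU

// Allocate once per device, before the iteration loop
for( int d = 0; d < NUM_GPUS; d++ )
{
    cudaSetDevice(d);
    cudaMalloc((void**)&d_Input[d], ...);
}

for( i = 0; i < num_iterations; i++ )
{
    // Overwrite the existing allocations: copy in, run the kernel, copy out
}

// Free once per device, after the loop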

Thanks so much for your help!

Well, if you are using multiple GPUs, you need multiple host threads, since the resources allocated on a GPU are only valid within the context of the host thread that allocated them. For multi-GPU work I currently use pthreads.
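
Here is a minimal sketch of that structure (two devices, a made-up element count N, a trivial stand-in Kernel1, and no error checking). Each worker thread binds to its GPU once, allocates d_Input once, and then reuses the same buffer on every iteration, synchronizing with the host through a pair of barriers, so the cudaMalloc/cudaFree cost is paid once per device instead of once per iteration:

#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_GPUS       2
#define NUM_ITERATIONS 500
#define N              (1 << 20)              /* elements per device (made up) */

typedef struct {
    int             device;                   /* which GPU this thread owns      */
    unsigned short *h_data;                   /* this device's slice of the data */
} TGPUdeviceData;

static pthread_barrier_t start_barrier;       /* host -> workers: data is ready  */
static pthread_barrier_t done_barrier;        /* workers -> host: results ready  */

/* Trivial stand-in for the real kernel. */
__global__ void Kernel1(unsigned short *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1;
}

static void *ComputeDevice(void *arg)
{
    TGPUdeviceData *dd = (TGPUdeviceData *)arg;
    unsigned short *d_Input;

    /* Bind this host thread to its GPU and allocate exactly once. */
    cudaSetDevice(dd->device);
    cudaMalloc((void **)&d_Input, N * sizeof(unsigned short));

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        pthread_barrier_wait(&start_barrier);        /* wait for new host data */

        /* Reuse the same device buffer on every iteration. */
        cudaMemcpy(d_Input, dd->h_data, N * sizeof(unsigned short),
                   cudaMemcpyHostToDevice);
        Kernel1<<<N / 256, 256>>>(d_Input, N);
        cudaMemcpy(dd->h_data, d_Input, N * sizeof(unsigned short),
                   cudaMemcpyDeviceToHost);          /* also syncs the kernel  */

        pthread_barrier_wait(&done_barrier);         /* hand results to host   */
    }

    cudaFree(d_Input);                               /* free once, at the end  */
    return NULL;
}

int main(void)
{
    pthread_t       threads[NUM_GPUS];
    TGPUdeviceData  deviceData[NUM_GPUS];
    unsigned short *h_buffer =
        (unsigned short *)malloc((size_t)NUM_GPUS * N * sizeof(unsigned short));

    pthread_barrier_init(&start_barrier, NULL, NUM_GPUS + 1);
    pthread_barrier_init(&done_barrier,  NULL, NUM_GPUS + 1);

    /* One persistent worker per GPU, created once, not once per iteration. */
    for (int d = 0; d < NUM_GPUS; d++) {
        deviceData[d].device = d;
        deviceData[d].h_data = h_buffer + (size_t)d * N;
        pthread_create(&threads[d], NULL, ComputeDevice, &deviceData[d]);
    }

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        /* ... load new data and split it into h_buffer here ... */
        pthread_barrier_wait(&start_barrier);        /* release the workers    */
        pthread_barrier_wait(&done_barrier);         /* wait for their results */
        /* ... consume the results from h_buffer here ... */
    }

    for (int d = 0; d < NUM_GPUS; d++)
        pthread_join(threads[d], NULL);

    free(h_buffer);
    return 0;
}

Compile with something like nvcc multi.cu -o multi -lpthread. Note that the blocking device-to-host cudaMemcpy implicitly waits for the kernel to finish, so no explicit synchronization call is needed before handing results back to the host.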