For( i = 0; i < 500; i++ )
{
    Load in data
    Split data into two 1-D arrays
    Move ONE array to each of the two devices
    Execute kernel on each device
    Move ONE array from each device back to HOST
}
I looked at the SimpleMultiGPU example and came across this problem:
This solution works, but there is MAJOR LAG! Allocating and freeing “d_Input” for each device on every iteration really swamps the processing time.
Ideally, I want to set up a global device memory pointer/location for each device so that I only allocate the memory once and then overwrite it on each pass through the loop. What is the best approach to do this? Or is my line of thinking incorrect? Keep in mind that I need a global device memory location for EACH device, and it needs to be set up before the num_iterations loop begins.
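A minimal sketch of the allocate-once idea, assuming a runtime where a single host thread may switch devices with cudaSetDevice (CUDA 4.0 and later; on older runtimes you would do the same work in one host thread per GPU, as the reply below describes). The kernel, buffer names, and sizes here are illustrative, not from the simpleMultiGPU sample:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS 2
#define N        (1 << 20)   /* elements per device, illustrative */
#define NUM_ITER 500

/* Hypothetical kernel standing in for the real one. */
__global__ void myKernel(float *d_Input, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_Input[i] *= 2.0f;
}

int main(void)
{
    float *d_Input[NUM_GPUS];  /* one persistent device pointer per GPU */
    float *h_Data[NUM_GPUS];

    /* Allocate ONCE, before the iteration loop. */
    for (int dev = 0; dev < NUM_GPUS; dev++) {
        cudaSetDevice(dev);
        cudaMalloc(&d_Input[dev], N * sizeof(float));
        cudaMallocHost(&h_Data[dev], N * sizeof(float)); /* pinned host memory speeds copies */
    }

    for (int iter = 0; iter < NUM_ITER; iter++) {
        /* ... load the data and split it into h_Data[0] and h_Data[1] ... */
        for (int dev = 0; dev < NUM_GPUS; dev++) {
            cudaSetDevice(dev);
            cudaMemcpy(d_Input[dev], h_Data[dev], N * sizeof(float),
                       cudaMemcpyHostToDevice);  /* overwrite in place, no realloc */
            myKernel<<<(N + 255) / 256, 256>>>(d_Input[dev], N);
            cudaMemcpy(h_Data[dev], d_Input[dev], N * sizeof(float),
                       cudaMemcpyDeviceToHost);
        }
    }

    /* Free ONCE, after all iterations. */
    for (int dev = 0; dev < NUM_GPUS; dev++) {
        cudaSetDevice(dev);
        cudaFree(d_Input[dev]);
        cudaFreeHost(h_Data[dev]);
    }
    return 0;
}
```

The only per-iteration work on the device side is the two cudaMemcpy calls and the kernel launch; all allocation cost is paid once up front.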
Well, if you are using multiple GPUs, you need multiple host threads, because the resources allocated for a GPU are only valid within the context of the host thread that allocated them. For multi-GPU work, I currently use pthreads.
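A sketch of that one-host-thread-per-GPU pattern, under the constraint the reply describes (each device context bound to the thread that created it). Each worker allocates its device buffer once, reuses it across all iterations, and frees it at the end; the kernel and all names are illustrative:

```cuda
#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N        (1 << 20)   /* elements per device, illustrative */
#define NUM_ITER 500

/* Hypothetical kernel standing in for the real one. */
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

typedef struct {
    int    device;   /* GPU this worker owns */
    float *h_data;   /* the host half-array assigned to this GPU */
} WorkerArgs;

/* Each worker binds to one GPU, allocates once, loops, then frees. */
static void *worker(void *p)
{
    WorkerArgs *a = (WorkerArgs *)p;
    float *d_Input;

    cudaSetDevice(a->device);                 /* context lives in this thread */
    cudaMalloc(&d_Input, N * sizeof(float));  /* allocate ONCE per thread */

    for (int iter = 0; iter < NUM_ITER; iter++) {
        cudaMemcpy(d_Input, a->h_data, N * sizeof(float), cudaMemcpyHostToDevice);
        myKernel<<<(N + 255) / 256, 256>>>(d_Input, N);
        cudaMemcpy(a->h_data, d_Input, N * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_Input);
    return NULL;
}

int main(void)
{
    pthread_t  t[2];
    WorkerArgs args[2];

    for (int dev = 0; dev < 2; dev++) {
        args[dev].device = dev;
        args[dev].h_data = (float *)malloc(N * sizeof(float));
        pthread_create(&t[dev], NULL, worker, &args[dev]);
    }
    for (int dev = 0; dev < 2; dev++)
        pthread_join(t[dev], NULL);
    return 0;
}
```

Because the device pointer lives for the whole lifetime of its worker thread, this also solves the original allocate-every-iteration problem: the loop inside worker() only copies and launches, never reallocates.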